---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Practice: Read corrupted files

This practical extends the parsing exercises done previously in
`Read/write files` and `functions (very basic)`. You will practice the following:

- write code in scripts,
- use ipython and execute Python programs with the command python3,
- use objects of simple types (numbers, str, list, etc.),
- index and slice,
- use loops and conditions,
- try, except,
- read and write in text files.

We will write scripts that read a file (or a set of files) with a predefined format and
compute simple quantities (sum, average, number) from the values in the files.

```{exercise-start} Parse a file with comments: file_with_comment_col0.txt
---
label: exercise-file-comments
---
```

Unlike in the previous exercises, the file contains comments (*i.e.* lines
starting with a `#`). Adapt your previous script so that it ignores these lines (see
file `file_with_comment_col0.txt`).

:::{admonition} Example of output
```bash
python3 reading_file_with_comment.py
file = ../data/file_with_comment_col0.txt
nb = 100; total = 53.29; avg = 0.53
```
:::

To complicate things further, another file contains comments in the middle of a line
(see *e.g.* `file_with_comment_anywhere.txt`). These comments break the
string-to-float conversion.

Adapt the script `reading_file_with_comment.py` to handle this format.

:::{admonition} Example of output
```bash
python3 step1.1.py
file = "../data/file_with_comment_col0.txt"
nb = 100  ; sum = 53.29  ; avg = 0.53
file = "../data/file_with_comment_anywhere.txt"
nb = 96   ; sum = 51.65  ; avg = 0.54
# total over all files:
nb = 196  ; sum = 104.93 ; avg = 0.54
```
:::

```{exercise-end}
```

```{solution-start}  exercise-file-comments
---
class: dropdown
---
```

```{code-cell}
#!/usr/bin/env python3
"""Compute basic statistics on a file in which each line contains one float
and possibly a comment, either on its own line or after the value.

"""


def compute_stats(file_name):
    """
    computes the statistics of data in file_name

    :param file_name: the name of the file to process
    :type file_name: str
    :return: the statistics
    :rtype: a tuple (number, sum, average)
    """
    sum_ = 0.0
    number = 0
    with open(file_name) as file:
        for line in file:
            if line.startswith("#"):
                continue
            if "#" in line:
                line = line.split("#", 1)[0]
            elem = float(line)
            sum_ += elem
            number += 1

    return number, sum_, sum_ / number


base_path = "../common/data_read_files"
file_names = [
    f"{base_path}/file_with_comment_col0.txt",
    f"{base_path}/file_with_comment_anywhere.txt",
]

numbers = []
sums = []

for file_name in file_names:
    len_file, sum_file, avg_file = compute_stats(file_name)
    numbers.append(len_file)
    sums.append(sum_file)

    print(
        f'file = "{file_name}"\nnb = {len_file:5}; '
        f"sum = {sum_file:7.2f}; avg = {avg_file:5.2f}"
    )


all_sum = sum(sums)
all_numbers = sum(numbers)
all_avg = all_sum / all_numbers
print(
    "# total over all files:\n"
    f"nb = {all_numbers}; sum = {all_sum:.2f}; avg = {all_avg:.2f}"
)
```

```{solution-end}
```
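
The solution above tests for `#` explicitly. As a complement, here is a minimal sketch of the same idea written with `try`/`except`, which also skips blank lines, comment-only lines preceded by spaces, and text that is not a number. The names `parse_value` and `compute_stats_from_lines` and the sample lines are made up for illustration; they are not part of the course files.

```python
def parse_value(line):
    """Return the float found on a line, or None if the line holds no number.

    Anything after the first "#" is treated as a comment and discarded.
    """
    text = line.split("#", 1)[0].strip()
    try:
        return float(text)
    except ValueError:
        # Empty string (blank or comment-only line) or text that is not a number.
        return None


def compute_stats_from_lines(lines):
    """Return (number, sum, average) for an iterable of text lines."""
    number = 0
    sum_ = 0.0
    for line in lines:
        value = parse_value(line)
        if value is None:
            continue
        number += 1
        sum_ += value
    # Avoid a ZeroDivisionError when no line holds a number.
    average = sum_ / number if number else float("nan")
    return number, sum_, average


# Made-up sample lines, for illustration only (not taken from the data files).
sample = [
    "0.5\n",
    "# a full-line comment\n",
    "1.5 # trailing comment\n",
    "  # indented comment\n",
    "\n",
    "oops\n",
]
number, sum_, average = compute_stats_from_lines(sample)
print(f"nb = {number}; sum = {sum_:.2f}; avg = {average:.2f}")
```

Because the function takes any iterable of lines, it can be called with an open file object as well as with a list, for example `compute_stats_from_lines(open(file_name))` inside a `with` block.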

```{exercise-start} Parse more complicated files
---
label: exercise-complicated-files
---
```

As a last exercise, we now have to deal with several columns on each line:

```bash
p1=0.7742 p2=0.74973 p3=0.77751
p1=0.7493 p2=0.34762 p3=0.44521
p1=0.4261 p3=0.88275 p2=0.74016
```

- Write a function that computes statistics separately for p1, p2 and p3.
- *BONUS*: When parsing `file_mut_cols_with_error.txt`, print to the screen the lines
  which contain errors.

:::{admonition} Example of output
```bash
python3 step2.0.py
p1 in ../data/file_mut_cols.txt - nb: 25, sum: 12.72, avg: 0.51
p2 in ../data/file_mut_cols.txt - nb: 25, sum: 12.72, avg: 0.51
p3 in ../data/file_mut_cols.txt - nb: 25, sum: 12.72, avg: 0.51
Unexpected field p7=0.213607026802 at line 23 of ../data/file_mut_cols_with_error.txt
p1 in ../data/file_mut_cols_with_error.txt - nb: 25, sum: 12.82, avg: 0.51
p2 in ../data/file_mut_cols_with_error.txt - nb: 23, sum: 11.35, avg: 0.49
p3 in ../data/file_mut_cols_with_error.txt - nb: 23, sum: 11.69, avg: 0.51
```
:::

```{exercise-end}
```

```{solution-start}  exercise-complicated-files
---
class: dropdown
---
```

```{code-cell}
#!/usr/bin/env python3


def compute_p1p2p3_stats(file_name):
    """
    Computes the statistics of data in a file stored in 3 fields.
    Each field has the form key=val, where key is p1, p2 or p3 and val is a float.

    :param file_name: the name of the file to process
    :type file_name: str
    :return: the statistics for each field, as a tuple of three tuples
        (number, sum, average), one per field p1, p2, p3
    :rtype: tuple
    """
    p1_sum = 0.0
    p2_sum = 0.0
    p3_sum = 0.0

    p1_number = 0
    p2_number = 0
    p3_number = 0
    with open(file_name) as handle:
        for i, line in enumerate(handle):
            if line.startswith("#"):
                continue

            fields = line.strip().split()
            for field in fields:
                if field.startswith("p1="):
                    p1_sum = p1_sum + float(field[3:])
                    p1_number = p1_number + 1
                    continue

                if field.startswith("p2="):
                    p2_sum = p2_sum + float(field[3:])
                    p2_number = p2_number + 1
                    continue

                if field.startswith("p3="):
                    p3_sum = p3_sum + float(field[3:])
                    p3_number = p3_number + 1
                    continue

                print(f"Unexpected field {field} at line {i} of {file_name}")

    return (
        (p1_number, p1_sum, p1_sum / p1_number),
        (p2_number, p2_sum, p2_sum / p2_number),
        (p3_number, p3_sum, p3_sum / p3_number),
    )


base_path = "../common/data_read_files"
file_names = [
    f"{base_path}/file_mut_cols.txt",
    f"{base_path}/file_mut_cols_with_error.txt",
]

for file_name in file_names:
    p1_result, p2_result, p3_result = compute_p1p2p3_stats(file_name)
    print(
        f"p1 in {file_name} - nb: {p1_result[0]}, sum: {p1_result[1]:.2f}, avg: {p1_result[2]:.2f}"
    )
    print(
        f"p2 in {file_name} - nb: {p2_result[0]}, sum: {p2_result[1]:.2f}, avg: {p2_result[2]:.2f}"
    )
    print(
        f"p3 in {file_name} - nb: {p3_result[0]}, sum: {p3_result[1]:.2f}, avg: {p3_result[2]:.2f}"
    )
```

```{solution-end}
```
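
The solution above keeps one pair of variables per field. A possible variant, sketched below, stores the sums and counts in dictionaries so that the same code handles any list of keys, and uses `try`/`except` to report values that cannot be converted. The function name `compute_field_stats`, its `keys` parameter and the last sample line are assumptions made for illustration. This sketch counts lines from 1, whereas the solution above counts from 0.

```python
def compute_field_stats(lines, keys=("p1", "p2", "p3")):
    """Return {key: (number, sum, average)} for lines made of key=value fields."""
    sums = {key: 0.0 for key in keys}
    numbers = {key: 0 for key in keys}
    for line_number, line in enumerate(lines, start=1):
        # Drop a possible comment, then split the rest on whitespace.
        for field in line.split("#", 1)[0].split():
            key, _, value = field.partition("=")
            if key not in sums:
                print(f"Unexpected field {field} at line {line_number}")
                continue
            try:
                number = float(value)
            except ValueError:
                print(f"Invalid value in {field} at line {line_number}")
                continue
            sums[key] += number
            numbers[key] += 1
    return {
        key: (
            numbers[key],
            sums[key],
            # Avoid a ZeroDivisionError for a key that never appears.
            sums[key] / numbers[key] if numbers[key] else float("nan"),
        )
        for key in keys
    }


# The first three lines come from the exercise statement;
# the last one is made up to trigger both error messages.
sample = [
    "p1=0.7742 p2=0.74973 p3=0.77751",
    "p1=0.7493 p2=0.34762 p3=0.44521",
    "p1=0.4261 p3=0.88275 p2=0.74016",
    "p1=0.5 p7=0.2 p2=abc",
]
for key, (number, sum_, average) in compute_field_stats(sample).items():
    print(f"{key} - nb: {number}, sum: {sum_:.2f}, avg: {average:.2f}")
```

Using `str.partition` rather than slicing with `field[3:]` means the key does not need to have a fixed length.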
