Practice: Read corrupted files
This practical is an extension to the parsing exercises done previously in
Read/write files and functions (very basic). You will practice the following:
- write code in scripts,
- use ipython and execute Python programs with the command python3,
- use objects of simple types (numbers, str, list, etc.),
- index and slice,
- use loops and conditions,
- use try and except,
- read and write text files.
We will write scripts that read a file (or a set of files) with a predefined format and compute simple quantities (number of values, sum, average) from the values in the files.
Exercise 22 (Parse a file with comments: file_with_comment_col0.txt)
Unlike in the previous exercises, the file contains comments (i.e. lines
starting with a #). Adapt the previous script so that these lines are ignored (see
the file file_with_comment_col0.txt). You can use the skeleton for this problem:
/common/data_read_files/your_solutions/treat_files_comments.py.
Example of output
python3 reading_file_with_comment.py
file = ../file_with_comment_col0.txt
size = 100; total = 53.29; avg = 0.53
To complicate things further, another file contains comments in the middle of a line
(see e.g. file_with_comment_anywhere.txt). Such a comment makes the conversion of the
whole line from str to float fail.
Adapt the script so that the comments are not taken into account.
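One possible way to ignore a comment placed after the value is to keep only the text before the first #. The sketch below is an illustration, not the official solution: the helper name strip_comment and the sample lines are assumptions made for this example and are not taken from the exercise files.

```python
def strip_comment(line):
    """Return the text located before the first '#', without surrounding spaces."""
    return line.split("#", 1)[0].strip()


# Hypothetical sample lines (not copied from the exercise files).
samples = ["0.25\n", "0.50 # measured twice\n", "# full-line comment\n"]

for raw in samples:
    text = strip_comment(raw)
    if not text:
        # Nothing is left once the comment is removed: skip the line.
        continue
    print(float(text))
```

With this approach a full-line comment becomes an empty string, so it has to be skipped before calling float().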
Example of output
python3 step1.1.py
file = "../file_with_comment_col0.txt"
size = 100 ; sum = 53.29 ; avg = 0.53
file = "../file_with_comment_anywhere.txt"
size = 96 ; sum = 51.65 ; avg = 0.54
# total over all files:
size = 196 ; sum = 104.93 ; avg = 0.54
Solution to Exercise 22 (Parse a file with comments: file_with_comment_col0.txt)
#!/usr/bin/python3
"""computes basic statistics (size, sum and average) on files
containing lines with one float and possibly some comments in
the middle of the line.
"""


def compute_stats(path):
    """
    computes the statistics of data in a file.

    :param path: the name of the file to process
    :type path: str
    :return: the statistics
    :rtype: a tuple (size, sum, average)
    """
    sum_ = 0.0
    size = 0
    with open(path, encoding="utf-8") as file:
        for line in file:
            # skip the lines that are entirely a comment
            if line.startswith("#"):
                continue
            # drop a trailing comment so that float() only sees the number
            if "#" in line:
                line = line.split("#", 1)[0]
            elem = float(line)
            sum_ += elem
            size += 1
    return size, sum_, sum_ / size


file_paths = [
    "../file_with_comment_col0.txt",
    "../file_with_comment_anywhere.txt",
]

sizes = []
sums = []
for file_path in file_paths:
    len_file, sum_file, avg_file = compute_stats(file_path)
    sizes.append(len_file)
    sums.append(sum_file)
    print(
        f'file = "{file_path}"\nsize = {len_file:5}; '
        f"sum = {sum_file:7.2f}; avg = {avg_file:5.2f}"
    )

# statistics over all the files together
all_sum = sum(sums)
all_size = sum(sizes)
all_avg = all_sum / all_size
print(
    "# total over all files:\n"
    f"size = {all_size}; sum = {all_sum:.2f}; avg = {all_avg:.2f}"
)
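The practical lists try and except among the things to practise, while the solution above avoids a failed conversion by removing comments first. As a complementary sketch, the conversion can be wrapped in try/except so that a line that still cannot be converted is reported instead of stopping the script. The function name sum_valid_floats and the demo lines are assumptions made for this example.

```python
def sum_valid_floats(lines):
    """Return (size, total, rejected) for an iterable of text lines.

    ``rejected`` lists the (line_number, text) pairs that float() refused.
    """
    size = 0
    total = 0.0
    rejected = []
    for number, line in enumerate(lines, start=1):
        text = line.split("#", 1)[0].strip()
        if not text:
            continue  # blank line or full-line comment
        try:
            value = float(text)
        except ValueError:
            # Keep track of the faulty line and carry on with the next one.
            rejected.append((number, text))
            continue
        total += value
        size += 1
    return size, total, rejected


# Hypothetical data standing in for a file object returned by open().
demo = ["0.5\n", "# comment\n", "0.25 # inline\n", "abc\n", "\n"]
size, total, rejected = sum_valid_floats(demo)
print(size, total, rejected)
```

Because a file object is itself an iterable of lines, the same function could be called as sum_valid_floats(file) inside a with open(...) block.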
Exercise 23 (Parse more complicated files)
As a last exercise, we now have to deal with several columns on each line, and the fields do not always appear in the same order:
p1=0.7742 p2=0.74973 p3=0.77751
p1=0.7493 p2=0.34762 p3=0.44521
p1=0.4261 p3=0.88275 p2=0.74016
Write a function that computes the statistics separately for p1, p2 and p3.
BONUS: when parsing file_mut_cols_with_error.txt, print to the screen the lines
that contain errors.
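A possible first step is to turn one line into a dictionary, so that the order of the fields no longer matters. This is only a sketch: the helper name parse_fields is an assumption made for this example, and it relies on str.partition to separate each key from its value.

```python
def parse_fields(line):
    """Split a line such as 'p1=0.1 p3=0.3 p2=0.2' into a dict of floats."""
    values = {}
    for field in line.split():
        # partition returns (before, separator, after); the separator is
        # empty when '=' is absent from the field.
        key, sep, raw = field.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in field {field!r}")
        values[key] = float(raw)
    return values


# Third sample line of the exercise, where p3 comes before p2.
print(parse_fields("p1=0.4261 p3=0.88275 p2=0.74016"))
```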
Example of output
python3 step2.0.py
p1 in ../file_mut_cols.txt - size: 25, sum: 12.72, avg: 0.51
p2 in ../file_mut_cols.txt - size: 25, sum: 12.72, avg: 0.51
p3 in ../file_mut_cols.txt - size: 25, sum: 12.72, avg: 0.51
Unexpected field p7=0.213607026802 at line 23 of ../file_mut_cols_with_error.txt
p1 in ../file_mut_cols_with_error.txt - size: 25, sum: 12.82, avg: 0.51
p2 in ../file_mut_cols_with_error.txt - size: 23, sum: 11.35, avg: 0.49
p3 in ../file_mut_cols_with_error.txt - size: 23, sum: 11.69, avg: 0.51
Solution to Exercise 23 (Parse more complicated files)
#!/usr/bin/env python3


def compute_p1p2p3_stats(path):
    """
    Computes the statistics of data in a file stored in 3 fields.

    Each field is of the form key=val where key is p1, p2 or p3, and val is a float.

    :param path: the name of the file to process
    :type path: str
    :return: A tuple containing the statistics for each field under the form
        of a tuple (size, sum, average)
    """
    p1_sum = 0.0
    p2_sum = 0.0
    p3_sum = 0.0
    p1_size = 0
    p2_size = 0
    p3_size = 0
    with open(path, encoding="utf-8") as handle:
        # i is a zero-based line index (the first line of the file is line 0)
        for i, line in enumerate(handle):
            if line.startswith("#"):
                continue
            fields = line.strip().split()
            for field in fields:
                # field[3:] is the text that follows the 3-character prefix "pN="
                if field.startswith("p1="):
                    p1_sum = p1_sum + float(field[3:])
                    p1_size = p1_size + 1
                    continue
                if field.startswith("p2="):
                    p2_sum = p2_sum + float(field[3:])
                    p2_size = p2_size + 1
                    continue
                if field.startswith("p3="):
                    p3_sum = p3_sum + float(field[3:])
                    p3_size = p3_size + 1
                    continue
                # only reached when the field matches none of the expected keys
                print(f"Unexpected field {field} at line {i} of {path}")
    return (
        (p1_size, p1_sum, p1_sum / p1_size),
        (p2_size, p2_sum, p2_sum / p2_size),
        (p3_size, p3_sum, p3_sum / p3_size),
    )


file_paths = [
    "../file_mut_cols.txt",
    "../file_mut_cols_with_error.txt",
]

for file_path in file_paths:
    p1_result, p2_result, p3_result = compute_p1p2p3_stats(file_path)
    print(
        f"p1 in {file_path} - size: {p1_result[0]}, "
        f"sum: {p1_result[1]:.2f}, avg: {p1_result[2]:.2f}"
    )
    print(
        f"p2 in {file_path} - size: {p2_result[0]}, "
        f"sum: {p2_result[1]:.2f}, avg: {p2_result[2]:.2f}"
    )
    print(
        f"p3 in {file_path} - size: {p3_result[0]}, "
        f"sum: {p3_result[1]:.2f}, avg: {p3_result[2]:.2f}"
    )
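The solution above keeps one pair of variables per field. As a possible variation, the counters can be stored in dictionaries keyed by field name, which removes the three near-identical branches. This is a sketch under assumptions: the name compute_field_stats, the expected parameter and the demo lines are invented for the example, and the function takes an iterable of lines rather than a path so that it can be tried without the data files. As in the solution, line indices are zero-based.

```python
def compute_field_stats(lines, expected=("p1", "p2", "p3")):
    """Return ({key: (size, total, average)}, error_messages)."""
    sizes = {key: 0 for key in expected}
    totals = {key: 0.0 for key in expected}
    errors = []
    for index, line in enumerate(lines):
        if line.startswith("#"):
            continue
        for field in line.split():
            key, _, raw = field.partition("=")
            if key not in sizes:
                errors.append(f"Unexpected field {field} at line {index}")
                continue
            totals[key] += float(raw)
            sizes[key] += 1
    stats = {}
    for key in expected:
        # Guard against a field that never appears (avoids ZeroDivisionError).
        average = totals[key] / sizes[key] if sizes[key] else float("nan")
        stats[key] = (sizes[key], totals[key], average)
    return stats, errors


# Hypothetical lines; the second one carries an unknown key, as in the bonus.
demo = ["p1=0.5 p2=0.25 p3=1.0\n", "p1=1.5 p3=2.0 p7=0.1\n"]
stats, errors = compute_field_stats(demo)
for key, (size, total, average) in stats.items():
    print(f"{key} - size: {size}, sum: {total:.2f}, avg: {average:.2f}")
for message in errors:
    print(message)
```

Returning the error messages instead of printing them inside the function is a design choice of this sketch: it leaves the caller free to decide how to report them.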