More data structures#
Education objectives
hash table
setnotion of complexity
dict
standard type set#
Unordered collections of unique elements. Sets are mutable. Not all elements can be part of a set, they must be hashable, a term we will define later.
s0 = set()
{1, 1, 1, 3}
{1, 3}
set([1, 1, 1, 3])
{1, 3}
s1 = {1, 2}
s2 = {2, 3}
print(s1.intersection(s2))
print(s1.union(s2))
{2}
{1, 2, 3}
set: lookup#
Lookup of an element in the set is algorithmically efficient (complexity O(1)), i.e. faster than a look up in a list or a tuple (complexity O(size iterable)). Consider the following example where we look for the last element of a list and a set:
size = 10000
large_list = list(range(size))
large_set = set(large_list)
%timeit size-1 in large_list
%timeit size-1 in large_set
115 μs ± 5.3 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
79.4 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Back to the “find the removed element” problem#
from random import shuffle, randint
size = 20
number = randint(0, size - 1)
print("integer remove from the list:", number)
numbers = list(range(size))
numbers.remove(number)
shuffle(numbers)
print("shuffled list: ", numbers)
integer remove from the list: 0
shuffled list: [14, 5, 6, 4, 1, 18, 15, 3, 8, 2, 17, 19, 9, 7, 11, 13, 12, 16, 10]
Could the problem be solved using set ?
What is the complexity of this solution ?
A possible solution :
full_set = set(range(size))
changed_set = set(numbers)
ns = full_set - changed_set
ns.pop()
0
Complexity#
line 1: n insertions –> O(n)
line 2 : n insertions –> O(n)
line 3: one traversal O(n), with one lookup at each time (O(1) -> O(n)
=> Complexity of the whole algorithm : O(n)
Complexity of the “sum” solution : one traversal for the computation of the sum O(n) with sum at each step O(1) => O(n).
What is a hashtable?#
To be in a set, an object must be hashable, which means a function (called the hash function) is able to assign an index (i.e. a number) to this object. In the lookup example, we look for the presence of an element in a set. To do that, the hash function assign an index to this object and look if the index is present in the set. A set is simply a hash table. https://en.wikipedia.org/wiki/Hash_table
Some elements are not hashable, like list because they are mutable. Imagine you want to
make a set of unique lists:
wrong_set = set(([1, 2], [2, 3], [1, 2]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 wrong_set = set(([1, 2], [2, 3], [1, 2]))
TypeError: unhashable type: 'list'
Iteration on set is non-reproducible#
A set is an unordered sequence. The unordered nature can lead to unexpected and
non-reproducible result. Let us create a set containing the first letters of the
alphabet.
s_letters = set("abcde")
s_letters
{'a', 'b', 'c', 'd', 'e'}
Now, let’s see what happens when we inter on the set. We can just do that by creating a list from it.
list(s_letters)
['b', 'e', 'c', 'd', 'a']
Is the list ordered? If it is not, it’s surprising but normal: the iteration order is non-deterministic
Worse, exactly the same code can lead to different outputs!
!python3 -c "print(list(set('abcde')))"
['b', 'e', 'd', 'c', 'a']
!python3 -c "print(list(set('abcde')))"
['d', 'a', 'c', 'e', 'b']
!python3 -c "print(list(set('abcde')))"
['d', 'a', 'c', 'e', 'b']
When a program iterates over a set, the iteration order is non-deterministic, making the
program’s behavior non-reproducible. Since reproducibility is often a desirable
characteristic, one solution is to use the built-in sorted() function to convert the
set to a sorted list with consistent ordering. However, this comes at the cost of sorting
performance.
!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']
!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']
!python3 -c "print(sorted(set('abcde')))"
['a', 'b', 'c', 'd', 'e']
dict: set of “key: value” pairs#
The dictionary (dict) is a very important data structure in Python. All namespaces are
(nearly) dictionaries and “Namespaces are one honking great idea – let’s do more of
those!” (The zen of Python).
A dict is a hashtable (a set) + associated values.
my_dict = {}
my_dict["b"] = 2
my_dict["a"] = 1
print(my_dict)
{'b': 2, 'a': 1}
my_dict = {"a": 1, "b": 2, 0: False, 1: True}
print(my_dict)
{'a': 1, 'b': 2, 0: False, 1: True}
Tip: parallel between dict and list#
You can first think about dict as a super list which can be indexed with other
objects than integers (and in particular with str).
my_list = ["value0", "value1"]
my_list.append("value2")
print(my_list)
['value0', 'value1', 'value2']
my_list[1]
'value1'
my_dict = {"key0": "value0", "key1": "value1"}
my_dict["key2"] = "value2"
print(my_dict)
{'key0': 'value0', 'key1': 'value1', 'key2': 'value2'}
my_dict["key1"]
'value1'
dict: public methods#
# dict have 11 public methods (with a list comprehension)
print([name for name in dir(my_dict) if not name.startswith("__")])
['clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
dict: different ways to loops over a dictionary#
# loop with items
for key, value in my_dict.items():
print(key, value)
key0 value0
key1 value1
key2 value2
# loop with values
for value in my_dict.values():
print(value)
value0
value1
value2
# loop with keys
for key in my_dict.keys():
print(key)
key0
key1
key2
Note
Note that (since Python 3.6) the iteration order for a dict is deterministic (in
contrast as for set). dict conserves the insertion order.
for key, value in {2: 2, 1: 1, 3: 3}.items():
print(key, value)
2 2
1 1
3 3
Dict comprehension#
grades = {"Albert": 19, "Martin": 10, "Sue": 17}
# Add 1 to everyone
new_grades = {name: grade + 1 for name, grade in grades.items()}
print(new_grades)
{'Albert': 20, 'Martin': 11, 'Sue': 18}
Exercise 15
Write a function that returns a dictionary containing the number of occurrences of letters in a text.
text = "abbbcc"
Solution to Exercise 15
text = "abbbcc"
def count_elem(sequence):
"""returns a dictionary with the number of occurrences
of each element in the sequence
"""
frequencies = {}
for letter in sequence:
if letter not in frequencies:
frequencies[letter] = 1
else:
frequencies[letter] += 1
return frequencies
print(f"Text to analyze: {text} \nFreq of each letter: {count_elem(text)}")
# other solution, with count(), set and dictionary comprehension:
def count_elem_with_comprehension(sequence):
"""returns a dictionary with the number of occurrences
of each element in the sequence
"""
vocabulary = set(sequence)
return {letter: sequence.count(letter) for letter in vocabulary}
test_text = "abbbcc"
test_dict = count_elem_with_comprehension(test_text)
print(test_dict)
assert test_dict["a"] == 1
assert test_dict["b"] == 3
assert test_dict["c"] == 2
Text to analyze: abbbcc
Freq of each letter: {'a': 1, 'b': 3, 'c': 2}
{'c': 2, 'b': 3, 'a': 1}