More data structures

More data structures#

Education objectives

hash table
set
notion of complexity
dict

standard type `set`#

Unordered collections of unique elements. Sets are mutable. Not all elements can be part of a set, they must be hashable, a term we will define later.

s0 = set()

{1, 1, 1, 3}

{1, 3}

set([1, 1, 1, 3])

{1, 3}

s1 = {1, 2}
s2 = {2, 3}
print(s1.intersection(s2))
print(s1.union(s2))

{2}
{1, 2, 3}

`set`: lookup#

Lookup of an element in the set is algorithmically efficient (complexity O(1)), i.e. faster than a look up in a list or a tuple (complexity O(size iterable)). Consider the following example where we look for the last element of a list and a set:

size = 10000
large_list = list(range(size))
large_set = set(large_list)

%timeit size-1 in large_list
%timeit size-1 in large_set

128 μs ± 1.03 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
74.4 ns ± 0.689 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Back to the “find the removed element” problem#

from random import shuffle, randint

size = 20
number = randint(0, size - 1)
print("integer remove from the list:", number)
numbers = list(range(size))
numbers.remove(number)
shuffle(numbers)
print("shuffled list: ", numbers)

integer remove from the list: 15
shuffled list:  [19, 12, 11, 6, 3, 16, 17, 18, 9, 5, 0, 7, 14, 13, 1, 4, 8, 2, 10]

Could the problem be solved using set ?
What is the complexity of this solution ?

A possible solution :

full_set = set(range(size))
changed_set = set(numbers)
ns = full_set - changed_set
ns.pop()

Complexity#

line 1: n insertions –> O(n)
line 2 : n insertions –> O(n)
line 3: one traversal O(n), with one lookup at each time (O(1) -> O(n)

=> Complexity of the whole algorithm : O(n)

Complexity of the “sum” solution : one traversal for the computation of the sum O(n) with sum at each step O(1) => O(n).

What is a hashtable?#

To be in a set, an object must be hashable, which means a function (called the hash function) is able to assign an index (i.e. a number) to this object. In the lookup example, we look for the presence of an element in a set. To do that, the hash function assign an index to this object and look if the index is present in the set. A set is simply a hash table. https://en.wikipedia.org/wiki/Hash_table

Some elements are not hashable, like list because they are mutable. Imagine you want to make a set of unique lists:

wrong_set = set(([1, 2], [2, 3], [1, 2]))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 wrong_set = set(([1, 2], [2, 3], [1, 2]))

TypeError: unhashable type: 'list'

Iteration on `set` is non-reproducible#

A set is an unordered sequence. The unordered nature can lead to unexpected and non-reproducible result. Let us create a set containing the first letters of the alphabet.

s_letters = set("abcde")
s_letters

{'a', 'b', 'c', 'd', 'e'}

Now, let’s see what happens when we inter on the set. We can just do that by creating a list from it.

list(s_letters)

['a', 'e', 'b', 'c', 'd']

Is the list ordered? If it is not, it’s surprising but normal: the iteration order is non-deterministic

Worse, exactly the same code can lead to different outputs!

!python3 -c "print(list(set('abcde')))"

['c', 'e', 'd', 'b', 'a']

!python3 -c "print(list(set('abcde')))"

['d', 'c', 'b', 'e', 'a']

!python3 -c "print(list(set('abcde')))"

['d', 'b', 'c', 'a', 'e']

When a program iterates over a set, the iteration order is non-deterministic, making the program’s behavior non-reproducible. Since reproducibility is often a desirable characteristic, one solution is to use the built-in sorted() function to convert the set to a sorted list with consistent ordering. However, this comes at the cost of sorting performance.

!python3 -c "print(sorted(set('abcde')))"

['a', 'b', 'c', 'd', 'e']

!python3 -c "print(sorted(set('abcde')))"

['a', 'b', 'c', 'd', 'e']

!python3 -c "print(sorted(set('abcde')))"

['a', 'b', 'c', 'd', 'e']

`dict`: set of “key: value” pairs#

The dictionary (dict) is a very important data structure in Python. All namespaces are (nearly) dictionaries and “Namespaces are one honking great idea – let’s do more of those!” (The zen of Python).

A dict is a hashtable (a set) + associated values.

my_dict = {}
my_dict["b"] = 2
my_dict["a"] = 1
print(my_dict)

{'b': 2, 'a': 1}

my_dict = {"a": 1, "b": 2, 0: False, 1: True}
print(my_dict)

{'a': 1, 'b': 2, 0: False, 1: True}

Tip: parallel between `dict` and `list`#

You can first think about dict as a super list which can be indexed with other objects than integers (and in particular with str).

my_list = ["value0", "value1"]
my_list.append("value2")
print(my_list)

['value0', 'value1', 'value2']

my_list[1]

'value1'

my_dict = {"key0": "value0", "key1": "value1"}
my_dict["key2"] = "value2"
print(my_dict)

{'key0': 'value0', 'key1': 'value1', 'key2': 'value2'}

my_dict["key1"]

'value1'

`dict`: public methods#

# dict have 11 public methods (with a list comprehension)
print([name for name in dir(my_dict) if not name.startswith("__")])

['clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

`dict`: different ways to loops over a dictionary#

# loop with items
for key, value in my_dict.items():
    print(key, value)

key0 value0
key1 value1
key2 value2

# loop with values
for value in my_dict.values():
    print(value)

value0
value1
value2

# loop with keys
for key in my_dict.keys():
    print(key)

key0
key1
key2

Note

Note that (since Python 3.6) the iteration order for a dict is deterministic (in contrast as for set). dict conserves the insertion order.

for key, value in {2: 2, 1: 1, 3: 3}.items():
    print(key, value)

2
1
3

Dict comprehension#

grades = {"Albert": 19, "Martin": 10, "Sue": 17}
# Add 1 to everyone
new_grades = {name: grade + 1 for name, grade in grades.items()}
print(new_grades)

{'Albert': 20, 'Martin': 11, 'Sue': 18}

Exercise 16

Write a function that returns a dictionary containing the number of occurrences of letters in a text.

text = "abbbcc"

Solution to Exercise 16

text = "abbbcc"


def count_elem(sequence):
    """returns a dictionary with the number of occurrences
    of each element in the sequence
    """

    frequencies = {}

    for letter in sequence:
        if letter not in frequencies:
            frequencies[letter] = 1
        else:
            frequencies[letter] += 1
    return frequencies


print(f"Text to analyze: {text} \nFreq of each letter: {count_elem(text)}")

# other solution, with count(), set and dictionary comprehension:


def count_elem_with_comprehension(sequence):
    """returns a dictionary with the number of occurrences
    of each element in the sequence
    """

    vocabulary = set(sequence)
    return {letter: sequence.count(letter) for letter in vocabulary}


test_text = "abbbcc"
test_dict = count_elem_with_comprehension(test_text)
print(test_dict)

assert test_dict["a"] == 1
assert test_dict["b"] == 3
assert test_dict["c"] == 2

Text to analyze: abbbcc 
Freq of each letter: {'a': 1, 'b': 3, 'c': 2}
{'c': 2, 'a': 1, 'b': 3}

More data structures

Contents

More data structures#

standard type set#

set: lookup#