Strings, bytes and immutability#

Learning objectives

  • str, bytes

  • immutability

  • methods

  • sequence, indexing and slicing

  • string formatting and f-strings

str, a type for text#

s = "hello"
# fmt: off
s = 'hello'
# fmt: on

s = (
    "How is it possible to write a very very "
    "very long string with lines limited to 79 characters?"
)

s = """Strings on
more than
one line.
"""
print(s)
Strings on
more than
one line.

Text handling is a major difference between Python 2 and Python 3.

In Python 3, str objects hold Unicode text, and a separate type, bytes, holds binary data.
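
As a minimal illustration of this distinction (the variable names are chosen only for this example), a str is made of Unicode characters, while a bytes literal, written with a b prefix, holds raw byte values:

```python
# A str holds Unicode text; non-ASCII characters need no special treatment.
text = "café"
# A bytes literal uses the b prefix and may only contain ASCII characters.
data = b"cafe"

print(type(text))
print(type(data))

# ord gives the Unicode code point of a one-character string,
# chr converts a code point back to a string.
code_point = ord("é")
print(code_point, chr(code_point))
```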

Methods of the type str#

Objects of built-in types have methods associated with their type (object-oriented programming). The built-in function dir returns a list of the names of an object's attributes. For a string, these attributes are Python special attributes (whose names start and end with double underscores) and several public methods:

# create a nicer "pretty print" function
from pprint import pprint
from functools import partial

pprint = partial(pprint, width=82, compact=True)
s = "abcdef"
pprint(dir(s))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',
 '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
 '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__',
 '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__',
 '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__',
 '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode',
 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower',
 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust',
 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix',
 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper',
 'zfill']

Let’s hide the internal variables starting and ending with __ (for now you don’t need to understand the code used for that).

pprint([name_attr for name_attr in dir(s) if not name_attr.startswith("__")])
['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs',
 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii',
 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable',
 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans',
 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex',
 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

To access an attribute of an object (here, the method str.startswith), we use the dot:

s.startswith("a")
True

To access the documentation on this attribute in IPython or Jupyter, append a question mark:

s.startswith?
Docstring:
S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Type:      builtin_function_or_method
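
To make the list of public methods more concrete, here is a small sketch that chains a few of them (strip, replace, split and join). The sample string is made up for this example. Outside IPython, where the question mark syntax is not available, the built-in help function shows the same docstring.

```python
s = "  Strings, bytes and immutability  "

# strip removes leading and trailing whitespace and returns a new string
stripped = s.strip()
print(stripped)

# replace removes the comma, then split cuts the string at each space
words = stripped.replace(",", "").split(" ")
print(words)

# join is called on the separator and glues the pieces back together
joined = "-".join(words)
print(joined)

# In plain Python (outside IPython), help displays the docstring
help(str.startswith)
```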

The method str.format#

Docstring:
S.format(*args, **kwargs) -> str

Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
value = 1.23456789
"This represents {} eqCO2".format(value)
'This represents 1.23456789 eqCO2'
"This represents {:.4f} eqCO2".format(value)
'This represents 1.2346 eqCO2'

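Use > together with a width to right-align values, so that numbers of different sizes line up:

value2 = 12345.6
print("This represents {:>10.4f} eqCO2".format(value))
print("This represents {:>10.4f} eqCO2".format(value2))
This represents     1.2346 eqCO2
This represents 12345.6000 eqCO2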
"This represents {:.4e} eqCO2 (scientific notation)".format(value)
'This represents 1.2346e+00 eqCO2 (scientific notation)'
print("{}\t{}\t{}".format(1, 2, 3))
1	2	3
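
The docstring mentions substitutions from both args and kwargs. As a short sketch with made-up values, replacement fields can refer to positional arguments by index or to keyword arguments by name:

```python
# Fields can refer to positional arguments by index (and reuse them)...
line = "{0} is about {1:.2f} ({0} again)".format("pi", 3.14159)
print(line)

# ...or to keyword arguments by name, combined with a format specification
template = "{gas}: {amount:>8.1f} {unit}"
report = template.format(gas="CO2", amount=12.345, unit="kg")
print(report)
```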

New in Python 3.6: formatted string literals (f-strings)#

value = 1.23456789
f"This represents {value} eqCO2"
'This represents 1.23456789 eqCO2'
f"This represents {value:.4f} eqCO2"
'This represents 1.2346 eqCO2'
f"This represents {value:8.4f} eqCO2"
'This represents   1.2346 eqCO2'
f"This represents {value:.4e} eqCO2 (scientific notation)"
'This represents 1.2346e+00 eqCO2 (scientific notation)'
print(f"{1}\t{1 + 1}\t{2 + 1}")
1	2	3
f"{value = }, {2 * value = :.3f}"
'value = 1.23456789, 2 * value = 2.469'

Strings are immutable “sequences”.#

  • We can test whether a string contains a substring with the keywords in and not in:

s = "abcdef"
print("a" in s)
print("hello" not in s)
True
True
  • We can get an element of a string (index starts from 0):

print(s[0])
a
  • Since strings are immutable, they cannot be modified in place. If we try, we get an error:

s[0] = "b"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 s[0] = "b"

TypeError: 'str' object does not support item assignment
  • Since strings are sequences, they can be “sliced” (we will soon study this powerful notation in detail):

s[1:3]
'bc'
  • it is very simple to manipulate strings in many ways:

print((s.capitalize() + " " + s.upper() + "\n") * 4)
Abcdef ABCDEF
Abcdef ABCDEF
Abcdef ABCDEF
Abcdef ABCDEF
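
Because a string cannot be changed in place, the usual approach is to build a new string instead. A minimal sketch, reusing the string s from above:

```python
s = "abcdef"

# Build a new string from a replacement character and a slice of the old one
t = "b" + s[1:]
print(t)

# Methods such as replace also return a new string
u = s.replace("a", "b")
print(u)

# The original string is left untouched
print(s)
```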

Byte strings for binary data#

Strings are designed to store character data. In essence, they are sequences of Unicode characters (also known as code points). This makes them well suited to storing human-readable text in any language.

However, you may need to work with raw data, as a sequence of bytes. In Python 3, the separate bytes type is meant for this kind of data.

From str to bytes and vice versa#

You can create a byte string from a character string with the bytes(string, encoding) constructor. You have to specify a text encoding because the same characters may be encoded in multiple ways.

# A message containing non-English characters
message = "Ça sent le sucré salé"

# Most text is UTF-8 encoded nowadays
print(f"UTF-8: {bytes(message, 'utf-8')}")

# You can also use the str's "encode" method
encoded_message_utf8 = message.encode("utf-8")
# In the olden days, other encodings ran rampant, like latin1
encoded_message_latin1 = message.encode("latin1")

print(f"Latin1: {encoded_message_latin1}")
UTF-8: b'\xc3\x87a sent le sucr\xc3\xa9 sal\xc3\xa9'
Latin1: b'\xc7a sent le sucr\xe9 sal\xe9'

You can also create a character string from a byte string with the bytes.decode(encoding) method.

# We must use the same encoding that was used to encode the message
print(encoded_message_utf8.decode("utf-8"))
Ça sent le sucré salé
# Otherwise, things will not go well
print(encoded_message_latin1.decode("utf-8"))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[25], line 2
      1 # Otherwise, things will not go well
----> 2 print(encoded_message_latin1.decode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: invalid continuation byte
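
Two related behaviours are worth knowing, sketched here with the same message. The decode method accepts an optional errors argument, and decoding with the wrong encoding does not always raise an error: it can also silently produce garbled text.

```python
message = "Ça sent le sucré salé"
encoded_message_latin1 = message.encode("latin1")

# errors="replace" substitutes U+FFFD (the replacement character) for
# undecodable bytes instead of raising UnicodeDecodeError
lossy = encoded_message_latin1.decode("utf-8", errors="replace")
print(lossy)

# Every byte is valid in latin1, so decoding UTF-8 data as latin1
# raises no error but yields garbled text
garbled = message.encode("utf-8").decode("latin1")
print(garbled)
```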

Differences between str and bytes#

The main difference between the two sorts of strings is the type of object that they contain.

The lengths may differ: len counts characters for a str and bytes for a bytes object.

# Counts the characters
print(f"Length of text : {len(message)}")
# Some characters are encoded as multiple bytes
print(f"Length of bytes : {len(encoded_message_utf8)}")
Length of text : 21
Length of bytes : 24

Elements are not of the same type. bytes elements are integers, whereas str elements are str (Python has no native notion of “character”).

# The 21st element (index 20) of each sequence
print(f"21st character : {message[20]} (type {type(message[20])})")
print(f"21st byte : {encoded_message_utf8[20]} (type {type(encoded_message_utf8[20])})")
21st character : é (type <class 'str'>)
21st byte : 97 (type <class 'int'>)
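
The same difference shows up when iterating and slicing. A short sketch with an example word chosen for this illustration:

```python
data = "sucré".encode("utf-8")

# Iterating over bytes yields integers between 0 and 255
values = list(data)
print(values)

# Slicing, unlike indexing, returns a bytes object
head = data[0:3]
print(head)

# A bytes object can be rebuilt from an iterable of integers
rebuilt = bytes(values)
print(rebuilt == data)
```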
