Python Strings

String Basics

A Python string, like 'Hello', stores text as a sequence of individual characters. Text is central to many computations: URLs, chat messages, and the underlying HTML code that makes up web pages.

Python strings are written between single quote marks like 'Hello', or alternatively between double quote marks like "There". The double-quote form is handy when the text itself contains a single quote, as in "isn't".

a = 'Hello'
b = "isn't"

Each character in a string is drawn from the unicode character set, which includes the "characters" of pretty much every language on earth, plus many emojis. See the Unicode Characters section below for more information.

String len()

The len() function returns the length of a string, the number of chars in it. It is valid to have a string of zero characters, written just as '', called the "empty string". The length of the empty string is 0. The len() function in Python is omnipresent - it's used to retrieve the length of every data type, with string just a first example.

>>> s = 'Python'
>>> len(s)
6
>>> len('')   # empty string
0

Convert Between Int and String

The formal name of the string type in Python is "str". The str() function serves to convert many values to a string form. This code computes the str form of the number 123:

>>> str(123)
'123'

Looking carefully at the values, 123 is a number, while '123' is a string of length 3, made of the three chars '1', '2', and '3'.

Going the other direction, the formal name of the integer type is "int", and the int() function takes in a value and tries to convert it to be an int value:

>>> int('1234')
1234
>>> int('xx1234')   # fails due to extra chars
ValueError: invalid literal for int() with base 10: 'xx1234'
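Since int() raises a ValueError on text that is not a valid number, code that converts text of uncertain quality often wraps the call in try/except. The following is a minimal sketch of that idea; the helper name parse_int and its default parameter are made up for illustration and are not part of Python.

```python
def parse_int(text, default=None):
    """Return int(text), or default if text is not a valid int.

    parse_int is a hypothetical helper, not a built-in function.
    """
    try:
        return int(text)
    except ValueError:
        # int() could not make sense of the text
        return default


print(parse_int('1234'))     # valid digits convert to the int 1234
print(parse_int('xx1234'))   # invalid text yields the default, None
```

Returning a default instead of crashing is only one possible design; letting the ValueError propagate is often the better choice when bad input indicates a real bug.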

String Indexing [ ]

Chars in a string are numbered with zero-based indexing, so the first char is at index 0, the next index 1, and the last char is at index len-1. Access the individual characters using square brackets, e.g. s[0] is the first char.

alt: string 'Python' shown with index numbers 0..5

>>> s = 'Python'
>>> len(s)
6
>>> s[0]   # Access char at index 0
'P'
>>> s[1]
'y'
>>> s[5]
'n'
>>> s[6]
IndexError: string index out of range
>>> s[0] = 'x'   # no, string is immutable
TypeError: 'str' object does not support item assignment

Accessing a too large index number is an error. Strings are immutable, so the chars in a string can never be changed. Instead, code can create a new string with different chars, as with + in the next section.

String +

The + operator combines (aka "concatenates") two strings to make a bigger string. This creates a new string to represent the result, leaving the original strings unchanged. (See the Working With Immutable section below.)

>>> s1 = 'Hello'
>>> s2 = 'There'
>>> s3 = s1 + ' ' + s2
>>> s3
'Hello There'
>>> s1
'Hello'

Concatenation with + only works between strings; it cannot, for example, combine a string and an int. Call the str() function to make a string out of the int, and then concatenation works.

>>> 'score:' + 6
TypeError: can only concatenate str (not "int") to str
>>> 'score:' + str(6)
'score:6'
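As a small sketch of the str()-then-concatenate pattern inside a function, the code below builds a label from a name and an int score. The function name score_line is made up for illustration.

```python
def score_line(name, score):
    """Return text like 'Sally: 6'. score_line is a hypothetical helper."""
    # str() turns the int into a string so that + can join the pieces
    return name + ': ' + str(score)


line = score_line('Sally', 6)
print(line)
```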

String Functions

Here are the most commonly used string functions.

String in

The in operator checks, True or False, if something appears anywhere in a string. In this and other string comparisons, characters must match exactly, so 'a' matches 'a' but does not match 'A'. (Mnemonic: this is the same word "in" as used in the for-loop.)

>>> 'b' in 'abcd'
True
>>> 'B' in 'abcd'
False
>>> 'aa'  in 'iiaaii'  # test string can be any length
True
>>> 'aaa' in 'iiaaii'
False
>>> '' in 'abcd'       # empty string: in is always True
True

Character Tests: s.isalpha() s.isdigit() s.isspace()

The characters that make up a string can be divided into several categories or "character classes":

alt: divide chars into alpha (lower/upper), digit, space, and misc leftovers

alphabetic chars - e.g. 'abcXYZ' used to write words. Alphabetic chars are further divided into upper and lowercase versions (the details depend on the particular unicode alphabet).

digit chars - e.g. '0' '1' .. '9' to write numbers

space chars - e.g. space ' ' newline '\n' and tab '\t'

Then there are all the other miscellaneous characters like '$' '^' '<' which are not alphabetic, digit, or space.

These test functions return True if all the chars in s are in the given class:

s.isalpha() - True for alphabetic "word" characters like 'abcXYZ' (applies to alphabetic chars in other unicode alphabets too like 'Σ')

s.isdigit() - True if all chars in s are digits '0..9'

s.isspace() - True for whitespace char, e.g. space, tab, newline

s.isupper(), s.islower() - True for uppercase / lowercase alphabetic chars. False for other characters like '4' and '$' which do not have upper/lower versions.

>>> 'a'.isalpha()
True
>>> '$'.isalpha()
False
>>> 'a'.islower()
True
>>> 'a'.isupper()
False
>>> s = '\u03A3'  # Unicode Sigma char
>>> s
'Σ'
>>> s.isalpha()
True
>>> '6'.isdigit()
True
>>> 'a'.isdigit()
False
>>> '$'.islower()
False
>>> ' '.isspace()
True
>>> '\n'.isspace()
True
>>> ''.isalpha()  # empty string is False for these tests
False

Unicode aside: In the roman a-z alphabet, all alphabetic chars have upper/lower versions. In some alphabets, there are chars which are alphabetic, but which are not classed as upper or lower.
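The character classes above can be sketched as a function that reports which class a single char falls into. The function name classify and the returned labels are made up for illustration.

```python
def classify(ch):
    """Return the class name of the single char ch.

    classify is a hypothetical helper; the labels are arbitrary.
    """
    if ch.isalpha():
        return 'alpha'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'misc'   # leftovers like '$' and '^'


for ch in 'a1 $':
    print(repr(ch), classify(ch))
```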

Change Case s.upper() s.lower()

s.lower() - returns a new version of s where each char is converted to its lowercase form, so 'A' becomes 'a'. Chars like '$' are returned unchanged. The original s is unchanged, a good example of strings being immutable. (See the Working With Immutable section below.) Each unicode alphabet includes its own rules about upper/lower case.

s.upper() - returns an uppercase version of s

>>> s = 'Python123'
>>> s.lower()
'python123'
>>> s.upper()
'PYTHON123'
>>> s
'Python123'
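Because string comparisons are case-sensitive, a common pattern is to convert both strings to lowercase before comparing them. A minimal sketch, with the helper name contains_ignore_case made up for illustration:

```python
def contains_ignore_case(text, target):
    """Return True if target appears in text, ignoring upper/lower case.

    contains_ignore_case is a hypothetical helper, not a built-in.
    """
    # Compare lowercase copies; the original strings are not changed
    return target.lower() in text.lower()


print(contains_ignore_case('Hello There', 'THERE'))
print(contains_ignore_case('Hello There', 'xyz'))
```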

Test s.startswith() s.endswith()

These functions return True/False depending on what appears at one end of a string. They are handy when you need to check for something at an end, e.g. whether a filename ends with '.html'. Style aside: these are a good example of well-named functions, making the code where they are called very readable.

s.startswith(x) - True if s starts with string x

s.endswith(x) - True if s ends with string x

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True
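As a sketch of the filename use case, this loop picks the '.html' names out of a list. The list of filenames is made-up sample data.

```python
# Made-up sample data for illustration
filenames = ['resume.html', 'notes.txt', 'index.html', 'photo.jpg']

html_files = []
for name in filenames:
    # Keep only the names that end with the .html extension
    if name.endswith('.html'):
        html_files.append(name)

print(html_files)
```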

Search s.find()

s.find(x) - searches s left to right, returning the int index where string x appears, or -1 if not found. Use s.find() to compute the index where a substring first appears.

>>> s = 'Python'
>>> s.find('y')
1
>>> s.find('tho')
2
>>> s.find('xx')
-1

The in test just reports True/False if a string is in there. The find() function tells you where it is.

There are some more rarely used variations of s.find(): s.find(x, start_index) - which begins the search at the given index instead of at 0; s.rfind(x) does the search right-to-left from the end of the string.
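The start_index variation is what makes it possible to find every occurrence of a substring, not just the first: after each hit, search again starting one past it. A minimal sketch, with the function name find_all made up for illustration:

```python
def find_all(s, x):
    """Return a list of every index where x appears in s.

    find_all is a hypothetical helper built on s.find(x, start_index).
    """
    indexes = []
    i = s.find(x)
    while i != -1:
        indexes.append(i)
        # Resume the search just past the previous hit
        i = s.find(x, i + 1)
    return indexes


print(find_all('this is it', 'is'))
print('this is it'.rfind('is'))   # rfind() reports the rightmost hit
```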

Strip Whitespace s.strip()

s.strip() - returns a version of s with the whitespace characters (space, tab, newline) at the very start and very end of the string all removed. Handy for cleaning up strings parsed out of a file or read from a user in a dialog box.

>>> '   hi there  \n'.strip() 
'hi there'

String s.replace()

s.replace(old, new) - returns a version of s where all occurrences of old have been replaced by new. Does not pay attention to word boundaries, just replaces every instance of old in s. Replacing with the empty string effectively deletes the matching strings.

>>> 'this is it'.replace('is', 'xxx')
'thxxx xxx it'
>>> 'this is it'.replace('is', '')
'th  it'

Working With Immutable x = change(x)

Strings are "immutable", meaning the chars in a string never change. Instead of changing a string, code creates new strings.

Suppose we have a string, and want to change it to uppercase and add an exclamation mark at its end, so 'Hello' becomes 'HELLO!'.

This code looks like it will work, but it does not:

>>> s = 'Hello'
>>> s.upper()  # compute upper, but does not store it
'HELLO'
>>> s          # s is not changed
'Hello'

The correct form computes the uppercase form, and also stores it back in the s variable, a sort of x = change(x) pattern:

>>> s = 'Hello'
>>> s = s.upper()  # compute upper, store in s
>>> s = s + '!'    # add !, store in s
>>> s              # s is the new, computed string
'HELLO!'
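The same x = change(x) pattern works inside a function. In this sketch (the function name shout is made up), each step stores its result back in s, and the caller's original string is left untouched.

```python
def shout(s):
    """Return an uppercase version of s with '!' added at the end.

    shout is a hypothetical helper for illustration.
    """
    s = s.upper()   # compute upper, store in s
    s = s + '!'     # add !, store in s
    return s


original = 'Hello'
result = shout(original)
print(result)     # the new, computed string
print(original)   # the original string is unchanged
```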

Backslash Special Chars

A backslash \ in a string literal in your code "escapes" a special char we wish to include in the string, such as a quote or \n newline. Common backslash escapes:

\'   # single quote
\"   # double quote
\\   # a backslash
\n   # newline char

A string using '\n':

a = 'First line\nSecond line\nThird line\n'
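Each backslash escape takes two chars to type in the code but stands for a single char in the resulting string. A short sketch to check this with len(); the variable names and the sample path text are made up for illustration.

```python
a = 'First line\nSecond line\nThird line\n'
quote = 'isn\'t'      # \' puts a single quote inside single quotes
path = 'C:\\temp'     # \\ puts one backslash in the string

print(a)              # each \n prints as an actual line break
print(len('\n'))      # the escape is one char, not two
print(quote)
print(path, len(path))
```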

Python strings can be written within triple ''' or """, in which case they can span multiple lines. This is useful for writing longer blocks of text.

a = """First line
Second line
Third line
"""

Format Strings

A format string is a relatively new way to convert a Python value into a string. The format string begins with a literal 'f' before the first quote mark, for example f'format string of {n}'.

Within the format string, curly braces hold an expression such as a local variable, and the result of evaluating that expression is pasted into the string at that spot. For example, f'Value: {n}' yields the string 'Value: 10', assuming the variable n is currently 10.

>>> name = 'Sally'
>>> 
>>> f'Name: {name}'              # Access a var
'Name: Sally'
>>>
>>> scores = [19, 34, 22]
>>> f'Max score: {max(scores)}'  # Call a function
'Max score: 34'

The expression in a format string can also be a function call, such as the {max(scores)} example above calling the max() function.

To include a literal { or } in the string, double it up: {{ makes one { and }} makes one }.

>>> f'A curly brace: {{'
'A curly brace: {'

String Format - Float Digits

When placing a floating point value into a string, by default you get 15+ digits which is kind of ugly and unreadable. Therefore, it's very common to want to reduce the number of digits shown, and the format string has an option to do this.

Within the curly braces, a colon : added after the value enables format options. For floating point, adding :.4 after the value limits the number of digits to around 4.

>>> x = 2 / 3
>>> x
0.6666666666666666
>>>
>>> f'val: {x}'            # {x} - too many digits
'val: 0.6666666666666666'
>>> f'val: {x:.4}'         # {x:.4} - limit digits
'val: 0.6667'

The above will vary from 4 digits if there are leading or trailing zeros. To output exactly 4 digits after the decimal point (e.g. for a series of lines with all numbers the same length), add an 'f' ("fixed") at the end of the format, as in :.4f

>>> f'{0.025:.4}'     # approximately 4 digits
'0.025'
>>> f'{0.025:.4f}'    # fixed to 4 digits
'0.0250'
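The fixed form is a natural fit for printing a series of values with a consistent number of decimal places. A minimal sketch using :.2f on made-up sample prices:

```python
# Made-up sample data for illustration
prices = [3.5, 12.25, 0.333333]

lines = []
for p in prices:
    # :.2f always shows exactly 2 digits after the decimal point
    lines.append(f'price: {p:.2f}')

for line in lines:
    print(line)
```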

String Format - Hexadecimal and Binary

By default, converting an int to a string works in decimal. With a format string, the :x option expresses the value in hexadecimal and :b in binary:

>>> n = 215
>>> f'{n}'     # decimal
'215'
>>> f'{n:x}'   # hexadecimal
'd7'
>>> f'{n:b}'   # binary
'11010111'

The format option :, adds comma separators for thousands:

>>> n = 1000000
>>> f'{n:,}'
'1,000,000'
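Going the other direction, the int() function accepts an optional second parameter giving the base, which is not covered in the text above. This sketch converts 215 to hexadecimal and binary text and back again; the variable names are made up.

```python
n = 215

hex_text = f'{n:x}'   # hexadecimal text
bin_text = f'{n:b}'   # binary text

# int(text, base) parses the text in the given base
from_hex = int(hex_text, 16)
from_bin = int(bin_text, 2)

print(hex_text, from_hex)
print(bin_text, from_bin)
```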

The list of format options is surprisingly large; see the python.org format examples.

String s.format() Function

The string format() function is an older way to paste values into a string. It will likely be supplanted by the format string above. Like format strings, this uses '{}' to mark where things go, and then the value expressions are listed inside the format() function call.

>>> 'Count: {}'.format(67)
'Count: 67'
>>> 'Count: {} and word: {}'.format(67, 'Yay')
'Count: 67 and word: Yay'

String Loops

The standard for i/range/len loop goes through the index numbers of all the characters in a string. With this form, the loop can access each character as s[i].

for i in range(len(s)):
    # use s[i] in here

A plain for-loop can loop over the chars of a string directly, with no need to call range() or len() or use square brackets. This simpler form accesses the chars of the string, but the index numbers are not available:

for ch in s:
    # use ch in here

Calling list() on a string, e.g. list('abc'), yields a list ['a', 'b', 'c'] of its chars.

More details are in the official Python String Docs.
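The two loop forms can be sketched side by side. The index form suits a problem where the answer is an index number; the plain form suits a problem that only looks at the chars. Both function names here are made up for illustration.

```python
def first_digit_index(s):
    """Return the index of the first digit in s, or -1 if there is none."""
    for i in range(len(s)):
        if s[i].isdigit():
            return i   # the index form has i available to return
    return -1


def count_digits(s):
    """Return the number of digit chars in s."""
    count = 0
    for ch in s:       # plain form: just the chars, no index numbers
        if ch.isdigit():
            count += 1
    return count


print(first_digit_index('abc42x'))
print(count_digits('abc42x'))
```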

String Slices

alt: string 'Python' shown with index numbers 0..5

The slice syntax is a powerful way to refer to sub-parts of a string instead of just 1 char: s[ start : end ] returns a substring from s beginning at the start index, running up to but not including the end index. If the start index is omitted, the slice starts from the beginning of the string. If the end index is omitted, the slice runs through the end of the string. If the start index is equal to the end index, the slice is the empty string.

>>> s = 'Python'
>>> s[2:4]
'th'
>>> s[2:]
'thon'
>>> s[:5]
'Pytho'
>>> s[4:4]  # start = end: empty string
''

If the end index is too large (out of bounds), the slice just runs through the end of the string. This is one case where Python is permissive about wrong/out-of-bounds indexes. Similarly, if the start index is greater than or equal to the end index, the slice is just the empty string.

>>> s = 'Python'
>>> s[2:999]
'thon'
>>> s[3:2]  # zero chars
''
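Slices combine naturally with s.find(): find() computes the index numbers, and a slice pulls out the chars between them. A minimal sketch that extracts the text between the first pair of parentheses; the function name between_parens is made up for illustration.

```python
def between_parens(s):
    """Return the text between the first '(' and the next ')' in s.

    Returns the empty string if the parentheses are not found.
    between_parens is a hypothetical helper.
    """
    left = s.find('(')
    right = s.find(')', left + 1)
    if left == -1 or right == -1:
        return ''
    # Start just past the '(' and run up to but not including the ')'
    return s[left + 1:right]


print(between_parens('call (me) now'))
print(between_parens('no parens here'))
```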

Negative Slices

This is a slightly advanced shortcut you do not need to use, but it is handy sometimes. Negative numbers also work within [ ] and slices: -1 is the last char in the string, -2 is the next to last char, and so on. So for example with s = 'Python', s[-1] is 'n' and s[-2] is 'o'. This is convenient when you want to refer to chars near the end of the string, and it works in slices too.

>>> s = 'Python'
>>> s[-1]
'n'
>>> s[-2]
'o'
>>> s[-2:]
'on'
>>> s[:-1]
'Pytho'

Mnemonic: The last char in a string is at index len(s) - 1. So -1 is a shorthand for len(s) - 1, -2 is for len(s) - 2, and so on.

String split()

str.split(',') is a string function which divides a string up into a list of string pieces based on a "separator" parameter that separates the pieces:

>>> 'a,b,c'.split(',')
['a', 'b', 'c']
>>> 'a:b:c'.split(':')
['a', 'b', 'c']

A returned piece will be the empty string if we have two separators next to each other, e.g. the '::' in the example below, or the separator is at the very start or end of the string:

>>> ':a:b::c:'.split(':')
['', 'a', 'b', '', 'c', '']

Special whitespace: split with no parameters at all is a special mode which splits on all whitespace chars (space, tab, newline), groups multiple whitespace chars together, and ignores whitespace at the very beginning and end. So it's a simple way to break a line of text into "words" based on whitespace (note how the punctuation is lumped onto each "word"):

>>> 'Hello there,     he said.\n'.split()
['Hello', 'there,', 'he', 'said.']

File strategy: it's common to use "for line in f" to loop over the lines in a file and then "line.strip()" and "line.split()" to process each line. Often each line in a data file is set up with spaces or commas in a way that works perfectly with split(), like this:

>>> line = 'n,37.22,wsmith\n'
>>> line = line.strip()    # trim off \n
>>> line.split(',')
['n', '37.22', 'wsmith']
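The file strategy can be sketched without an actual file by looping over a list of line strings, which behaves like "for line in f". The sample lines are made up, and the float() conversion of the middle field is an added assumption about what the data means.

```python
# Made-up lines standing in for the contents of a data file
lines = ['n,37.22,wsmith\n', 's,12.5,jdoe\n']

records = []
for line in lines:
    line = line.strip()       # trim off \n
    parts = line.split(',')   # list of 3 string pieces
    # Assumption: the middle piece is a number, so convert it with float()
    records.append([parts[0], float(parts[1]), parts[2]])

print(records)
```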

String join()

','.join(lst) is a string function which is approximately the opposite of split: it takes a list of strings as its parameter and forms it into one big string, using the string it is called on as the separator:

>>> ','.join(['a', 'b', 'c'])
'a,b,c'

The elements in the list should be strings, and join just puts them all together to make one big string. Mnemonic: split() and join() are both noun.verb on a string.
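Since join() expects strings, a list of ints needs to be converted with str() first. A minimal sketch with made-up sample numbers, also showing split() undoing the join:

```python
nums = [19, 34, 22]

# Build a list of the str form of each int
strs = []
for n in nums:
    strs.append(str(n))

joined = ','.join(strs)
print(joined)

# split() with the same separator recovers the string pieces
pieces = joined.split(',')
print(pieces)
```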

Unicode Characters

In the early days of computers, the simple ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, requiring just 1 byte to store 1 character, but it cannot handle characters of other languages.

Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as their own sort of alphabet.

Every unicode character is defined by a unicode "code point" which is basically a big int value that identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. 03A3 is the Sigma character Σ, and 2665 is the heart emoji character ♥.

Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 03A3 or 03a3. Two hex digits together like A3 or FF represent the value stored in one byte, so hex is a traditional way to write out the value of a byte. When you look up a unicode char or emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.

You can write a unicode char inside a Python string with a \u followed by the 4 hex digits of its code point, like '\u03A3' to insert a 'Σ' at that spot.

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'

Notice that the string 'hi \u03A3' is length 4 - the \u03A3 is a lengthy way of making a single character in the string.

For a code point with more than 4 hex digits, use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.

>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4
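The built-in functions ord() and chr(), which are not covered in the text above, convert between a single char and its code point int. This sketch uses them to connect the hex code points to the chars; the variable names are made up.

```python
sigma = '\u03A3'

code_point = ord(sigma)          # the code point as a plain int
hex_text = f'{code_point:04X}'   # same value as 4 uppercase hex digits
print(code_point, hex_text)

# chr() goes the other way: int code point to a 1-char string
fire = chr(0x1F525)
print(len(fire))
print(fire == '\U0001F525')
```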

Not all computers have the ability to display all unicode chars, so the display of a string may fall back to something like \U0001F489, showing you the hex digits for the char, even though the char can't be drawn on screen.

 

Copyright 2020 Nick Parlante