Python Strings

String Basics

A Python string, like 'Hello', stores text as a sequence of individual characters. Text is central to many computations: URLs, chat messages, and the underlying HTML code that makes up web pages.

Python strings are written between single quote marks like 'Hello', or alternatively between double quote marks like "There". Double quotes are handy when the text itself contains a single quote, as in "isn't" below.

a = 'Hello'
b = "isn't"

Each character in a string is drawn from the unicode character set, which includes the characters of pretty much every language on earth, plus many emojis. See the unicode section below for more information.

String len()

The len() function returns the length of a string, the number of chars in it. It is valid to have a string of zero characters, written just as '', called the "empty string". The length of the empty string is 0. The len() function in Python is omnipresent - it's used to retrieve the length of many data types, with string just a first example.

>>> s = 'Python'
>>> len(s)
6
>>> len('')   # empty string
0

Convert Between Int and String

The formal name of the string type in Python is "str". The str() function serves to convert many values to a string form. This code computes the str form of the number 123:

>>> str(123)
'123'

Looking carefully at the values, 123 is a number, while '123' is a string of length 3, made of the three chars '1', '2', and '3'.

Going the other direction, the formal name of the integer type is "int", and the int() function takes in a value and tries to convert it to be an int value:

>>> int('1234')
1234
>>> int('xx1234')   # fails due to extra chars
ValueError: invalid literal for int() with base 10: 'xx1234'
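Since int() raises an error on text that is not a valid number, code that handles untrusted text often wraps the call. Here is a minimal sketch of that idea; the function name parse_int_or_none is made up for this illustration and the try/except form is one possible approach, not the only one.

```python
def parse_int_or_none(text):
    """Return int(text), or None if text is not a valid integer.

    Hypothetical helper for illustration only.
    """
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings like 'xx1234'
        return None


print(parse_int_or_none('1234'))
print(parse_int_or_none('xx1234'))
print(str(parse_int_or_none('1234')))
```

The last line converts the int back to its str form, completing the round trip.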

String Indexing [ ]

Chars are accessed with zero-based indexing with square brackets, so the first char is at index 0, the next index 1, and the last char is at index len-1.

alt: string 'Python' shown with index numbers 0..5

>>> s = 'Python'
>>> len(s)
6
>>> s[0]   # Access char at index 0
'P'
>>> s[1]
'y'
>>> s[5]
'n'
>>> s[6]
IndexError: string index out of range
>>> s[0] = 'x'   # no, string is immutable
TypeError: 'str' object does not support item assignment

Accessing an index number that is too large is an error. Strings are immutable, so the chars in a string can never be changed. Instead, code can create a new string with different chars, as with + in the next section.

String +

The + operator combines (aka "concatenates") two strings to make a bigger string. This creates a new string to represent the result, leaving the original strings unchanged. (See the Working With Immutable section below.)

>>> s1 = 'Hello'
>>> s2 = 'There'
>>> s3 = s1 + ' ' + s2
>>> s3
'Hello There'
>>> s1
'Hello'

Concatenation with + works only between strings. It cannot, for example, combine a string and an int. Call the str() function to make a string out of the int, and then concatenation works.

>>> 'score:' + 6
TypeError: can only concatenate str (not "int") to str
>>> 'score:' + str(6)
'score:6'

String Functions

Here are the most commonly used string functions.

String in

The in operator checks, True or False, if something appears anywhere in a string. In this and other string comparisons, characters must match exactly, so 'a' matches 'a' but does not match 'A'. (Mnemonic: this is the same word "in" as used in the for-loop.)

>>> 'b' in 'abcd'
True
>>> 'B' in 'abcd'
False
>>> 'aa'  in 'iiaaii'  # test string can be any length
True
>>> 'aaa' in 'iiaaii'
False
>>> '' in 'abcd'       # empty string is always found
True
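Because in requires an exact match, a case-insensitive test needs both strings brought to the same case first. The sketch below does this with s.lower(), which is described later in this document. The function name contains_ignore_case is hypothetical.

```python
def contains_ignore_case(text, target):
    """Return True if target appears in text, ignoring upper/lower case.

    Hypothetical helper: lowercases both strings, then uses "in".
    """
    return target.lower() in text.lower()


print('B' in 'abcd')
print(contains_ignore_case('abcd', 'B'))
print(contains_ignore_case('abcd', 'x'))
```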

Character Tests: s.isalpha() s.isdigit() s.isspace()

The characters that make up a string can be divided into several categories or "character classes":

alt: divide chars into alpha (lower/upper), digit, space, and misc leftovers

alphabetic chars - e.g. 'abcXYZ' used to write words. Alphabetic chars are further divided into upper and lowercase versions (the details depend on the particular unicode alphabet).

digit chars - e.g. '0' '1' .. '9' to write numbers

space chars - e.g. space ' ' newline '\n' and tab '\t'

Then there are all the other miscellaneous characters like '$' '^' '<' which are not alphabetic, digit, or space.

These test functions return True if all the chars in s are in the given class:

s.isalpha() - True for alphabetic "word" characters like 'abcXYZ' (applies to alphabetic chars in other unicode alphabets too like 'Σ')

s.isdigit() - True if all chars in s are digits '0..9'

s.isspace() - True if all chars in s are whitespace, e.g. space, tab, newline

s.isupper(), s.islower() - True for uppercase / lowercase alphabetic chars. False for other characters like '9' and '$' which do not have upper/lower versions.

>>> 'a'.isalpha()
True
>>> '$'.isalpha()
False
>>> 'a'.islower()
True
>>> 'a'.isupper()
False
>>> s = '\u03A3'  # Unicode Sigma char
>>> s
'Σ'
>>> s.isalpha()
True
>>> '6'.isdigit()
True
>>> 'a'.isdigit()
False
>>> '$'.islower()
False
>>> ' '.isspace()
True
>>> '\n'.isspace()
True
>>> ''.isalpha()  # empty string is False
False             # for these tests

Unicode aside: In the roman a-z alphabet, all alphabetic chars have upper/lower versions. In some alphabets, there are chars which are alphabetic, but which do not have upper/lower versions.
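As a rough illustration of the character classes, the sketch below tallies how many chars of a string fall into each class. The function name count_classes and the dict of counts are assumptions made for this example; only the is-tests themselves come from the text above.

```python
def count_classes(s):
    """Count the alpha, digit, space, and other chars in s.

    Hypothetical helper: tests each char with isalpha/isdigit/isspace.
    """
    counts = {'alpha': 0, 'digit': 0, 'space': 0, 'other': 0}
    for ch in s:
        if ch.isalpha():
            counts['alpha'] += 1
        elif ch.isdigit():
            counts['digit'] += 1
        elif ch.isspace():
            counts['space'] += 1
        else:
            # misc leftovers like '$' or '!'
            counts['other'] += 1
    return counts


print(count_classes('Hi 42!\n'))
```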

Change Case s.upper() s.lower()

s.lower() - returns a new version of s where each char is converted to its lowercase form, so 'A' becomes 'a'. Chars like '$' are unchanged. The original s is unchanged - a good example of strings being immutable. (See the Working With Immutable section below.) Each unicode alphabet includes its own rules about upper/lower case.

s.upper() - returns an uppercase version of s

>>> s = 'Python123'
>>> s.lower()
'python123'
>>> s.upper()
'PYTHON123'
>>> s
'Python123'

Test s.startswith() s.endswith()

These functions return True/False depending on what appears at one end of a string. They are convenient when you need to check for something at an end, e.g. if a filename ends with '.html'.

s.startswith(x) - True if s starts with string x

s.endswith(x) - True if s ends with string x

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True
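The filename check mentioned above might look like the following sketch. The function name html_files is hypothetical, and lowercasing the name first is an added assumption so that '.HTML' also counts.

```python
def html_files(filenames):
    """Return the filenames that end with '.html', ignoring case.

    Hypothetical helper for illustration.
    """
    result = []
    for name in filenames:
        # lower() first, so 'INDEX.HTML' matches too (an assumption)
        if name.lower().endswith('.html'):
            result.append(name)
    return result


print(html_files(['resume.html', 'notes.txt', 'INDEX.HTML']))
```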

Search s.find()

s.find(x) - searches s left to right, returning the int index where string x appears, or -1 if not found. Use s.find() to compute the index where a substring first appears.

>>> s = 'Python'
>>> s.find('y')
1
>>> s.find('tho')
2
>>> s.find('xx')
-1

The in test just reports True/False if a string is in there. The find() function tells you where it is.

There are some more rarely used variations of s.find(): s.find(x, start_index) - which begins the search at the given index instead of at 0; s.rfind(x) does the search right-to-left from the end of the string.
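One use of the start_index form is to find every place a substring appears, by restarting the search just past each hit. This is a sketch under that assumption; the function name find_all is made up here and is not a built-in.

```python
def find_all(s, x):
    """Return a list of every index where x starts in s.

    Hypothetical helper: repeats s.find(x, start_index) until -1.
    """
    indexes = []
    i = s.find(x)
    while i != -1:
        indexes.append(i)
        # resume the search one char past the previous hit
        i = s.find(x, i + 1)
    return indexes


print(find_all('banana', 'an'))
print(find_all('banana', 'xx'))
print('banana'.rfind('an'))
```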

Strip Whitespace s.strip()

s.strip() - returns a version of s with the whitespace characters (space, tab, newline) from the very start and very end of the string all removed. Handy to clean up strings parsed out of a file or read from a user in a dialog box.

>>> '   hi there  \n'.strip() 
'hi there'

String s.replace()

s.replace(old, new) - returns a version of s where all occurrences of old have been replaced by new. Does not pay attention to word boundaries, just replaces every instance of old in s. Replacing with the empty string effectively deletes the matching strings.

>>> 'this is it'.replace('is', 'xxx')
'thxxx xxx it'
>>> 'this is it'.replace('is', '')
'th  it'

Working With Immutable x = change(x)

Strings are "immutable", meaning the chars in a string never change. Instead of changing a string, code creates new strings.

Suppose we have a string, and want to change it to uppercase and add an exclamation mark at its end, so 'Hello' becomes 'HELLO!'.

This code looks like it will work, but it does not:

>>> s = 'Hello'
>>> s.upper()  # compute upper, but does not store it
'HELLO'
>>> s          # s is not changed
'Hello'

The correct form computes the uppercase form, and also stores it back in the s variable, a sort of x = change(x) pattern:

>>> s = 'Hello'
>>> s = s.upper()  # compute upper, store in s
>>> s = s + '!'    # add !, store in s
>>> s              # s is the new, computed string
'HELLO!'
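The same x = change(x) pattern works inside a function, chaining several steps. A small sketch, with the function name shout chosen just for this example:

```python
def shout(s):
    """Return s trimmed, uppercased, and with '!' added.

    Hypothetical helper: each step stores its result back in s.
    """
    s = s.strip()    # compute trimmed form, store in s
    s = s.upper()    # compute upper, store in s
    s = s + '!'      # add !, store in s
    return s


original = '  Hello '
result = shout(original)
print(result)
print(original)   # the caller's string is unchanged
```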

Backslash Special Chars

A backslash \ in a string literal in your code "escapes" a special char we wish to include in the string, such as a quote or \n newline. Common backslash escapes:

\'   # single quote
\"   # double quote
\\   # a backslash
\n   # newline char

A string using \n:

a = 'First line\nSecond line\nThird line\n'
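To see that each backslash escape produces a single char in the string, here is a short sketch. The variable names and the 'C:\\temp' text are made up for illustration.

```python
quote = 'isn\'t'     # \' puts a single quote inside single quotes
path = 'C:\\temp'    # \\ in the literal makes one backslash in the string
newline = '\n'       # \n is one newline char, not two chars

print(quote == "isn't")
print(len(path))
print(len(newline))
print(path)
```

Printing path shows just one backslash, since the doubled form exists only in the source code.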

Python strings can be written within triple ''' or """, in which case they can span multiple lines. This is useful for writing longer blocks of text.

a = """First line
Second line
Third line
"""

String s.format()

The string format() function is a handy way to paste values into a string. It uses the special marker {} within a string to mark where things go, like this:

>>> 'Count: {}'.format(67)
'Count: 67'
>>> 'Count: {} and word: {}'.format(67, 'Yay')
'Count: 67 and word: Yay'

The older approach would be to compute str(67) manually and use + to put the result string together. The str.format() function is a more convenient tool for that situation.

For floating point values, typically you do not want to print all the digits of a float value. The format marker {:.4g} means print at most 4 significant digits, counting digits on both sides of the decimal point; "g" here is the general format, which works for float and int values as appropriate.

>>> 2/3   # has lots of digits
0.6666666666666666
>>> 'val: {:.4g}'.format(2/3)
'val: 0.6667'
>>> 'val: {:.2g}'.format(2/3)
'val: 0.67'
>>> 'val: {:.2g}'.format(45)
'val: 45'

There are many, many other options for format markers, but {:.4g} is a good one to know for the common situation of printing float values.
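As a small sketch of combining plain {} and {:.4g} markers in one format string, the function below builds a labeled line. The name format_row is hypothetical. Note that "g" counts significant digits, so a value like 1234.5678 comes out as 1235.

```python
def format_row(label, value):
    """Return 'label: value' with the value shown to 4 significant digits.

    Hypothetical helper for illustration.
    """
    return '{}: {:.4g}'.format(label, value)


print(format_row('ratio', 2 / 3))
print(format_row('count', 45))
print(format_row('big', 1234.5678))
```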

String Loops

The standard for i/range() loop goes through all index numbers for s. With this form, the loop can access each character and its index in the loop:

for i in range(len(s)):
    # use s[i] in here

The for loop can also loop over the chars of a string directly, with no need to call range() or len(). This simpler form accesses the chars of the string, but without the index numbers:

for char in s:
    # use char in here

Calling list() on a string, as in list('abc'), yields a list ['a', 'b', 'c'] of its chars.
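The two loop forms can be compared side by side in this sketch. Both function names are made up for the example: count_char needs only the chars, while first_index_of needs the index numbers.

```python
def count_char(s, target):
    """Count how many chars in s equal target (loop over chars directly)."""
    count = 0
    for char in s:
        if char == target:
            count += 1
    return count


def first_index_of(s, target):
    """Return the first index where target appears in s, or -1 (index loop)."""
    for i in range(len(s)):
        if s[i] == target:
            return i
    return -1


print(count_char('banana', 'a'))
print(first_index_of('banana', 'n'))
print(first_index_of('banana', 'z'))
```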

More details at official Python String Docs

String Slices

alt: string 'Python' shown with index numbers 0..5

The slice syntax is a powerful way to refer to sub-parts of a string instead of just 1 char: s[ start : end ] returns a substring from s beginning at the start index, running up to but not including the end index. If the start index is omitted, starts from the beginning of the string. If the end index is omitted, runs through the end of the string. If the start index is equal to the end index, the slice is the empty string.

>>> s = 'Python'
>>> s[2:4]
'th'
>>> s[2:]
'thon'
>>> s[:5]
'Pytho'
>>> s[4:4]  # start = end: empty string
''

If the end index is too large (out of bounds), the slice just runs through the end of the string. This is one case where Python is permissive about wrong/out-of-bounds indexes. Similarly, if the start index is greater than or equal to the end index, the slice is just the empty string.

>>> s = 'Python'
>>> s[2:999]
'thon'
>>> s[3:2]  # zero chars
''
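Slices pair naturally with s.find(): find gives an index, and slices pull out the text on either side of it. The sketch below assumes that use; the function name split_at_first is hypothetical.

```python
def split_at_first(s, sep):
    """Return (before, after) around the first sep in s.

    Hypothetical helper: if sep is not found, returns (s, '').
    """
    i = s.find(sep)
    if i == -1:
        return s, ''
    before = s[:i]               # up to but not including sep
    after = s[i + len(sep):]     # everything past sep
    return before, after


print(split_at_first('name:Alice', ':'))
print(split_at_first('no separator', ':'))
```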

Negative Slices

This is a slightly advanced shortcut you do not need to use, but it is handy sometimes. Negative numbers also work within [ ] and slices: -1 is the last char in the string, and -2 is the next to last char, and so on. So for example with s = 'Python', s[-1] is 'n' and s[-2] is 'o'. This is convenient when you want to refer to chars near the end of the string, and it works in slices too.

>>> s = 'Python'
>>> s[-1]
'n'
>>> s[-2]
'o'
>>> s[-2:]
'on'
>>> s[:-1]
'Pytho'

Mnemonic: The last char in a string is at index len(s) - 1. So -1 is a shorthand for len(s) - 1, -2 is for len(s) - 2, and so on.
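A quick sketch checking the mnemonic and using negative slices in two small helpers. The function names last_three and without_last are made up for this example.

```python
def last_three(s):
    """Return the last 3 chars of s, or all of s if it is shorter."""
    return s[-3:]


def without_last(s):
    """Return s with its final char removed."""
    return s[:-1]


s = 'Python'
print(s[-1] == s[len(s) - 1])   # -1 is shorthand for len(s) - 1
print(last_three(s))
print(last_three('ab'))         # slices are permissive about out-of-bounds
print(without_last(s))
```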

String split()

str.split(',') is a string function which divides a string up into a list of string pieces based on a "separator" parameter that separates the pieces:

>>> 'a,b,c'.split(',')
['a', 'b', 'c']
>>> 'a:b:c'.split(':')
['a', 'b', 'c']

A returned piece will be the empty string if we have two separators next to each other, e.g. the '::' in the example below, or the separator is at the very start or end of the string:

>>> ':a:b::c:'.split(':')
['', 'a', 'b', '', 'c', '']

Special whitespace: split with no parameters at all is a special mode which splits on all whitespace chars (space, tab, newline), groups multiple whitespace chars together, and ignores whitespace at the very beginning and end. So it's a simple way to break a line of text into "words" based on whitespace (note how the punctuation is lumped onto each "word"):

>>> 'Hello there,     he said.\n'.split()
['Hello', 'there,', 'he', 'said.']

File strategy: it's common to use "for line in f" to loop over the lines in a file and then "line.strip()" and "line.split()" to process each line. Often each line in a data file is set up with spaces or commas in a way that works perfectly with split(), like this:

>>> line = 'n,37.22,wsmith\n'
>>> line = line.strip()    # trim off \n
>>> line.split(',')
['n', '37.22', 'wsmith']
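Continuing that strategy, the pieces returned by split() are all strings, so numeric fields still need a conversion. The sketch below assumes the middle field is a float; the function name parse_line and the names given to the three fields are guesses for illustration, since the data format is not specified here.

```python
def parse_line(line):
    """Split a comma-separated line into (code, amount, user).

    Hypothetical helper: the field meanings are assumed, not documented.
    """
    parts = line.strip().split(',')   # trim off \n, then split on commas
    code = parts[0]
    amount = float(parts[1])          # convert the str '37.22' to a float
    user = parts[2]
    return code, amount, user


print(parse_line('n,37.22,wsmith\n'))
```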

String join()

','.join(lst) is a string function which is approximately the opposite of split: it takes a list of strings as its parameter and forms them into one big string, using the string it is called on as the separator:

>>> ','.join(['a', 'b', 'c'])
'a,b,c'

The elements in the list should be strings, and join just puts them all together to make one big string. Mnemonic: split() and join() are both called in noun.verb form on a string.
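Since join() expects strings, a list of ints needs each element converted with str() first. A minimal sketch, with the function name join_numbers made up for this example:

```python
def join_numbers(nums, sep):
    """Return the numbers in nums as one string, separated by sep.

    Hypothetical helper: converts each number with str() before joining.
    """
    strs = []
    for n in nums:
        strs.append(str(n))
    return sep.join(strs)


print(join_numbers([1, 2, 3], ','))

# split() then join() with the same separator gives back the original
print(','.join('a,b,c'.split(',')))
```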

Unicode Characters

In the early days of computers, the simple ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, requiring just 1 byte to store 1 character, but it cannot handle characters of other languages.

Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as their own sort of alphabet.

Every unicode character is defined by a unicode "code point" which is basically a big int value that identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. 03A3 is the Sigma character Σ, and 2665 is the heart emoji character ♥.

Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 03A3 or 03a3. Two hex digits together like A3 or FF represent the value stored in one byte, so hex is a traditional way to write out the value of a byte. When you look up a unicode char or emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.

You can write a unicode char inside a Python string with a \u followed by the 4 hex digits of its code point, like '\u03A3' to insert a 'Σ' at that spot.

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'

Notice that the string 'hi \u03A3' is length 4 - the \u03A3 is a lengthy way of making a single character in the string.

For a code point with more than 4 hex digits, use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.

>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4

Not all computers have the ability to display all unicode chars, so the display of a string may fall back to something like \U0001F4A9 - showing you the hex digits for the char, even though the char can't be drawn on screen.
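The link between a char and its code point can be explored with the built-in functions ord() and chr(), which are not covered above but are standard Python. This sketch converts in both directions; the variable names are made up for the example.

```python
sigma = '\u03A3'
code_point = ord(sigma)                  # the int code point of the char
hex_text = '{:04X}'.format(code_point)   # same int, as 4 uppercase hex digits

fire = chr(0x1F525)                      # build a char from its code point

print(code_point)
print(hex_text)
print(fire == '\U0001F525')
print(len(fire))
```

The 0x prefix is how Python code writes an int literal in hex.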

Copyright 2020 Nick Parlante