Programming in Python for Peak Performance
Wherein I cover a few ways to write your Python code without using a for
loop
Poor Python Practices Prevent Peak Performance
An email goes around, showing timing comparisons between a number of languages -- Python, C/C++, Fortran, and the new kid on the block, Julia -- on a handful of benchmarks. The email proclaims Julia the new king of scientific computing, because it knocks the Mandelbrot set out of the park and tears through arrays compared to Python. Of course C and Fortran still eat Julia's lunch on raw runtime, but it would have taken the developer a month to write in C what took a few days in Julia, and that's a very relevant factor when you think in terms of "time until I solve my problem". Ergo, Julia will inevitably replace Python as the tool of choice for scientific computing.
That was eight years ago, four years after Patil and Hammerbacher started calling the job "data scientist", and when I was asked in a job interview "What does Big Data mean to you?". Python stands undefeated.
This email, or blog post, or whatever it was, does have a kernel of truth to it. If there's one consistent knock against Python, it's that it's quite slow. This is not unwarranted: raw Python can be very slow, thanks to its being dynamically typed and interpreted, which creates an awful lot of overhead to figure out what a variable is before you can act on it. Throw this into a for loop, and it's a recipe for slow.
Just look at a Mandelbrot set calculation, the classic example of loop-heavy numerical code.
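A naive pure-Python version looks something like this sketch (the function name and grid size here are just illustrative choices):

def mandelbrot_grid(width=500, height=500, max_iter=100):
    # pure-Python escape-time Mandelbrot: a for loop over every pixel
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # map this pixel to a point c in the complex plane
            c = complex(-2.0 + 3.0 * col / width, -1.5 + 3.0 * row / height)
            z = 0
            n = 0
            for n in range(max_iter):
                if abs(z) > 2.0:
                    break
                z = z * z + c
            image[row][col] = n  # escape iteration count for this pixel
    return image

Three nested for loops of interpreted, dynamically typed Python, with type checks and object allocations on every pass -- and it shows in the runtime.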
Program your Python with Proper Process in Mind
It is true that Python has worse runtime performance than compiled languages like C or Fortran. This runtime hit comes from many of the features that make Python so popular: the dynamic typing and automatic memory management. There's a way around each, but it involves using other, more performance-optimized Python modules instead of the standard Python library.
Objects, Objects Everywhere
In Python, everything is an object. Everything. So a list of floats? Every float is an object. This has profound implications for memory footprints. A simple integer takes up 28 bytes in Python. If you don't believe me, just run
In [1]: import sys
In [2]: sys.getsizeof(155)
Out[2]: 28
That's telling you the integer is huge. For comparison, a 64-bit integer in C is... 8 bytes, or 64 bits. So already the memory footprint is more than triple what it would be in C.
Then we get to lists of numbers. A Python list is an array of pointers, so each pointer adds its own non-zero size on top of the rather large footprint of each integer, and the integers themselves are not laid out contiguously in memory, so the program ends up chasing pointers all over the place to find them. This is one reason iteration over lists in Python is so blasted slow -- it's basically one continuous cache miss.
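As a rough sketch of the difference, compare a plain Python list of integers with the equivalent NumPy array (more on NumPy below):

import sys
import numpy as np

py_list = list(range(1000))
np_array = np.arange(1000, dtype=np.int64)

# The list object holds 1000 pointers (roughly 8 kB), and each int object
# it points to costs another ~28 bytes, scattered around memory.
print(sys.getsizeof(py_list))

# The NumPy array packs the same values contiguously as 8-byte integers:
# 8000 bytes of actual data.
print(np_array.nbytes)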
To understand why contiguous memory is so important, imagine that reading from memory involves a gerbil scampering from one memory location to another. If it helps, imagine the gerbil is wearing a white Oxford button-down and a black tie. When the program asks for a value in memory, the gerbil has to scamper to that location. Obviously, the most efficient way for the gerbil to run around is if the locations in memory are next to each other. That's what we mean by contiguous memory. If the locations are all over the place, we have to wait for the gerbil to cover a lot of unnecessary ground, slowing down the memory retrieval. And if the location is on another floor of the memory office building, the gerbil has to take the elevator, which can take a while. That's a cache miss. I shared this mostly because I like the gerbil metaphor.
This is true all over Python, but it's not the only performance limiter. There's another thing intrinsic to Python that slows it down, the Global Interpreter Lock.
The Global Interpreter Lock -- Performance's Nemesis
The original reason for the GIL was that Python's memory management system is not thread-safe. Python counts the number of references to an object, and when that number drops to zero it releases the memory. This works fine if one thread is operating at a time, but if you have multiple threads working simultaneously there's a chance of a race condition, and some truly bizarre memory bugs can crop up.
By default, Python sidesteps this with the Global Interpreter Lock (GIL), which allows only one thread at a time to execute Python bytecode. The GIL makes loops "safe" in the sense that it prevents the race conditions, but it also means you're only ever using one of the available cores on your processor to run through the loop.
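You can see this with a minimal sketch (the function name and count are arbitrary): a CPU-bound pure-Python countdown run twice, once sequentially and once on two threads. On CPython the threaded version comes out no faster, and often a touch slower from the lock contention.

import threading
import time

def countdown(n):
    # pure-Python, CPU-bound work: exactly what the GIL serializes
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
countdown(N)
countdown(N)
print("sequential :", time.perf_counter() - start)

start = time.perf_counter()
t1 = threading.Thread(target=countdown, args=(N,))
t2 = threading.Thread(target=countdown, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.perf_counter() - start)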
Old Type, New Type, Red Type, Blue Type
Python is dynamically typed, which means that just because I start a for loop with a being a list of floats does not mean I will end with the same. This is very different from C++, which is statically typed -- when I say std::vector<float> a (10000); I know that a will be a vector of floats now and forever. Because of the dynamic typing in Python, many optimizations available to C++ are not available to Python, because you have to
- check what the types are for every variable in the body of a loop at every iteration,
- check that everything in the body of the loop is allowed with the types of all the variables, and
- make sure there isn't some sort of race condition where multiple threads are motoring through the loop and changing the types or otherwise modifying the data.
In C++, if I write
for (int i = 0; i < 10000; i++) {
c[i] = a[i] + b[i];
}
a compiler can look at a, b, and c at compile time, figure out that they're all arrays of the correct type, and generate code that runs through the arrays in chunks -- handing chunks off to SIMD units or to multiple threads, each operating independently. This is automatic vectorization, and lots of HPC codes rely on it to wring maximum performance out of minimal code.
Python cannot do this, because it cannot know a priori if iteration 577 will change the type of a and change what the plus operation means. This extremely dynamic behavior is a nightmare for compilers and interpreters, and is a huge performance bottleneck for Python.
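Here's a contrived sketch of the sort of thing the interpreter has to be prepared for on every single iteration:

a = 0                    # a starts life as an int
for i in range(1000):
    if i == 577:
        a = 0.5          # perfectly legal: a silently becomes a float
    a = a + 1            # "+" means int addition on some iterations, float addition on others
print(type(a))           # <class 'float'>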
Performant Techniques: A Farewell to Alliteration
The inherent structures of Python that make it so pleasant to program in are arrayed against performance. Fortunately, one of Python's other great strengths, extensibility, can come to the rescue. There are a number of standard Python modules that can bring performance up and make Python a viable language for scientific software.
NumPy/SciPy -- Fortran-fast array operations
Perhaps the easiest way to boost Python performance is to use NumPy arrays. NumPy and its sister module the SciPy library are the core ingredients of scientific Python computing, the mirepoix of your high-performance ingredients, if you will. NumPy's array machinery is compiled C, and its linear algebra routines are built atop BLAS and LAPACK, two Fortran libraries that have been optimized to within an inch of their lives for forty years. Because the code underlying NumPy is compiled, the GIL doesn't affect performance, and the end result is impressive.
Consider the simplest of timing tests: adding two large arrays. On my laptop, running
def sum_for(x, y):
    c = [0]*len(x)
    for idx in range(len(x)):
        c[idx] = x[idx]+y[idx]
    return c
on two 10,000-element Python lists takes 1 msec. Running the same loop on identical NumPy arrays takes 4 msec. On the other hand, running
def sum_np(x, y):
    return x + y
on two NumPy arrays takes 6 microseconds -- over 150x faster! Such is the power of letting compiled code do the heavy number crunching, with its access to multiple threads, vectorization, and more efficient memory usage.
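If you want to run this kind of comparison yourself, a quick IPython session along these lines will do it (the exact numbers will of course depend on your machine):

In [1]: import numpy as np

In [2]: x = list(np.random.random(10000)); y = list(np.random.random(10000))

In [3]: %timeit sum_for(x, y)        # pure-Python loop over lists

In [4]: xa, ya = np.array(x), np.array(y)

In [5]: %timeit sum_np(xa, ya)       # vectorized NumPy addition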
Basically all the Python scientific libraries are built to be NumPy-aware. SciPy is NumPy-aware and offers fast functions for vectorized, multithreaded computation. (Confusingly, there is the SciPy library, which includes lots of special functions, interpolation routines, optimization algorithms, physical constants, etc., and then there is SciPy the ecosystem, which includes the SciPy library, NumPy, Pandas, Matplotlib, and a bunch of other libraries.)
If you've been using Python for numerical applications at all, you're probably already using NumPy and SciPy. As such, this is not a wild surprise. It is so prevalent at this point that everything I wrote is likely common knowledge, but it is the foundation on which all your most efficient Python programming gets built.
Python's biggest strength is the large number of libraries that you can use. Use libraries that are built for efficient number-crunching when you need efficient number-crunching.
numexpr -- Memory-efficient array computations
Okay, we're using NumPy now, and we're using vectorized elementwise operations to compute $a \times \exp (b \times c)$ on three arrays, $a$, $b$, and $c$, each containing 10,000 elements. Our code might look like this
import numpy as np
from numpy import exp
a = np.random.random(10000)
b = np.random.random(10000)
c = np.random.random(10000)
y = a*exp(b*c)
and we find that evaluating the last line takes 94 microseconds on my laptop. Instead, we try using the numexpr library
import numexpr
y = numexpr.evaluate('a*exp(b*c)')
and it runs in 52 microseconds. What? How did that take half as much time?
When you write a * exp(b*c) and Python goes to evaluate it, it first computes b*c and stores that in a temporary array, which it then exponentiates, returning another temporary array, and finally multiplies that array by a to produce the final array. So that's two temporary arrays that have to be written to memory, and depending on the size of the arrays they may not fit in cache. (There's a good description of how runtime performance relates to array size and cache misses here, if you want to start learning to think in these terms.) The alternative would be to create the final array at the correct size and then loop over each element -- a very C-like approach -- which doesn't require any excess memory but which, without pre-compiling the for loop, will be exceedingly slow.
What numexpr does is compile the expression, automatically break the arrays into cache-sized chunks, and allocate only the minimal number of intermediate arrays needed to evaluate the expression. This sort of careful, memory-efficient code is the kind of thing C or Fortran is built for, and numexpr handles it automatically. Numexpr claims speed-ups of up to 20x are possible, with typical values in the 2x to 4x range. So if you're writing numerically intensive Python and all your functions are supported in numexpr, this can be a really lightweight, easy way to speed things up around the edges.
This does have its limits, of course. If I'm evaluating something very simple, like a + b, numexpr comes out about 5x slower than straight-up NumPy. There's overhead associated with parsing and compiling the expression, so before you assume numexpr is the solution to all your performance problems, drop into IPython and run some quick tests with %timeit to make sure you'll actually benefit.
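Something like this, for instance:

In [1]: import numpy as np, numexpr

In [2]: a = np.random.random(10000); b = np.random.random(10000)

In [3]: %timeit a + b                       # plain NumPy: a single compiled loop

In [4]: %timeit numexpr.evaluate('a + b')   # numexpr pays its parsing and dispatch overhead on every call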
Successive array operations in NumPy spend a lot of their time writing temporary arrays to memory. numexpr parses simple expressions and evaluates them in a way that is much more memory-efficient.
Numba -- JIT compilation for faster loops
Sometimes you just need a for loop. Whether it's the Mandelbrot set example we used above, some other iterative approach where the next entry in your array of outputs is some complex function of the inputs, or a non-vectorizable operation like assigning charge to a grid in particle-in-cell simulations, sooner or later your Python code will need a loop. But that doesn't mean it has to be a slow loop.
As we discussed earlier, the primary reasons loops are so slow in Python are the GIL, which forbids multithreading, and the dynamic typing, which demands type checking at every iteration -- neither of which is an issue in pre-compiled code where the types are known and aren't going to change over the course of the loop. Numba is a just-in-time (JIT) compiler for Python that produces compiled functions with static types, opening the door to high-performance optimizations. The sacrifice is that, as a JIT compiler, it recompiles the function the first time it runs in each new session. So there's some up-front overhead that wasn't there before, but for big, intensive applications Numba can make Python competitive with compiled languages like C and Fortran without the associated developer time.
Watch what happens to our Mandelbrot set example from before. Running the Python code
def mandelbrot(c):
    z = 0
    for idx in range(1, 10000):
        z = z*z + c
took 733 microseconds, which isn't that long. But by simply adding a Numba decorator
from numba import jit

@jit(nopython=True)
def mandelbrot(c):
    z = 0
    for idx in range(1, 10000):
        z = z*z + c
after compilation the function takes about 124 nanoseconds, roughly 6,000x faster than the raw Python! This is an extreme example (I am gobsmacked at this result myself), and for our earlier vectorization example -- adding two NumPy arrays -- Numba actually runs slower, because the compilation takes time and then offers no particular advantage over NumPy. But for code that demands iteration, Numba's compilation bypasses the GIL and the type-checking slowdowns, delivering compiled-code speeds just from adding a decorator to your functions.
Sometimes loops are inevitable. When it can be used, Numba removes many of the things that slow loops down -- it releases the GIL and compiles with static types to avoid repeated type checking.
hdf5 -- read/write large datasets in binary
A few years ago I attended a virtual seminar on how to make the most of the high-performance computing facilities at NERSC, and was gobsmacked to learn that about 80% of jobs running at NERSC dump their data in ASCII format. This is a catastrophic waste of time both writing out and reading in data compared to what you should be using, which is a binary format.
Think of it this way. When you read in an ASCII file, Python has to read in text and translate it to floats or ints or whatever. Python does this natively, but it's not fast. With the Python standard library, you also have to iterate line by line through the text file, which demands a for loop, and you know how we feel about that. Pandas does better for structured data like CSV files, but ASCII files are huge compared to a binary format. Consider the number 255, the largest number that can be represented in 8 bits: it takes three ASCII characters at 8 bits each, so storing 255 as text takes three times the space of storing it in binary. This gets worse for floating point numbers, big numbers, etc.
tl;dr, your ASCII files with numbers in them are comically large.
To see this, let's try writing a big NumPy array, in this case a 100,000 x 100 array. We compare the write times and file sizes in the table below. The conclusion: text files are comically large and slow, CSV files are smaller but still large, and the binary formats are both significantly smaller and much faster to write. The last format, hdf5, is over 100,000 times faster to write and a factor of 3-30 smaller than a TXT or CSV file, so maybe we should pay attention to that.
Method | File size | Write time
---|---|---
np.savetxt (.txt file, NumPy) | 2 GB | 8.5 sec
np.savetxt (.csv file, NumPy) | 250 MB | 8.5 sec
df.to_csv (.csv file, Pandas) | 193 MB | 16.8 sec
np.save (.npy file, NumPy) | 80 MB | 173 msec
h5f.create_dataset (.h5 file, h5py) | 80 MB | 79.3 $\mu$sec
File I/O being a bottleneck to performance has been a long-running problem in the high-performance computing world, and one of the standard tools for writing and later reading data has been the Hierarchical Data Format hdf5. The great thing about hdf5 is you can store pretty much anything in it, including Python objects, NumPy arrays, etc., as well as adding metadata to what you are storing so the next person to open the file knows what you put in there.
This being a scientific tool and our subject being Python, there are a few modules for that, including h5py and PyTables. Pandas can even read and write hdf5 by building on the PyTables module, so data scientists already working in DataFrames don't have to learn a new interface.
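As a minimal h5py sketch (the file and dataset names are just for illustration), writing and then reading back an array with some metadata attached looks like this:

import h5py
import numpy as np

data = np.random.random((100000, 100))

# write: store the array as a dataset and attach descriptive metadata
with h5py.File("results.h5", "w") as h5f:
    dset = h5f.create_dataset("measurements", data=data)
    dset.attrs["description"] = "100,000 x 100 array of doubles, test data"

# read it back later
with h5py.File("results.h5", "r") as h5f:
    loaded = h5f["measurements"][:]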
NumPy also has the ability to store arrays in .npy files, which again is significantly faster for writing and reading (think 10x or more) than CSV. .npy files only hold NumPy arrays, so they don't have the full flexibility or self-documenting metadata of hdf5, but if you're already operating entirely on NumPy arrays, np.save and np.load are the path of least resistance.
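That path is about as short as file I/O gets (file name again illustrative):

import numpy as np

data = np.random.random((100000, 100))
np.save("results.npy", data)      # binary write, no text conversion
loaded = np.load("results.npy")   # reads straight back into a NumPy array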
Don't read or write your large data in TXT or CSV formats. Use a binary format like hdf5 or .npy instead -- these can be tens or hundreds of times faster.
Improving Python Performance Requires Puny Programming Permutations
Writing fast Python does not require writing Python-wrapped C or Fortran libraries to call into, although you can certainly do this, and it has become common practice in scientific computing circles to wrap legacy codes in Python rather than attempt to rewrite the entire code base. For data science types or other scientists who aren't wrestling with large legacy code bases and just want to crunch some numbers to get an answer, Python is a great option for implementing the analysis quickly and getting to the results. Writing that Python to be fast -- in many cases almost as fast as its compiled brethren -- doesn't require a lot of expertise, and can be summarized in five basic rules:
- Use NumPy arrays for number crunching
- Never use loops where vectorized array operations are available
- If you have many arrays in your vectorized operation, run time tests with numexpr
- If you must use a loop, consider using a jit compiler like Numba
- Write your data in a binary file format
These rules are not onerous, and require only four sets of Python modules -- NumPy/SciPy, numexpr, Numba, and h5py/PyTables. Some of these require almost no changes to your code; just switching to NumPy arrays and vectorized operations is enough. Others, like switching to a binary file format, will require rewriting your code's file I/O, but that is likely worth it if you are reading and writing large datasets.
I should mention my biases and omissions. I did not talk about GPU programming in Python, which can be accomplished with Numba, yes, but also with PyCUDA or PyOpenCL, depending on which you want to use. This is because I have done little to no GPU coding. I have also omitted distributed-memory computations, where data has to be passed between processes, which you could do with mpi4py. I have used it, but distributed-memory programming is a whole different beast, with a lot of issues you don't run into in simple multithreaded applications where messages don't have to be passed. And frankly, it isn't an issue for most of the data-intensive applications, like deep neural networks, that people are worried about today.