Jiaaro

…on life, the universe, and everything

Python performance the easy(ish) way

Ctypes is great. Let's start with a simple (and trite) example: Summing a range of numbers.

Here it is in python:

def sumrange(arg):
    return sum(xrange(arg))

Very nice! But let's say we're summing a LOT of numbers, like 0 through 10**8 (that's 100,000,000).

>>> %timeit sumrange(10**2)
1000000 loops, best of 3: 1.42 us per loop

>>> %timeit sumrange(10**8)
1 loops, best of 3: 1.13 s per loop

>>> %timeit sumrange(10**10)
1 loops, best of 3: 701 s per loop

Well that's no fun. Let's try something else:

def sumrange2(arg):
    x = i = 0
    while i < arg:
        x += i
        i += 1
    return x

How'd we do?

>>> %timeit sumrange2(10**2)
100000 loops, best of 3: 14.9 us per loop

>>> %timeit sumrange2(10**8)
1 loops, best of 3: 16.5 s per loop

Wow... that's even worse... (I dare not attempt this one with 10**10) so how do we speed it up? (don't suggest math tricks... this is the the new world of computing!)

*edit*: yes I know there's a constant time algoritm, n*(n+1)/2 . That isn't the purpose of this post

how about ctypes?

sumrange.c

#include <stdio.h>

unsigned long long sumrange(unsigned long long arg)
{
    unsigned long long i, x;
    x = 0;

    for (i = 0; i < arg; i++) {
        x = x + i;
    }
    return x;
}

compile it:

$ gcc -shared -Wl,-install_name,sumrange.so -o sumrange.so -fPIC sumrange.c -O2

and the python code...

import ctypes
sumrange_ctypes = ctypes.CDLL('/path/to/sumrange.so').sumrange
sumrange_ctypes.restype = ctypes.c_ulonglong

I learned on StackOverflow that you have to set the restype or else python will assume it's an int (which will overflow)

And the winner is...?

>>> %timeit sumrange_ctypes(10**8)
1000000 loops, best of 3: 590 ns per loop

Wow that's fast... too fast. Let's experiment

>>> %timeit sumrange_ctypes(10**2)
1000000 loops, best of 3: 601 ns per loop

>>> %timeit sumrange_ctypes(10**10)
1000000 loops, best of 3: 592 ns per loop

>>> %timeit sumrange_ctypes(10**16)
1000000 loops, best of 3: 602 ns per loop

It seems gcc has optized this into a constant time algorithm LOL. Let's try without the optimize flag (for the record I tried with -O1 and it was still contant time)

$  gcc -shared -Wl,-install_name,sumrange.so -o sumrange.so -fPIC sumrange.c

... and in python (ipython in my case)...

>>> %timeit sumrange_ctypes(10**2)
1000000 loops, best of 3: 807 ns per loop

>>> %timeit sumrange_ctypes(10**8)
1 loops, best of 3: 214 ms per loop

>>> %timeit sumrange_ctypes(10**10)
1 loops, best of 3: 3.01 s per loop

Roundup!

iterations      |   10**2       10**8       10**10
--------------------------------------------------
pure python     |   1.42 us     1.13 s      701 s
ctypes          |   807 ns      214 ms      3.01 s

That is a hell of a performance boost!

For nodejs hackers, the node equivalent of ctypes is FFI (Foreign Function Interface): https://github.com/rbranson/node-ffi