Time-profiling code

Scientists "know" that faster programs is better. But what does this mean?

First, most of the time you will spend more time writing programs than running them. All the previous talks optimize writing, debugging, collaborating, and maintaining programs and is the most important thing you can do.

But as for optimizing program speed, there is one saying you see everywhere: Premature optimization is the root of all evil. This means, first you want to make a good program, then profile it to see where the slow parts are, then optimize just those parts. This talk is about profiling: the process of systematically examining program execution times to learn the bottlenecks


Profiling and optimizing

There is a famous saying:

Premature optimization is the root of all evil

What this saying means is that you should not try to optimize your program before writing it. Programs are very complex interactions between processor architectures, memory, multiple layers of caches, compilers, and the instructions themselves. If you try to guess where the slow part is, you will probably get it wrong and end up wasting your time.


Wikipedia: Profiling is dynamic program analysis of time, space, or other resource usage.

What you can't see, you can't improve

Example from my work

Example output (click link):

Example of gprof2dot profile.  Click for full image.
  • I find one slow branch that takes most time.

This is a directed graph showing the flow of time through the program. At the top is the entry point, and looking down you see the flow of time (in fraction of total time) distributed among all functions. Colors serve as a visual aid to see where the hot spots are. Arrows point from callers to callees, and include number of times called.

This is not "a profile". This is one representation of a profile.

How to collect the profiling data

Remember, the whole point of this is that you should write good programs first, and then profile. Of course, sometimes you will profile during development, but don't go crazy sacrificing readability for optimizations. Chances are that will be refactored out later anyway.

Profiling in different languages

Collecting profiling information

In the first step, we run the program using the cProfile module. This just stores up the profile information, and we will examine it in the next step.

Change this:

$ python SCRIPT.py arg1 arg2 ....

To this:

$ python -m cProfile -o profile.out  SCRIPT.py arg1 arg2 ....


The next step is to visualize or analyze the data in profile.out.

I personally prefer first running and storing the profile in profile.out, and then visualizing, to be better. If the call graph is not useful enough, I can visualize it again using different options or examine it using the command line for more details. Also, if I make a change, I can compare the new and old profiles to see how it affected things. This is important!

python -m MODULE is the same as "python /path/to/the/MODULE.py". It is a common shortcut.

This step works on any platform.

Sample output: profile-growsf.out

Visualizing the profile information (gprof2dot)

Example output (click link):

Example of gprof2dot profile.  Click for full image.

This assumes that you are on Linux, or some operating system with pipes, and graphviz installed. There are graphviz tools for other platforms.

If you want to save the image, change |display to > profile.png.

Some nomenclature

What to look for in a profile

There are also tools for line-based, instead of function-based, profiling. However, due to the overheads of Python it's not common there.

I don't have magic suggestions on how to improve things. After seeing enough profiles, and a future optimization talk, you will gain intuition on how to do things. Most importantly, by examining profiles before and after your changes, you will be in a position to know what works and what doesn't.

Example: profile before and after optimization

Here, you see two call graphs: before and after optimization.

Call graph before optimizing Call graph after optimizing
Profile sources: before and after.

Example: calling external processes

Call graph after optimizing

Profile source: profile-external.out

On the left of this figure, we see various external community detection methods running using the subprocess module.

Example: using a library for work

Call graph after optimizing


Examining profile.out from the command line

You can examine the raw profile.out data using the command line, without making a picture.

$ python -m pstats profile.out

% strip           # make output names shorter
% sort time       # Sort by time
% stats 15        # Print top 15 lines


  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1000020  119.340    0.000  134.635    0.000 cluster.py:59(_triangles_and_degree_iter)
      21   53.178    2.532   53.178    2.532 {time.sleep}
     381   18.685    0.049   18.685    0.049 {cPickle.loads}
      20    9.450    0.473   13.629    0.681 cmtyembed.py:67(nembed_m)
10999400    7.203    0.000    7.203    0.000 graph.py:294(__getitem__)

Available commands:

shorten filenames (recommended)
sort [ time | cumul | other ]
sort the data by total time, cumulative time, or any of the options.
print N
print first N entries
callees [funcname]
Print functions which funcname called and time spent in each - only time spent in direct calls from funcname
callers [funcname]
Print functions which called funcname and how much time was spent in calls from each function.

Let's study the difference in total time and cumulative time by looking at profile-external.out.

Sorting by total time (sort time), we see that internal subprocess calls take up most of the time, and this is expected. I scan down the list of functions until I get to the first function that I actually wrote, and it is run_louvain. I see that this takes up only 9 seconds out of 803 total seconds. So I consider this program to be written well enough, since I can't change subprocess (perhaps I could call the program in a way that doesn't read in data, but I don't want to do that now.)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6565  803.707    0.122  803.707    0.122 {posix.waitpid}
     6565   61.486    0.009   61.486    0.009 {posix.read}
     6565   40.397    0.006   40.397    0.006 {posix.fork}
    19726   31.600    0.002   31.600    0.002 {open}
    36865    9.419    0.000    9.419    0.000 {posix.access}
     5050    9.411    0.002  134.704    0.027 algorithms.py:1136(run_louvain)
    13022    6.612    0.001    6.612    0.001 {posix.remove}
     2534    5.746    0.002    5.746    0.002 {method 'close' of 'file' objects}
        5    5.569    1.114 1050.805  210.161 p1-data.py:21(run)
  2987405    5.558    0.000    5.558    0.000 misc.py:31(is_string_like)
     5050    4.332    0.001   14.017    0.003 algorithms.py:1183(read_cmtys_and_return)
    18622    3.723    0.000    3.723    0.000 {posix.lstat}
     1515    3.401    0.002  991.595    0.655 algorithms.py:134(__init__)
2820992/1515    2.968    0.000    7.946    0.005 copy.py:145(deepcopy)
  1895261    2.957    0.000    6.497    0.000 pajek.py:215(make_qstr)

Now, let's sort it by cumulative time (sort cumul). You see that the <module> takes up all the time, as you expect.

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.184    0.184 1051.619 1051.619 p1-data.py:1(<module>)
     5    5.569    1.114 1050.805  210.161 p1-data.py:21(run)
  1515    3.401    0.002  991.595    0.655 algorithms.py:134(__init__)
  6565    1.226    0.000  927.022    0.141 algorithms.py:327(call_process)
  6565    0.279    0.000  908.277    0.138 subprocess.py:485(call)
 13130    0.119    0.000  865.313    0.066 subprocess.py:475(_eintr_retry_call)
  6565    0.126    0.000  803.927    0.122 subprocess.py:1286(wait)
  6565  803.707    0.122  803.707    0.122 {posix.waitpid}
   505    0.985    0.002  771.701    1.528 algorithms.py:765(run)
   505    1.059    0.002  149.446    0.296 algorithms.py:1118(run)
  5050    9.411    0.002  134.704    0.027 algorithms.py:1136(run_louvain)
  6565    0.230    0.000  103.976    0.016 subprocess.py:619(__init__)
  6565    1.258    0.000  103.684    0.016 subprocess.py:1111(_execute_child)
  6565   61.486    0.009   61.486    0.009 {posix.read}
  6565   40.397    0.006   40.397    0.006 {posix.fork}

I just know what files most of these functions are in. If you see a file name that you don't recognize, like misc.py, restart the profile browser and don't run strip and you will see full file names.


Profile C code

Any decent language will have profiling facilities. For C:

C code must be compiled with profile support, and then it automatically appears when you run it, unlike Python where you run it differently to invoke the profiling hooks.

C (and other compatible compiled codes), in general, has a whole lot more instrumentation capabilities.

gprof example profile output (C code)

int f(x) { y(x);  x+=1;  y(x); }  // f is fast, but calls y twice
int y(z) {                        // y is slow
  for ( ; z>=0 ; z-- ) {
} }

int main() {                      // main has a loop, calls f and y
  int i;
  for (i=0; i<10000; i++) {
} }


  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
101.15      0.62     0.62    30000    20.57    20.57  y
  0.00      0.62     0.00    10000     0.00    41.13  f
% time
Self explanatory, fraction of time in this function.
self seconds
Seconds spent in this functions code.
total seconds
Seconds spent in a function and functions called by this function.
As we can see, this is pretty similar to the output from the pstats command line browser. You can also use gprof2dot on gmon.out, as well as lots of other tools.

Stochastic vs deterministic profiling

Everything in this talk uses deterministic profiling, and probably it is the main thing you will use. However, you should know that there is a wide variety of techniques behind profiling, including some serious tools for dynamic program analysis. If you ever have a program with mainly small, fast function calls, consider stochastic profiling.

Profiling from the Python shell (and ipython)

To profile something from the Python shell, or only one function within a program:

import cProfile
cProfile.run('func()', 'filename.out')

IPython has a shortcut for running this. I would usually save it to another file and visualize with gprof2dot.py.

%prun [-s sort-key] [-D filename.out] [statement]
These tools can make and print the text-based profile all in one go. Perhaps that is useful sometimes from the command line for quick things. For big things, I'd generally prefer to make and save to a file for further analysis.

Other profiling tools

oprofile is a neat kernel-based profiler. It can profile everything on your system, and make line-based profiles. (Example line profile and summary report)

Memory profiling is tricky in Python. Since objects have shared ownership, you can't tie them to specific locations in code so easily. I have rarely needed to use memory profiling in Python.

How to use your profile: Actually optimizing your code




Line profiling in Python

Memory profiling in Python