Scientists "know" that faster programs is better. But what does this mean?
First, most of the time you will spend more time writing programs than running them. All the previous talks optimized writing, debugging, collaborating, and maintaining programs, and that is the most important thing you can do.
But as for optimizing program speed, there is one saying you see everywhere: premature optimization is the root of all evil. This means that first you want to make a good program, then profile it to see where the slow parts are, then optimize just those parts. This talk is about profiling: the process of systematically examining program execution times to find the bottlenecks.
There is a famous saying:
Premature optimization is the root of all evil
What this saying means is that you should not try to optimize your program while you are first writing it. Programs are very complex interactions between processor architectures, memory, multiple layers of caches, compilers, and the instructions themselves. If you try to guess where the slow part is, you will probably get it wrong and end up wasting your time.
Wikipedia: Profiling is dynamic program analysis of time, space, or other resource usage.
What you can't see, you can't improve
Example output:
- I find one slow branch that takes most of the time.
This is a directed graph showing the flow of time through the program. At the top is the entry point, and looking down you see the flow of time (as a fraction of total time) distributed among all functions. Colors serve as a visual aid to see where the hot spots are. Arrows point from callers to callees, and include the number of times called.
This is not "a profile". This is one representation of a profile.
Remember, the whole point of this is that you should write good programs first, and then profile. Of course, sometimes you will profile during development, but don't go crazy sacrificing readability for optimizations. Chances are that will be refactored out later anyway.
In the first step, we run the program using the cProfile module. This just stores up the profile information, and we will examine it in the next step.
Change this:
    $ python SCRIPT.py arg1 arg2 ....
To this:
    $ python -m cProfile -o profile.out SCRIPT.py arg1 arg2 ....
Explanation:
The next step is to visualize or analyze the data in profile.out.
I personally find it better to first run and store the profile in profile.out, and then visualize it separately. If the call graph is not useful enough, I can visualize it again with different options, or examine it from the command line for more details. Also, if I make a change, I can compare the new and old profiles to see how it affected things. This is important!
python -m MODULE is the same as "python /path/to/the/MODULE.py". It is a common shortcut.
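For example (the path in the second command is only illustrative; the exact location of cProfile.py depends on your Python installation):

    $ python -m cProfile script.py
    $ python /usr/lib/python2.7/cProfile.py script.py    # roughly the same thing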
This step works on any platform.
Sample output: profile-growsf.out
gprof2dot.py is a call graph visualizer. It is my main tool for profile visualization.
It takes the profile.out and converts it to a call graph in the graphviz language, which can then be visualized or analyzed in different ways. You then use graphviz (dot) to make a picture.
This command runs everything and displays the output all at once.
    $ python gprof2dot.py -f pstats profile.out | dot -Tpng | display
This creates and displays the PNG all in one go.
Download gprof2dot.py (it's a single script file) and get help here: https://code.google.com/p/jrfonseca/wiki/Gprof2Dot
Example output:
This assumes that you are on Linux, or some other operating system with pipes, and that graphviz is installed. There are graphviz tools for other platforms.
If you want to save the image, change |display to > profile.png.
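Putting these pieces together, the saved-image variants look like this (output file names are just examples):

    $ python gprof2dot.py -f pstats profile.out | dot -Tpng > profile.png
    # or keep the intermediate graphviz file, so you can re-render it later:
    $ python gprof2dot.py -f pstats profile.out > profile.dot
    $ dot -Tpng profile.dot > profile.png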
There are also tools for line-based, instead of function-based, profiling. However, because of Python's overhead, line-based profiling is less commonly used there.
I don't have magic suggestions on how to improve things. After seeing enough profiles, and a future optimization talk, you will gain intuition on how to do things. Most importantly, by examining profiles before and after your changes, you will be in a position to know what works and what doesn't.
Here, you see two call graphs: before and after optimization.
Profile sources: before and after.
Profile source: profile-external.out
On the left of this figure, we see various external community detection methods being run via the subprocess module.
You can examine the raw profile.out data using the command line, without making a picture.
    $ python -m pstats profile.out
    % strip          # make output names shorter
    % sort time      # sort by time
    % stats 15       # print top 15 lines
Output:
       ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1000020  119.340    0.000  134.635    0.000  cluster.py:59(_triangles_and_degree_iter)
           21   53.178    2.532   53.178    2.532  {time.sleep}
          381   18.685    0.049   18.685    0.049  {cPickle.loads}
           20    9.450    0.473   13.629    0.681  cmtyembed.py:67(nembed_m)
     10999400    7.203    0.000    7.203    0.000  graph.py:294(__getitem__)
Available commands: strip, sort, stats, callers, callees, help, and quit.
Let's study the difference between total time and cumulative time by looking at profile-external.out.
Sorting by total time (sort time), we see that internal subprocess calls take up most of the time, which is expected. I scan down the list of functions until I get to the first function that I actually wrote, which is run_louvain. I see that it takes up only 9 seconds, compared to the 803 seconds spent waiting on subprocesses. So I consider this program to be written well enough, since I can't change subprocess (perhaps I could call the program in a way that doesn't read in data, but I don't want to do that now).
           ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
             6565  803.707    0.122  803.707    0.122  {posix.waitpid}
             6565   61.486    0.009   61.486    0.009  {posix.read}
             6565   40.397    0.006   40.397    0.006  {posix.fork}
            19726   31.600    0.002   31.600    0.002  {open}
            36865    9.419    0.000    9.419    0.000  {posix.access}
             5050    9.411    0.002  134.704    0.027  algorithms.py:1136(run_louvain)
            13022    6.612    0.001    6.612    0.001  {posix.remove}
             2534    5.746    0.002    5.746    0.002  {method 'close' of 'file' objects}
                5    5.569    1.114 1050.805  210.161  p1-data.py:21(run)
          2987405    5.558    0.000    5.558    0.000  misc.py:31(is_string_like)
             5050    4.332    0.001   14.017    0.003  algorithms.py:1183(read_cmtys_and_return)
            18622    3.723    0.000    3.723    0.000  {posix.lstat}
             1515    3.401    0.002  991.595    0.655  algorithms.py:134(__init__)
     2820992/1515    2.968    0.000    7.946    0.005  copy.py:145(deepcopy)
          1895261    2.957    0.000    6.497    0.000  pajek.py:215(make_qstr)

Now, let's sort it by cumulative time (sort cumul). You see that the <module> takes up all the time, as you expect.
           ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
                1    0.184    0.184 1051.619 1051.619  p1-data.py:1(<module>)
                5    5.569    1.114 1050.805  210.161  p1-data.py:21(run)
             1515    3.401    0.002  991.595    0.655  algorithms.py:134(__init__)
             6565    1.226    0.000  927.022    0.141  algorithms.py:327(call_process)
             6565    0.279    0.000  908.277    0.138  subprocess.py:485(call)
            13130    0.119    0.000  865.313    0.066  subprocess.py:475(_eintr_retry_call)
             6565    0.126    0.000  803.927    0.122  subprocess.py:1286(wait)
             6565  803.707    0.122  803.707    0.122  {posix.waitpid}
              505    0.985    0.002  771.701    1.528  algorithms.py:765(run)
              505    1.059    0.002  149.446    0.296  algorithms.py:1118(run)
             5050    9.411    0.002  134.704    0.027  algorithms.py:1136(run_louvain)
             6565    0.230    0.000  103.976    0.016  subprocess.py:619(__init__)
             6565    1.258    0.000  103.684    0.016  subprocess.py:1111(_execute_child)
             6565   61.486    0.009   61.486    0.009  {posix.read}
             6565   40.397    0.006   40.397    0.006  {posix.fork}

I just know what files most of these functions are in. If you see a file name that you don't recognize, like misc.py, restart the profile browser and don't run strip, and you will see full file names.
References:
- https://docs.python.org/2/library/profile.html
- Better tutorial: ???
Any decent language will have profiling facilities. For C:
C code must be compiled with profiling support; the profiling data then appears automatically when you run the program. This is unlike Python, where you run the program differently to invoke the profiling hooks.
C (and other compiled languages), in general, have a whole lot more instrumentation capabilities.
    void y(int z);                 // forward declaration

    // f is fast, but calls y twice
    void f(int x) { y(x); x += 1; y(x); }

    // y is slow
    void y(int z) {
        for ( ; z >= 0; z-- )
            ;                      // busy loop, just burns time
    }

    // main has a loop, calls f and y
    int main(void) {
        int i;
        for (i = 0; i < 10000; i++) {
            f(i);
            y(i);
        }
        return 0;
    }
Output:
        %   cumulative   self              self     total
      time   seconds   seconds    calls  us/call  us/call  name
    101.15      0.62      0.62    30000    20.57    20.57  y
      0.00      0.62      0.00    10000     0.00    41.13  f
As we can see, this is pretty similar to the output from the pstats command line browser. You can also use gprof2dot on gprof's output, as well as lots of other tools.
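For reference, a rough sketch of the whole gprof workflow (file names are examples, and flags may differ slightly between compilers):

    $ gcc -pg example.c -o example     # compile with profiling support
    $ ./example                        # running it writes gmon.out
    $ gprof example gmon.out           # text report, similar to the pstats browser
    $ gprof example gmon.out | python gprof2dot.py | dot -Tpng | display    # call graph picture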
Everything in this talk uses deterministic profiling, and it is probably the main thing you will use. However, you should know that there is a wide variety of profiling techniques, including some serious tools for dynamic program analysis. If you ever have a program with mainly small, fast function calls, consider statistical (stochastic) profiling.
To profile something from the Python shell, or only one function within a program:
    import cProfile
    cProfile.run('func()', 'filename.out')
IPython has a shortcut for running this. I would usually save the profile to a file and visualize it with gprof2dot.py.
    %prun [-s sort-key] [-D filename.out] [statement]
These tools can make and print the text-based profile all in one go. Perhaps that is useful sometimes from the command line for quick things. For big things, I'd generally prefer to make and save to a file for further analysis.
oprofile is a neat kernel-based profiler. It can profile everything on your system, and make line-based profiles. (Example line profile and summary report)
Memory profiling is tricky in Python. Since objects have shared ownership, you can't tie them to specific locations in code so easily. I have rarely needed to use memory profiling in Python.
Reference: http://se.mathworks.com/help/matlab/ref/profile.html
Tutorial: http://se.mathworks.com/help/matlab/matlab_prog/profiling-for-improving-performance.html
Example:
    profile on
    % code to be profiled here
    profile viewer                 % stop profiler, view results
    p = profile('info');
    profsave(p, 'profile.html')
There is a package line_profiler: https://github.com/rkern/line_profiler
There is a lot of overhead, so you must specify which functions to profile!
Run program with kernprof.py
    $ kernprof -l program.py
Decorate functions to profile
    @profile
    def bottleneck_function(...):
        ....
        ....
IPython: %lprun magic command.
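A sketch of its use (the function and arguments here are hypothetical; -f selects which function to line-profile):

    %lprun -f bottleneck_function bottleneck_function(arg1, arg2)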
Using RAM takes time: reducing memory usage can improve performance (and let you scale up to larger problems).
Heapy:
    from guppy import hpy
    h = hpy()
    print h.heap()
memory_profiler
    $ python -m memory_profiler example.py
    @profile
    def bottleneck_function(...):
        ....
        ....