Scientists "know" that faster programs is better. But what does this mean?
First, you will usually spend more time writing programs than running them. All of the previous talks optimize writing, debugging, collaborating, and maintaining programs, and that is the most important thing you can do.
But as for optimizing program speed, there is one saying you see everywhere: premature optimization is the root of all evil. This means that first you should make a good program, then profile it to see where the slow parts are, and then optimize just those parts. This talk is about profiling: the process of systematically examining program execution times to find the bottlenecks.
There is a famous saying:
Premature optimization is the root of all evil - Donald Knuth, 1974
What this saying means is that you should not try to optimize your program as you first write it. Programs are very complex interactions between processor architectures, memory, multiple layers of caches, compilers, and the instructions themselves. If you try to guess where the slow part is, you will probably get it wrong and end up wasting your time. Worse yet, you will end up with unreadable and unmaintainable optimized code, and work slower in the future.
Wikipedia: Profiling is dynamic program analysis of time, space, or other resource usage.
What you can't see, you can't improve. Just like in science.
On Linux, anything can be run using the time shell utility:

    $ time ./research-code -x 1 InFile1

    real    16m37.055s
    user    6m13.786s
    sys     1m25.012s
This tells you how much total time the program took: real is the wall-clock time, user is CPU time spent in the program itself, and sys is CPU time spent in system calls.
We have no idea how that time was divided up in the program. Can we do better?
We find that one slow branch takes most of the time. To optimize, we focus only on that branch.
This is a directed graph showing the flow of time through the program. At the top is the entry point, and looking down you see the flow of time (as a fraction of the total) distributed among all functions. Colors serve as a visual aid to see where the hot spots are. Arrows point from callers to callees and include the number of times called.
This is not "a profile". This is one representation of a profile.
Here, you see two call graphs: before and after optimization.
Profile sources: before and after.
Profile source: profile-external.out
On the left of this figure, we see various external community detection methods running using the subprocess module.
Remember, the whole point of this is that you should write good programs first, and then profile. Of course, sometimes you will profile during development, but don't go crazy sacrificing readability for optimizations: chances are that code will be refactored away later anyway.
To do the most basic C profiling, we need to compile with special options in order to instrument the code.
Compile with the -g -pg options.
Run code normally: ./a.out
A file gmon.out is created with profiling information
Examine timings with gprof
    $ gprof a.out gmon.out
In the first step, we run the program under the cProfile module. This just stores the profile information; we will examine it in the next step.
Run your script under control of the cProfile module:
    $ python -m cProfile -o profile.out SCRIPT.py arg1 arg2 ...
python -m cProfile: Run library module cProfile as a script.
-o profile.out: Tells cProfile to write the profile to the file profile.out.
SCRIPT.py arg1 arg2 ...: Your normal python interpreter arguments.
The output profile.out contains details of all function calls and times.
The next step is to visualize or analyze the data in profile.out.
I personally find it better to first run and store the profile in profile.out, and then visualize it. If the call graph is not useful enough, I can visualize it again with different options, or examine it on the command line for more detail. Also, if I make a change, I can compare the new and old profiles to see how it affected things. This is important!
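For example, a minimal sketch of comparing two saved profiles with the pstats module (the file names old.out and new.out here are hypothetical):

    import pstats

    # Load the saved profiles and print the top functions from each,
    # so the before/after timings can be compared side by side.
    for name in ("old.out", "new.out"):
        print(name)
        stats = pstats.Stats(name)
        stats.strip_dirs().sort_stats("cumulative").print_stats(10)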
python -m MODULE is the same as "python /path/to/the/MODULE.py". It is a common shortcut.
This step works on any platform.
Sample output: profile-growsf.out
There is a simple interactive browser for the profile information:

    $ python -m pstats profile.out

Key commands to run: strip, sort cumulative, and stats 15 (explained below).
First, we just need to run some code and collect profiling information. Then we will process it to get useful information. The first examples will be console based, not graphical like the ones shown above. A console-based profile shows the same information and works well on remote machines like Triton.
The examples are in the /triton/scip/profile directory.
For C, you need to compile your code with an option that enables profile generation. This adds extra overhead, so it is not enabled by default. The flag for gcc is -pg.
    $ gcc -g -pg c-profiling.c
Run your code. For C, just run it like normal and you will get a gmon.out file that has the profiling data.
    $ ./a.out
For Python, you don't need to recompile, but you need to tell a profiler to trace everything. We do this by running under the cProfile module and saving the output to profile.out.
    $ python -m cProfile -o profile.out pi.py
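If you only want to profile one part of a program, you can also call cProfile from inside Python. A minimal sketch, where the function work() is made up for illustration:

    import cProfile

    def work():
        # Stand-in for the computation you actually want to profile.
        return sum(i * i for i in range(10**6))

    # Run the call under the profiler and save the data to profile.out,
    # the same format that "python -m cProfile -o profile.out" produces.
    cProfile.run("work()", "profile.out")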
You should now either have a gmon.out or profile.out file in this directory. This contains all your profiling information. Next we just have to look inside.
We look at the profiling information in two ways. For C, you use gprof to do this and for Python you use the pstats module.
C: We use the gprof utility to study the output. Run it like this:
    $ gprof a.out gmon.out
It directly prints out a summary report: first a flat profile of time per function, then a call graph. You may need to scroll up some in order to see the important timing parts.
Python: There is an interactive profile viewer in the pstats module. Open it using -m pstats, and then execute the commands strip, sort cumulative, and stats 15 to print the data. You can use different sort orders, such as sort time.
    $ python -m pstats profile.out
    strip
    sort cumulative
    stats 15
You can make nice call graphs like you saw earlier. This is done via the gprof2dot.py script, which is already in the scip/profiling/ folder. This makes a pretty visualization, and it works on headless machines where you don't have a graphical environment; you then have to copy (or sshfs) the image to your own computer to view it. Usually, I first look at the text-based profiles, and then make the call graph pictures a few times to get a big overview.
Python:
    $ python gprof2dot.py -f pstats profile.out | dot -Tpng > pi-prof-1.png
C:
    $ gprof a.out gmon.out | python gprof2dot.py | dot -Tpng > profile.png
There are also tools for line-based, instead of function-based, profiling. However, because of the overhead it adds in Python, it is not commonly used there.
I don't have magic suggestions on how to improve things. After seeing enough profiles, and a future optimization talk, you will gain intuition for it. Most importantly, by examining profiles before and after your changes, you will know what works and what doesn't.
In this set of exercises, we will go through the optimization of a program. We will look at the profile, see where the slow parts are, improve those parts, and see how the profile changes.
There is a Python program for calculating π using a simple Monte Carlo method at /triton/scip/profile/pi.py. Copy this to your working directory. This code takes one command line argument, the number of iterations. Run the code in the console, using just the time command line utility to see how long it takes.
    $ time python pi.py 1000000
    3.141936
    time elapsed: 1.6 per iteration: 1.6e-06

    real    0m1.634s
    user    0m1.579s
    sys     0m0.055s
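For reference, here is a rough sketch of how such a Monte Carlo estimator might look. This is illustrative only; the actual pi.py may differ in its details:

    import random
    import sys

    def get_coords(n):
        # Draw n random points in the unit square, one pair at a time.
        xs = [random.random() for _ in range(n)]
        ys = [random.random() for _ in range(n)]
        return xs, ys

    def circle_count(xs, ys):
        # Count how many points fall inside the quarter circle of radius 1.
        return sum(1 for x, y in zip(xs, ys) if x * x + y * y <= 1.0)

    if __name__ == "__main__":
        N = int(sys.argv[1])          # number of iterations
        x, y = get_coords(N)
        # The quarter circle covers pi/4 of the unit square.
        print(4.0 * circle_count(x, y) / N)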
Run this code in the shell and generate the profile pi-profile-1.out (note the number 1):
    $ python -m cProfile -o pi-profile-1.out pi.py 1000000
Write down the total time.
Use pstats to try to figure out what functions take the most time:
    $ python -m pstats pi-profile-1.out
    strip
    sort time
    stats 5
You see that get_coords takes up most of the time. Then sort by cumulative time and see what the difference is:
    sort cumulative
    stats 5
Bonus/optional: Generate a gprof2dot picture of this for future reference. Copy gprof2dot from the /triton/scip/profile directory.
    $ python gprof2dot.py -f pstats pi-profile-1.out | dot -Tpng > pi-prof-1.png
We saw that the function get_coords took most of the time. Luckily, I have already programmed a new version of this, called get_coords_numpy. Change the call of get_coords to get_coords_numpy:
    x, y = get_coords(N)  --->  x, y = get_coords_numpy(N)
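As an illustrative sketch only (the provided pi.py contains the real implementation), the numpy version might look roughly like this:

    import numpy as np

    def get_coords_numpy(n):
        # Vectorized: generate all n coordinates with two array calls
        # instead of 2*n calls to random.random() in a Python loop.
        return np.random.random(n), np.random.random(n)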
Re-run the profile, saving as pi-profile-2.out. View it in the pstats viewer. You see that get_coords is so fast that it doesn't show up anymore. Now, the function circle_count takes up most of the time.
Bonus: In fact, you notice that the total time of circle_count went up. Why do you think that is?
Bonus: Make another gprof2dot picture of this for future reference.
    $ python gprof2dot.py -f pstats pi-profile-2.out | dot -Tpng > pi-prof-2.png
Repeat all of the same things as above, but replace the function circle_count with the function circle_count_numpy since this is the new bottleneck. Remember to save as pi-profile-3.out. How much faster do things get?
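Again for illustration only, a vectorized circle_count could look something like:

    import numpy as np

    def circle_count_numpy(x, y):
        # Test all points against the circle at once and sum the boolean array.
        return int(np.sum(x * x + y * y <= 1.0))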
Bonus: Save a gprof2dot image again.
    $ python gprof2dot.py -f pstats pi-profile-3.out | dot -Tpng > pi-prof-3.png
Now that we are done, we will look at all the gprof2dot pictures we have made. This allows us to quickly compare the different runs.
For convenience, my own copy of the pictures is here:
What is the slowest in #1? What is the difference between #1 and #2?
What is the slowest function in #2? What is the difference between #2 and #3?
gprof2dot applies a threshold and only shows nodes with more than 0.5% of the total time. Why do lots of other functions start appearing in #3? Are these a concern in practice?
Everything in this talk uses instrumenting profiling, and that is probably the main thing you will use. However, you should know that there is a wide variety of profiling techniques, including some serious tools for dynamic program analysis. If you ever have a program with mainly small, fast function calls, consider stochastic profiling.
MATLAB has a built-in profiler. How do the following profiler commands interact?
Reference: http://se.mathworks.com/help/matlab/ref/profile.html
Tutorial: http://se.mathworks.com/help/matlab/matlab_prog/profiling-for-improving-performance.html
Example:
    profile on
    % code to be profiled here
    profile viewer            % stop profiler, view it
    p = profile('info');
    profsave(p, 'profile.html')
There is a package line_profiler: https://github.com/rkern/line_profiler
There is a lot of overhead, so you must specify which functions to profile!
Run the program with kernprof:
    $ kernprof -l program.py
Decorate functions to profile
    @profile
    def bottleneck_function(...):
        ...
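kernprof -l saves the timings to program.py.lprof; you can then view the line-by-line report with python -m line_profiler program.py.lprof.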
IPython: %lprun magic command.
Using a lot of RAM takes time, so reducing memory usage can improve performance (and lets you scale up).
Heapy, from the guppy package, shows what objects are on the heap:
    from guppy import hpy
    h = hpy()
    print(h.heap())   # print a breakdown of objects on the heap
memory_profiler
    $ python -m memory_profiler example.py
    @profile
    def bottleneck_function(...):
        ...
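When run this way, memory_profiler prints a line-by-line report of memory usage for each decorated function.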