Profiling software

Scientists "know" that faster programs is better. But what does this mean?

First, you will usually be spending more time writing programs than running them. All the previous talks optimize writing, debugging, collaborating, and maintaining programs and is the most important thing you can do.

But as for optimizing program speed, there is one saying you see everywhere: Premature optimization is the root of all evil. This means, first you want to make a good program, then profile it to see where the slow parts are, then optimize just those parts. This talk is about profiling: the process of systematically examining program execution times to learn the bottlenecks

Profiling software

Outline

Profiling and optimizing

There is a famous saying:

Premature optimization is the root of all evil - Donald Knuth, 1974
What this saying means is that you should not try to optimize your program before writing it. Programs are very complex interactions between processor architectures, memory, multiple layers of caches, compilers, and the instructions themselves. If you try to guess where the slow part is, you will probably get it wrong and end up wasting your time. Worse yet, you will end up having unreadable and unmaintainable optimized code, and end up working slower in the future.

Profiling

Wikipedia: Profiling is dynamic program analysis of time, space, or other resource usage.

What you can't see, you can't improve. Just like in science.

Level 0 profiling: timing code

Some nomenclature

Example from my work

Example of gprof2dot profile.  Click for full image.

We find one slow branch that takes most time. To optimize, we focus only on that branch.

This is a directed graph showing the flow of time through the program. At the top is the entry point, and looking down you see the flow of time (in fraction of total time) distributed among all functions. Colors serve as a visual aid to see where the hot spots are. Arrows point from callers to callees, and include number of times called.

This is not "a profile". This is one representation of a profile.

Example: profile before and after optimization

Here, you see two call graphs: before and after optimization.

Call graph before optimizing Call graph after optimizing
Profile sources: before and after.

Example: calling external processes

Call graph after optimizing

Profile source: profile-external.out

On the left of this figure, we see various external community detection methods running using the subprocess module.

Example: using a library for work

Call graph after optimizing

How to collect the profiling data

Remember, the whole point of this is that you should write good programs first, and then profile. Of course, sometimes you will profile during development, but don't go crazy sacrificing readability for optimizations. Chances are that will be refactored out later anyway.

Collecting profiling information: C

To do the most basic C profiling, we need to compile with special options in order to instrument the code.

Collecting profiling information: Python

In the first step, we run the program using the cProfile module. This just stores up the profile information, and we will examine it in the next step.

The next step is to visualize or analyze the data in profile.out.

I personally prefer first running and storing the profile in profile.out, and then visualizing, to be better. If the call graph is not useful enough, I can visualize it again using different options or examine it using the command line for more details. Also, if I make a change, I can compare the new and old profiles to see how it affected things. This is important!

python -m MODULE is the same as "python /path/to/the/MODULE.py". It is a common shortcut.

This step works on any platform.

Sample output: profile-growsf.out

Viewing profile information: Python

What can be profiled?

Exercises: Hands-on profiling

Exercise Profiling-1.1: Running code

  1. First, we just need to run some code and collect profiling information. Then we will process it to get information. The first examples will be console based, not graphical like I showed you above. A console-based profile shows the same information and is good for computers like Triton.

    The examples are in the /triton/scip/profile directory.

  2. For C, you need to compile your code with an option to enable profile generation. This adds an extra overhead, so is not enabled by default. The flag is -pg for gcc.

    1
    $ gcc -g -pg c-profiling.c
    
  3. Run your code. For C, just run it like normal and you will get a gmon.out file that has the profiling data.

    1
    $ ./a.out
    

    For python, you don't need to recompile, but you need to tell a profile to trace everything. We do this by running with the cProfile module and saving the output to profile.out.

    1
    $ python -m cProfile -o profile.out pi.py
    
  4. You should now either have a gmon.out or profile.out file in this directory. This contains all your profiling information. Next we just have to look inside.

Exercise Profiling-1.2: Getting profile information

  1. We look at the profiling information in two ways. For C, you use gprof to do this and for Python you use the pstats module.

  2. C: We use the gprof utility to study the output. Run it like this:

    1
    $ gprof a.out gmon.out
    

    It directly prints out a summary report. You may need to scroll up some in order to see the important timing parts

  3. Python: There is a interactive profile viewer in the pstats module. Open it using -m pstats, and then execute the commands strip, sort cumulative, and stats 15 to print the data. You can use different

    1
    2
    3
    4
    $ python -m pstats profile.out
    strip
    sort cumulative
    stats 15
    

Exercise Profiling-1.3: Examining the output

  1. Now that we have the profiling information, let's examine what is in it.
  2. Which function takes the most time?
  3. Which function, when also considering sub-functions, take the most time?

Exercise Profiling-1.4: Bonus: More fancy visualization

  1. You can make nice call graphs like you saw earlier. This is done via the gprof2dot.py script, which is already in the scip/profiling/ folder. This is a pretty visualization, and works on headless machines where you don't have a graphical environment. You have to copy (or sshfs) it to your own computer to view it. Usually, I will first look at the text-based profiles, and then to get a big overview make the call graph pictures a few times.

    Python:

    1
    $ python gprof2dot.py -f pstats profile.out | dot -Tpng > pi-prof-1.png
    

    C:

    1
    gprof a.out gmon.out | python gprof2dot.py | dot -Tpng > profile.png
    

What to look for in a profile

There are also tools for line-based, instead of function-based, profiling. However, due to the overheads of Python it's not common there.

I don't have magic suggestions on how to improve things. After seeing enough profiles, and a future optimization talk, you will gain intuition on how to do things. Most importantly, by examining profiles before and after your changes, you will be in a position to know what works and what doesn't.

Optimizing

Exercise Profiling-2.1: Optimization

  1. In this set of exercises, we will go through the optimization of a program. We will look at the profile, see where the slow parts are, improve those parts, and see how the profile changes.

  2. There is a Python code for calculating π using a simple Monte Carlo method at /triton/scip/profile/pi.py. Copy this to your working directory. This code takes one command line argument, the number of iterations. Run the code in the console, just using the time command line utility to see how long it takes.

    1
    2
    3
    4
    5
    6
    7
    $ time python pi.py 1000000
    3.141936
    time elapsed: 1.6  per iteration: 1.6e-06
    
    real    0m1.634s
    user    0m1.579s
    sys     0m0.055s
    

Exercise Profiling-2.2: Profiling

  1. Run this code and generate a profile in the shell. and generate the pi-profile-1.out (note the number 1):

    1
    $ python -m cProfile -o pi-profile-1.out pi.py 1000000
    

    Write down the total time.

  2. Use pstats to try to figure out what functions take the most time:

    1
    2
    3
    4
    $ python -m pstats profile.out
    strip
    sort time
    stats 5
    

    You see that get_coords takes up most of the time. Then sort by cumulative time and see what the difference is:

    1
    2
    sort cumulative
    stats 5
    
  3. Bonus/optional: Generate a gprof2dot picture of this for future reference. Copy gprof2dot from the /triton/scip/profile directory.

    1
    $ python gprof2dot.py -f pstats pi-profile-1.out | dot -Tpng > pi-prof-1.png
    

Exercise Profiling-2.3: Improving based on the profile

  1. We saw that the function get_coords took most of the time. Luckily, I have already programmed a new version of this, called get_coords_numpy. Change the call of get_coords to get_coords_numpy:

    x, y = get_coords(N)    --->    x, y = get_coords_numpy(N)
    
  2. Re-run the profile, saving as pi-profile-2.out. View it in the pstats viewer. You see that get_coords is so fast that it doesn't show up anymore. Now, the function circle_count takes up most of the time.

  3. Bonus: In fact, you notice that the total time of circle_count went up. Why do you think that is?

  4. Bonus: Make another gprof2dot picture of this for future reference.

    1
    $ python gprof2dot.py -f pstats pi-profile-2.out | dot -Tpng > pi-prof-2.png
    

Exercise Profiling-2.4: More improvement

  1. Repeat all of the same things as above, but replace the function circle_count with the function circle_count_numpy since this is the new bottleneck. Remember to save as pi-profile-3.out. How much faster do things get?

  2. Bonus: Save a gprof2dot image again.

    1
    $ python gprof2dot.py -f pstats pi-profile-3.out | dot -Tpng > pi-prof-3.png
    

Exercise Profiling-2.4: What happened?

  1. Now that we are done, we will look at all the gprof2dot pictures we have made. This allows us to quickly compare the different runs.

    For convenience, my own copy of the pictures are here:

  2. What is the slowest in #1? What is the difference between #1 and #2?

  3. What is the slowest function in #2? What is the difference between #2 and #3?

  4. gprof2dot thresholds and only shows nodes with greater than 0.5% of the total time. Why do lots of other functions start appearing in #3? Are these a concern in practice?

Stochastic vs instrumenting profiling

Everything in this talk uses instrumenting profiling, and probably it is the main thing you will use. However, you should know that there is a wide variety of techniques behind profiling, including some serious tools for dynamic program analysis. If you ever have a program with mainly small, fast function calls, consider stochastic profiling.

How to read profiles

How do each of these profile items interact:

Advanced profiling techniques

Conclusions

The end

Resources

Matlab

Line profiling in Python

Memory profiling in Python