Algorithms and data structures

This page uses MathML, if this does not appear as "n squared" then it does not work in your browser: n2

Outline

What are algorithms?

Why are algorithms important?

Why was this? The Python code used better algorithms.

The exact situation was that the C code used a linear search to do a weighted random choice. This meant that the C code saw four times as slow for a network of twice the size. In my code, I used a binary search I already had written, and doubling the network size makes the problem 2*log(2) times slower. That C code doesn't stand a chance.

This also nicely illustrates a benefit of high-level languages. It is easier to use good algorithms.

Another example: graph representations

  • Adjacency matrix: N×N matrix.
  • List of lists: N lists of neighbors.

Big-O notation

How do we express the "speeds" of algorithms?

You've probably seen this before. This is a quick formalization and review.

There is not just O(), but also o(), Θ(), Ω(), and ω(). These all represent different degrees of bounding. For the purposes of this talk, I am ignoring these differences.

I say that we don't care about constant factors. They are important for optimization, but generally you want to first find the best algorithm, and then optimize constant factors. For small problems, a higher complexity algorithm with smaller constant is better.

How am I going to explain this?

Example 1: trivial

1
2
for i in range(N):
    pass

This is O(N).

How to calculate big-O: multiply sizes of all loops and the inner statements.

pass is a single statement (that does nothing), so is O(1), the best possible.

for i in range(N): does the loop N times.

If N doubles, the amount of work we have to do also doubles.

Example 2: nested

1
2
3
for i in range(N):
    for j in range(N):
        pass

This is O(N2).

Hopefully this is clear. The total number of lines that execute here is constant×N2.

Example 3: consider complexity of functions called

Suppose func(L) is O(L):

1
2
3
for i in range(N):
    for j in range(M):
        func(L)

This is O(NML).

You can't forget the time complexity of the functions you call.

Example 4: more called functions

Suppose func(N) is O(N)

1
2
for i in range(N):
    func(N)

This is O(N2).

Why? First and second lines combine to give a time of 1+2++N= N(N-1) 2=O(N2).

How is complexity reduced?

1
2
3
for i in range(len(data)):
    if data[i] == value:
        return i

I'm not getting into formal algorithmic analysis, but here is when more formal theory could be helpful. For each problem, there is some theoretical minimum amount of work that could be done. Some algorithms are less efficient than that.

In the case above, see compare every element in a list to value, but just return one. All of those needless comparisons could be avoided if we could filter down candidates somehow.

Complexity reduction 2

1
2
3
for i in range(len(data)):       # O(N)
    if data[i] != 0:
        f(data[i])

Memory complexity

1
2
range(N)        # allocates O(N) immediately
xrange(N)       # allocates O(1)

Algorithms: summary

I haven't even come close to giving you an understanding of the field of analysis of algorithms. A CS person would be ashamed. However, the main point I am trying to make is think about what you are doing. When you identify something slow, you can

  • See if you can figure out something better.
  • Remember to come back to it.
  • Ask someone for help, or search for better algorithms yourself.
  • When you do come back to algorithms, it will be much easier.

For an example of a clever big data algorithm, see one of my last examples.

Data structures

The practical portion of this talk

Why are data structures important? Python list insertion

1
2
3
lst = range(10000)
lst.append(5)        # time complexity O(1)
list.insert(0, 5)    # time complexity O(N)

If your list is big, you do not want to be doing the second one!

The point of the second half of this talk is to understand the property of data structures, so that you don't accidentally be doing things like the second one, adding unneeded factors of O(N) (or more) to your code!

Example 2: Time complexity of lists and sets

1
2
3
4
5
6
n = 100
L = range(n)
S = set(L)

%timeit n//2 in L
%timeit n//2 in S

Actual time used:

n=1 n=10 n=100 n=1000
list 181ns 289ns 1270ns 11000ns
set 202ns 202ns 203ns 235ns
We see that sets take about the same of time to use the in operator, regardless of size. For lists, it scales with N. Clearly, if we want to analyze big data, we want to be using sets!

Time complexity of typical data structures

Rest of talk: data structures and complexities.

Dynamic heterogeneous array (Python: list)

Data layout: resizable linear array of N elements.

Python tuple

Hash table mapping (Python: dict)

  • Lookups are an O(1) operation!

    1
    2
    def get(x):
        return hash_table[ hash(x) % len(hash_table) ]
    

Basically, all operations here is O(1). dicts trade extra memory for fastest lookups and modification.

Hash tables have been called "one of the most important data structures known". When studying big data algorithms, most somehow use hash tables to turn a O(Nk) operation into O(N). Their properties seem almost magical.

Hash table set (Python: set)

Native arrays (Python: numpy arrays)

Other useful data structures

I'm not saying to write things yourself: use libraries

Other complexities

Transitions in complexity

Algorithms vs optimization

"Big data algorithms"

What next?

Conclusions

Examples