# Algorithms and data structures

This page uses MathML, if this does not appear as "n squared" then it does not work in your browser: ${n}^{2}$

# Outline

• Algorithms: What and why?
• Basic meaning of O() notation and why it is important.
• Properties of different data structures

# What are algorithms?

• Algorithm: A series of steps to complete some process.
• There can be different algorithms that produce the same result.
• All computational research involves calculating some results via algorithms.
• You could argue that this is the core Computer Science topic, and I can not do it justice.
• This is a brief intro for scientific programmers.

# Why are algorithms important?

• Once I was given a program in C, a "fast" language.
• I re-wrote it in Python, a "slow" language.
• My Python code was faster, for large networks.

The exact situation was that the C code used a linear search to do a weighted random choice. This meant that the C code saw four times as slow for a network of twice the size. In my code, I used a binary search I already had written, and doubling the network size makes the problem $2*log\left(2\right)$ times slower. That C code doesn't stand a chance.

This also nicely illustrates a benefit of high-level languages. It is easier to use good algorithms.

# Another example: graph representations

• Methods of storing a graph with $N$ nodes:
• Adjacency matrix: $N×N$ matrix.
• List of lists: $N$ lists of neighbors.
• How long does it take to compute the number of edges in the graph?
• Matrix: have to look at ${N}^{2}$ elements.
• List of lists: have to take len() of $N$ lists, which takes time $c\phantom{\rule{0ex}{0ex}}timesN$.
• The lists of lists is clearly much faster, for this problem.

# Big-O notation

How do we express the "speeds" of algorithms?

• Big-O notation: used to classify algorithms by how they respond (processing time or memory requirements) to changes in input size.
• "Time $O\left({N}^{2}\right)$" means "time to run is proportional to ${N}^{2}$".
• Double problem size, four times as long to run.
• $N$ can be different parameters, e.g. array size, number of records, number of atoms.
• Different variables can be combined: $O\left(N×m\right)$
• Important since scientists tend to want to process bigger data (data size: $N$).
• Algorithmic analysis does not care about constant factors.

You've probably seen this before. This is a quick formalization and review.

There is not just $O\left(\right)$, but also $o\left(\right)$, $\Theta \left(\right)$, $\Omega \left(\right)$, and $\omega \left(\right)$. These all represent different degrees of bounding. For the purposes of this talk, I am ignoring these differences.

I say that we don't care about constant factors. They are important for optimization, but generally you want to first find the best algorithm, and then optimize constant factors. For small problems, a higher complexity algorithm with smaller constant is better.

# How am I going to explain this?

• Simple examples of short code lines.
• Not a formal analysis.
• Ignoring questions of when algorithms terminate.
• Again, this is not a complete picture.

# Example 1: trivial

 ```1 2``` ```for i in range(N): pass ```

This is $O\left(N\right)$.

How to calculate big-O: multiply sizes of all loops and the inner statements.

pass is a single statement (that does nothing), so is O(1), the best possible.

for i in range(N): does the loop N times.

If $N$ doubles, the amount of work we have to do also doubles.

# Example 2: nested

 ```1 2 3``` ```for i in range(N): for j in range(N): pass ```

This is $O\left({N}^{2}\right)$.

Hopefully this is clear. The total number of lines that execute here is $constant×{N}^{2}$.

# Example 3: consider complexity of functions called

Suppose $func\left(L\right)$ is $O\left(L\right)$:

 ```1 2 3``` ```for i in range(N): for j in range(M): func(L) ```

This is $O\left(NML\right)$.

You can't forget the time complexity of the functions you call.

# Example 4: more called functions

Suppose func(N) is $O\left(N\right)$

 ```1 2``` ```for i in range(N): func(N) ```

This is $O\left({N}^{2}\right)$.

Why? First and second lines combine to give a time of $1+2+\cdots +N=\frac{N\left(N-1\right)}{2}=O\left({N}^{2}\right)$.

# How is complexity reduced?

• Figuring $O\left(\right)$ is good, but what's the difference between a good and bad algorithm?
 ```1 2 3``` ```for i in range(len(data)): if data[i] == value: return i ```
• This is $O\left(\text{len}\left(\text{data}\right)\right)$
• But most lines do nothing! Ideally we could short-circuit and return the right index directly!

I'm not getting into formal algorithmic analysis, but here is when more formal theory could be helpful. For each problem, there is some theoretical minimum amount of work that could be done. Some algorithms are less efficient than that.

In the case above, see compare every element in a list to value, but just return one. All of those needless comparisons could be avoided if we could filter down candidates somehow.

# Complexity reduction 2

 ```1 2 3``` ```for i in range(len(data)): # O(N) if data[i] != 0: f(data[i]) ```
• In this case, we do $O\left(\text{len}\left(\text{data}\right)\right)$ operations, but the important part could be called much less often if data is sparse.

 ```1 2``` ```for dat in nonzero_data: # O(actual_data) f(dat) ```
• In this case, one should keep track of important elements separately, if data will be mostly zeros.

# Memory complexity

• Memory complexity judges the amount of extra space needed for an algorithm.
• Example: Graph adjacency matrix is $O\left({N}^{2}\right)$.
 ```1 2``` ```range(N) # allocates O(N) immediately xrange(N) # allocates O(1) ```

# Algorithms: summary

• Think about time and memory complexity when you write things.
• I haven't taught you how to write algorithms.
• Just know where to look for slow algorithms. You can come back to them if needed, and maybe ask someone for ideas.
• In practice, do your best to make things $O\left(\text{size of data}\right)$
• Recursion can greatly increase complexity.
• "Big data" has extremely clever algorithms for complexity reduction.
• They can make anything $O\left(N\right)$!

I haven't even come close to giving you an understanding of the field of analysis of algorithms. A CS person would be ashamed. However, the main point I am trying to make is think about what you are doing. When you identify something slow, you can

• See if you can figure out something better.
• Remember to come back to it.
• Ask someone for help, or search for better algorithms yourself.
• When you do come back to algorithms, it will be much easier.

For an example of a clever big data algorithm, see one of my last examples.

# Data structures

The practical portion of this talk

• Data structures are specific arrangements of data in memory.
• Arrangements allow low-complexity operations on the data.
• Key point: Using data structures properly is the most important way to have fast code.
• They package optimal algorithms for you so that you don't have to know about them.
• Memory/time tradeoff: Using more memory often can mean faster.

# Why are data structures important? Python list insertion

 ```1 2 3``` ```lst = range(10000) lst.append(5) # time complexity O(1) list.insert(0, 5) # time complexity O(N) ```

If your list is big, you do not want to be doing the second one!

The point of the second half of this talk is to understand the property of data structures, so that you don't accidentally be doing things like the second one, adding unneeded factors of $O\left(N\right)$ (or more) to your code!

# Example 2: Time complexity of lists and sets

• Let us do a quick example of lists and sets: Time to find an "average" element.
 ```1 2 3 4 5 6``` ```n = 100 L = range(n) S = set(L) %timeit n//2 in L %timeit n//2 in S ```

Actual time used:

n=1 n=10 n=100 n=1000
list 181ns 289ns 1270ns 11000ns
set 202ns 202ns 203ns 235ns
We see that sets take about the same of time to use the in operator, regardless of size. For lists, it scales with $N$. Clearly, if we want to analyze big data, we want to be using sets!

# Time complexity of typical data structures

Rest of talk: data structures and complexities.

# Dynamic heterogeneous array (Python: list)

Data layout: resizable linear array of $N$ elements.

• Front operations: .append(...) and del[-1]: $O\left(1\right)$
• Back operations: .insert(0, ...) and del[0]: $O\left(N\right)$
• Indexing: lst[...]: $O\left(1\right)$

# Python tuple

• Same as list, but is immutable.
• More memory efficient, especially if creating/destroying often.

# Hash table mapping (Python: dict)

• Underlying data structure: hash table
• Lookups are an $O\left(1\right)$ operation!

 ```1 2``` ```def get(x): return hash_table[ hash(x) % len(hash_table) ] ```
• Insertions: d[k] = v $O\left(1\right)$
• Deletions: del d[k] $O\left(1\right)$
• Lookups: d[k] $O\left(1\right)$
• Contains:: k in d $O\left(1\right)$
• Size:: len(d) $O\left(1\right)$
• There is no ordering.
• Greater memory use than lists (but still $O\left(N\right)$)

Basically, all operations here is $O\left(1\right)$. dicts trade extra memory for fastest lookups and modification.

Hash tables have been called "one of the most important data structures known". When studying big data algorithms, most somehow use hash tables to turn a $O\left({N}^{k}\right)$ operation into $O\left(N\right)$. Their properties seem almost magical.

# Hash table set (Python: set)

• Same storage as dictionary, but no values.
• Insert/delete/contains also $O\left(1\right)$.
• Optimal intersection and union operations: $O\left(\text{min}\left(N,M\right)\right)$ and $O\left(N+M\right)$

# Native arrays (Python: numpy arrays)

• Linear array of values of the same type.
• Inserting at beginning is $O\left(N\right)$.
• Resizing is $O\left(1\right)$ but not recommended.
• Fast vector operations: +, *, numpy.add, etc.
• $O\left(N\right)$ which is optimal.

# Other useful data structures

• Linked list - fast d.appendleft() and d.popleft(). Can't index middle. - Python: collections.deque
• Heap - list which is cleverly sorted. - Python: collections.hepaq
• Trie and DAWG

I'm not saying to write things yourself: use libraries

# Other complexities

• Worst case performance
• Best case performance
• Amortized worst case performance

# Transitions in complexity

• Example: my Python code from the start
• My code was fast in most cases, but when $\beta$ became large it got slow. Same $N$.
• You could leave problem size the same, but vary other parameters...
• ... and everything slows down greatly.
• In your parameter space, you transitioned to a different complexity.

# Algorithms vs optimization

• Algorithmic optimization provides world-changing improvements.
• Once you have the best algorithm, tricks to speed it up are optimization.
• This is language and domain specific, not covered here.
• Most important part is good algorithms and clean code.
• My personal philosophy is "use the best possible algorithm, then optimization usually isn't needed". :math:`O(N)` is fast enough.

# "Big data algorithms"

• Book: "Mining of massive datasets", http://mmds.org/
• Use hash tables to transform $O\left({N}^{x}\right)$ operations to $O\left(N\right)$ operations
• Example: locality-sensitive hashing and comparing books
• Finding similar books appears to be an $O\left({N}^{2}\right)$ operation (you are comparing every pair)
• Hash functions which tend to put similar documents in same bin.
• Combine them to magnify effect, reduce number of pairs to check.

# What next?

• There is a standard CS course "Algorithms and data structures"
• Probably too abstract for most general scientific programmers.
• "Hard" problems

# Conclusions

• Algorithms are the key to making good programs
• Consider complexity of functions you call and write
• Try to use something optimal, and search for better option if it seems bad.