This page uses MathML, if this does not appear as "n squared" then it does not work in your browser: ${n}^{2}$

- Algorithms: What and why?
- Basic meaning of
`O()`notation and why it is important. - Properties of different data structures

**Algorithm**: A series of steps to complete some process.- There can be different algorithms that produce the same result.
- All computational research involves calculating some results via algorithms.
- You could argue that this is
**the**core Computer Science topic, and I can**not**do it justice.- This is a brief intro for scientific programmers.

- Once I was given a program in C, a "fast" language.
- I re-wrote it in Python, a "slow" language.
- My Python code was faster, for large networks.

Why was this? The Python code used better algorithms.

The exact situation was that the C code used a linear search to do a weighted random choice. This meant that the C code saw four times as slow for a network of twice the size. In my code, I used a binary search I already had written, and doubling the network size makes the problem $2*log\left(2\right)$ times slower. That C code doesn't stand a chance.

This also nicely illustrates a benefit of high-level languages. It is easier to use good algorithms.

- Methods of storing a graph with $N$ nodes:

- Adjacency matrix: $N\times N$ matrix.
- List of lists: $N$ lists of neighbors.

- How long does it take to compute the number of edges in the graph?
- Matrix: have to look at ${N}^{2}$ elements.
- List of lists: have to take
`len()`of $N$ lists, which takes time $c\phantom{\rule{0ex}{0ex}}timesN$.

- The lists of lists is clearly much faster,
*for this problem*.

How do we express the "speeds" of algorithms?

**Big-O notation**: used to classify algorithms by how they respond (processing time or memory requirements) to changes in input size.- "Time $O\left({N}^{2}\right)$" means "time to run is proportional to ${N}^{2}$".
- Double problem size, four times as long to run.
- $N$ can be different parameters, e.g. array size, number of records, number of atoms.
- Different variables can be combined: $O(N\times m)$

- Important since scientists tend to want to process bigger data (data size: $N$).
- Algorithmic analysis does not care about constant factors.

You've probably seen this before. This is a quick formalization and review.

There is not just $O\left(\right)$, but also $o\left(\right)$, $\Theta \left(\right)$, $\Omega \left(\right)$, and $\omega \left(\right)$. These all represent different degrees of bounding. For the purposes of this talk, I am ignoring these differences.

I say that we don't care about constant factors. They

areimportant for optimization, but generally you want to first find the best algorithm, and then optimize constant factors. For small problems, a higher complexity algorithm with smaller constant is better.

- Simple examples of short code lines.
*Not*a formal analysis.- Ignoring questions of when algorithms terminate.
- Again, this is
*not*a complete picture.

1 2 | ```
for i in range(N):
pass
``` |

This is $O\left(N\right)$.

How to calculate big-O: multiply sizes of all loops and the inner statements.

passis a single statement (that does nothing), so is O(1), the best possible.

for i in range(N):does the loopNtimes.If $N$ doubles, the amount of work we have to do also doubles.

1 2 3 | ```
for i in range(N):
for j in range(N):
pass
``` |

This is $O\left({N}^{2}\right)$.

Hopefully this is clear. The total number of lines that execute here is $constant\times {N}^{2}$.

Suppose $func\left(L\right)$ is $O\left(L\right)$:

1 2 3 | ```
for i in range(N):
for j in range(M):
func(L)
``` |

This is $O\left(NML\right)$.

You can't forget the time complexity of the functions you call.

Suppose `func(N)` is $O\left(N\right)$

1 2 | ```
for i in range(N):
func(N)
``` |

This is $O\left({N}^{2}\right)$.

Why? First and second lines combine to give a time of $1+2+\cdots +N=\frac{N(N-1)}{2}=O\left({N}^{2}\right)$.

- Figuring $O\left(\right)$ is good, but what's the difference between a good and bad algorithm?

1 2 3 | ```
for i in range(len(data)):
if data[i] == value:
return i
``` |

- This is $O\left(\text{len}\right(\text{data}\left)\right)$
- But most lines do nothing! Ideally we could short-circuit and return the right index directly!

I'm not getting into formal algorithmic analysis, but here is when more formal theory could be helpful. For each problem, there is some theoretical minimum amount of work that could be done. Some algorithms are less efficient than that.

In the case above, see compare every element in a list to

value, but just return one. All of those needless comparisons could be avoided if we could filter down candidates somehow.

1 2 3 | ```
for i in range(len(data)): # O(N)
if data[i] != 0:
f(data[i])
``` |

In this case, we do $O\left(\text{len}\right(\text{data}\left)\right)$ operations, but the important part could be called much less often if

`data`is sparse.1 2

for dat in nonzero_data: # O(actual_data) f(dat)

In this case, one should keep track of important elements separately, if

`data`will be mostly zeros.

- Memory complexity judges the amount of extra space needed for an algorithm.
- Example: Graph adjacency matrix is $O\left({N}^{2}\right)$.

1 2 | ```
range(N) # allocates O(N) immediately
xrange(N) # allocates O(1)
``` |

*Think*about time and memory complexity when you write things.- I haven't taught you how to write algorithms.
- Just know where to look for slow algorithms. You can come back to them if needed, and maybe ask someone for ideas.

- In practice, do your best to make things $O\left(\text{size of data}\right)$
- Recursion can greatly increase complexity.
- "Big data" has
*extremely*clever algorithms for complexity reduction.- They can make anything $O\left(N\right)$!

I haven't even come close to giving you an understanding of the field of analysis of algorithms. A CS person would be ashamed. However, the main point I am trying to make is

thinkabout what you are doing. When you identify something slow, you can

- See if you can figure out something better.
- Remember to come back to it.
- Ask someone for help, or search for better algorithms yourself.
- When you do come back to algorithms, it will be much easier.
For an example of a clever big data algorithm, see one of my last examples.

The practical portion of this talk

- Data structures are specific arrangements of data in memory.
- Arrangements allow low-complexity operations on the data.
- Key point:
**Using data structures properly is the most important way to have fast code**.- They package optimal algorithms for you so that you don't have to know about them.

- Memory/time tradeoff: Using more memory often can mean faster.

1 2 3 | ```
lst = range(10000)
lst.append(5) # time complexity O(1)
list.insert(0, 5) # time complexity O(N)
``` |

If your list is big, you do **not** want to be doing the second one!

The point of the second half of this talk is to understand the property of data structures, so that you don't accidentally be doing things like the second one, adding unneeded factors of $O\left(N\right)$ (or more) to your code!

- Let us do a quick example of lists and sets: Time to find an "average" element.

1 2 3 4 5 6 | ```
n = 100
L = range(n)
S = set(L)
%timeit n//2 in L
%timeit n//2 in S
``` |

Actual time used:

n=1 | n=10 | n=100 | n=1000 | |
---|---|---|---|---|

list | 181ns | 289ns | 1270ns | 11000ns |

set | 202ns | 202ns | 203ns | 235ns |

We see that sets take about the same of time to use theinoperator, regardless of size. For lists, it scales with $N$. Clearly, if we want to analyze big data, we want to be using sets!

Rest of talk: data structures and complexities.

- Efficient use of these is key
- Python full story: https://wiki.python.org/moin/TimeComplexity

Data layout: resizable linear array of $N$ elements.

*Front operations:*`.append(...)`and`del[-1]`: $O\left(1\right)$*Back operations:*`.insert(0, ...)`and`del[0]`: $O\left(N\right)$*Indexing:*lst[...]: $O\left(1\right)$

- Same as list, but is immutable.
- More memory efficient, especially if creating/destroying often.

- Underlying data structure: hash table

Lookups are an $O\left(1\right)$ operation!

1 2 def get(x): return hash_table[ hash(x) % len(hash_table) ]

*Insertions:*`d[k] = v`$O\left(1\right)$*Deletions:*`del d[k]`$O\left(1\right)$*Lookups:*`d[k]`$O\left(1\right)$*Contains:*:`k in d`$O\left(1\right)$*Size:*:`len(d)`$O\left(1\right)$- There is no ordering.
- Greater memory use than lists (but still $O\left(N\right)$)

Basically, all operations here is $O\left(1\right)$. `dict`s trade extra
memory for fastest lookups and modification.

Hash tables have been called "one of the most important data structures known". When studying big data algorithms, most somehow use hash tables to turn a $O\left({N}^{k}\right)$ operation into $O\left(N\right)$. Their properties seem almost magical.

- Same storage as dictionary, but no values.
- Insert/delete/contains also $O\left(1\right)$.
- Optimal
`intersection`and`union`operations: $O\left(\text{min}\right(N,M\left)\right)$ and $O(N+M)$

- Linear array of values
*of the same type*. - Inserting at beginning is $O\left(N\right)$.
- Resizing is $O\left(1\right)$ but not recommended.
- Fast vector operations:
`+`,`*`,`numpy.add`, etc.- $O\left(N\right)$ which is optimal.

- Linked list - fast
`d.appendleft()`and`d.popleft()`. Can't index middle. - Python:`collections.deque` - Heap - list which is cleverly sorted.
- Python:
`collections.hepaq` - Trie and DAWG

I'm not saying to write things yourself: use libraries

- Worst case performance
- Best case performance
- Amortized worst case performance

- Example: my Python code from the start
- My code was fast in most cases, but when $\beta $ became large it got slow. Same $N$.

- You could leave problem size the same, but vary other parameters...
- ... and everything slows down greatly.
- In your parameter space, you transitioned to a different complexity.

- Algorithmic optimization provides world-changing improvements.
- Once you have the best algorithm, tricks to speed it up are
optimization.
- This is language and domain specific,
*not covered here*.

- This is language and domain specific,
- Most important part is good algorithms and clean code.
*My personal philosophy is "use the best possible algorithm, then optimization usually isn't needed". :math:`O(N)` is fast enough.*

- Book: "Mining of massive datasets", http://mmds.org/
- Use hash tables to transform $O\left({N}^{x}\right)$ operations to $O\left(N\right)$ operations
- Example: locality-sensitive hashing and comparing books
- Finding similar books appears to be an $O\left({N}^{2}\right)$ operation (you are comparing every pair)
- Hash functions which tend to put similar documents in same bin.
- Combine them to magnify effect, reduce number of pairs to check.

- There is a standard CS course "Algorithms and data structures"
- Probably too abstract for most general scientific programmers.

- "Hard" problems

- Algorithms are the key to making good programs
- Consider complexity of functions you call and write
- Try to use something optimal, and search for better option if it seems bad.