Algorithms and data structures
This page uses MathML. If the following does not render as "n squared", MathML
does not work in your browser: n²
Outline
- Algorithms: What and why?
- Basic meaning of O() notation and why it is important.
- Properties of different data structures
What are algorithms?
- Algorithm: A series of steps to complete some process.
- There can be different algorithms that produce the same result.
- All computational research involves calculating some results via
algorithms.
- You could argue that this is the core Computer Science topic,
and I can not do it justice.
- This is a brief intro for scientific programmers.
Why are algorithms important?
- Once I was given a program in C, a "fast" language.
- I re-wrote it in Python, a "slow" language.
- My Python code was faster, for large networks.
Why was this? The Python code used better algorithms.
The exact situation was that the C code used a linear search to do
a weighted random choice. This meant the C code became four
times slower for a network of twice the size. In my code, I used
a binary search routine I had already written, so doubling the network
size makes the problem only slightly more than twice as slow. That C code
doesn't stand a chance.
This also nicely illustrates a benefit of high-level languages.
It is easier to use good algorithms.
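To make the difference concrete, here is a hedged sketch of the two
approaches (not the original programs; the function names and weights are
illustrative):
| import bisect
import random
from itertools import accumulate

def weighted_choice_linear(cumweights):
    # O(N) per draw: scan until the running total exceeds r
    r = random.uniform(0, cumweights[-1])
    for i, c in enumerate(cumweights):
        if r <= c:
            return i

def weighted_choice_bisect(cumweights):
    # O(log N) per draw: binary search in the cumulative weights
    r = random.uniform(0, cumweights[-1])
    return bisect.bisect_left(cumweights, r)

weights = [random.random() for _ in range(10000)]
cumweights = list(accumulate(weights))   # cumulative sums, built once in O(N)
print(weighted_choice_linear(cumweights))
print(weighted_choice_bisect(cumweights))
|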
Another example: graph representations
- Methods of storing a graph with N nodes:
- Adjacency matrix: N × N matrix.
- List of lists: N lists of neighbors.
- How long does it take to compute the number of edges in the graph?
  - Matrix: have to look at N² elements.
  - List of lists: have to take len() of N lists, which takes time c × N.
- The list of lists is clearly much faster for this problem (see the sketch below).
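A minimal sketch of the comparison (the tiny example graph here is made up
for illustration; each undirected edge appears twice in both representations):
| N = 4
adj_matrix = [[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]]
adj_lists = [[1, 3], [0, 2], [1], [0]]
# Matrix: inspect all N*N entries -> O(N^2)
edges_matrix = sum(adj_matrix[i][j] for i in range(N) for j in range(N)) // 2
# List of lists: len() of each of the N lists -> O(N)
edges_lists = sum(len(neighbors) for neighbors in adj_lists) // 2
print(edges_matrix, edges_lists)   # 3 3
|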
Big-O notation
How do we express the "speeds" of algorithms?
- Big-O notation: used to classify algorithms by how they respond
(processing time or memory requirements) to changes in input size.
- "Time O(N²)" means "time to run is proportional to N²".
- Double problem size, four times as long to run.
- N can be different parameters, e.g. array size, number of
records, number of atoms.
- Different variables can be combined: O(N × m)
- Important since scientists tend to want to process bigger data (data
size: N).
- Algorithmic analysis does not care about constant factors.
You've probably seen this before. This is a quick formalization
and review.
There is not just O(), but also o(),
Θ(), Ω(), and ω(). These
all represent different degrees of bounding. For the purposes of
this talk, I am ignoring these differences.
I say that we don't care about constant factors. They are
important for optimization, but generally you want to first find
the best algorithm and then optimize the constant factors. For small
problems, a higher-complexity algorithm with a smaller constant can be
better (a quick illustration follows).
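A small, hedged illustration of that last point (the exact numbers depend on
your machine; the point is only that the lower-overhead O(N) scan can win for
tiny inputs, while the O(log N) search wins for large ones):
| import bisect
import timeit

def linear_contains(lst, x):
    return x in lst                        # O(N), but very little overhead

def binary_contains(lst, x):
    i = bisect.bisect_left(lst, x)         # O(log N), more work per call
    return i < len(lst) and lst[i] == x

for label, lst in [("small", list(range(8))), ("big", list(range(100000)))]:
    t_lin = timeit.timeit(lambda: linear_contains(lst, lst[-1]), number=10000)
    t_bin = timeit.timeit(lambda: binary_contains(lst, lst[-1]), number=10000)
    print(label, "linear:", t_lin, "binary:", t_bin)
|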
How am I going to explain this?
- Simple examples of short code lines.
- Not a formal analysis.
- Ignoring questions of when algorithms terminate.
- Again, this is not a complete picture.
Example 1: trivial
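| for i in range(N):
    pass
|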
This is O(N).
How to calculate big-O: multiply together the sizes of all the loops and the
cost of the inner statements.
pass is a single statement (that does nothing), so is O(1), the
best possible.
for i in range(N): does the loop N times.
If N doubles, the amount of work we have to do also
doubles.
Example 2: nested
| for i in range(N):
    for j in range(N):
        pass
|
This is O(N²).
Hopefully this is clear. The total number of lines that execute
here is constant × N².
Example 3: consider complexity of functions called
Suppose func(L) is O(L):
| for i in range(N):
    for j in range(M):
        func(L)
|
This is O(N × M × L).
You can't forget the time complexity of the functions you call.
Example 4: more called functions
Suppose func(N) is O(N)
| for i in range(N):
    func(N)
|
This is O(N²).
Why? The loop runs N times and each call to func(N) costs O(N), so the
total is N × O(N) = O(N²). (Even a growing call like func(i) would give
1 + 2 + ⋯ + N = N(N + 1)/2 = O(N²).)
How is complexity reduced?
- Figuring out O() is good, but what's the difference between a
  good and a bad algorithm?
| def find(data, value):
    for i in range(len(data)):
        if data[i] == value:
            return i
|
- This is O(len(data))
- But most iterations do nothing useful! Ideally we could short-circuit and
  return the right index directly!
I'm not getting into formal algorithmic analysis, but this is where
more formal theory can be helpful. For each problem, there is
some theoretical minimum amount of work that must be done, and some
algorithms do far more work than that minimum.
In the case above, we compare every element of the list to value,
but return only a single index. All of those needless comparisons
could be avoided if we could filter down the candidates somehow.
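One hedged way to do that filtering, assuming we can afford an O(N)
preparation step: build a dict from value to index once, after which each
lookup is O(1). The names here are illustrative:
| data = [7, 3, 9, 3, 5]
index = {}                      # built once, O(len(data))
for i, x in enumerate(data):
    if x not in index:          # keep the first occurrence, like the loop above
        index[x] = i
value = 9
print(index.get(value))         # 2, found in O(1) instead of scanning the list
|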
Complexity reduction 2
| for i in range(len(data)):   # O(N)
    if data[i] != 0:
        f(data[i])
|
In this case, we do O(len(data))
operations, but the important part could be called much less often
if data is sparse.
| for dat in nonzero_data:     # O(actual_data)
    f(dat)
|
In this case, one should keep track of important elements
separately, if data will be mostly zeros.
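A minimal sketch of that idea, assuming the data really is mostly zeros
(the names and values are illustrative):
| data = [0, 0, 3.5, 0, 0, 0, 1.2, 0]
nonzero_data = {i: x for i, x in enumerate(data) if x != 0}   # built once
def f(x):
    print("processing", x)           # placeholder for the real work
for i, x in nonzero_data.items():    # loops over the nonzero elements only
    f(x)
|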
Memory complexity
- Memory complexity measures the amount of extra space needed by an
  algorithm.
- Example: a graph adjacency matrix takes O(N²) memory.
| range(N)    # Python 2: allocates a full list immediately, O(N) memory
xrange(N)   # Python 2: lazy, O(1) memory (Python 3's range behaves like this)
|
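A quick way to see the difference yourself (shown in Python 3, where range
is already lazy):
| import sys
N = 1000000
print(sys.getsizeof(list(range(N))))   # roughly 8 MB of pointers: O(N)
print(sys.getsizeof(range(N)))         # a few dozen bytes: O(1)
|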
Algorithms: summary
- Think about time and memory complexity when you write things.
- I haven't taught you how to write algorithms.
- Just know where to look for slow algorithms. You can come back to
them if needed, and maybe ask someone for ideas.
- In practice, do your best to make things O(size of data)
- Recursion can greatly increase complexity.
- "Big data" has extremely clever algorithms for complexity
  reduction.
  - They can make many seemingly expensive operations O(N)!
I haven't even come close to giving you an understanding of the
field of analysis of algorithms. A CS person would be ashamed.
However, the main point I am trying to make is: think about what
you are doing. When you identify something slow, you can
- See if you can figure out something better.
- Remember to come back to it.
- Ask someone for help, or search for better algorithms yourself.
- When you do come back to algorithms, it will be much easier.
For an example of a clever big data algorithm, see one of my last
examples.
Data structures
The practical portion of this talk
- Data structures are specific arrangements of data in memory.
- Arrangements allow low-complexity operations on the data.
- Key point: Using data structures properly is the most important
way to have fast code.
- They package optimal algorithms for you so that you don't have to
know about them.
- Memory/time tradeoff: using more memory can often mean faster code.
Why are data structures important? Python list insertion
| lst = list(range(10000))
lst.append(5)       # time complexity O(1)
lst.insert(0, 5)    # time complexity O(N)
|
If your list is big, you do not want to be doing the second one!
The point of the second half of this talk is to understand the
properties of data structures, so that you don't accidentally do
things like the second one, adding unneeded factors of
O(N) (or more) to your code!
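If you genuinely need fast insertion at the front, a different structure
fits better; this is a minimal sketch using collections.deque (not something
the example above requires):
| from collections import deque
d = deque(range(10000))
d.append(5)        # O(1), like list.append
d.appendleft(5)    # O(1); lst.insert(0, 5) on a list is O(N)
|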
Example 2: Time complexity of lists and sets
- Let us do a quick example of lists and sets: Time to find an
"average" element.
| n = 100
L = list(range(n))
S = set(L)
%timeit n//2 in L
%timeit n//2 in S
|
Actual time used:
       n=1      n=10     n=100     n=1000
list   181 ns   289 ns   1270 ns   11000 ns
set    202 ns   202 ns   203 ns    235 ns
We see that sets take about the same amount of time for the in
operator regardless of size, while for lists the time scales with N.
Clearly, if we want to analyze big data, we want to be using sets!
Time complexity of typical data structures
Rest of talk: data structures and complexities.
Dynamic heterogeneous array (Python: list)
Data layout: resizable linear array of N elements.
- Operations at the end: .append(...) and del lst[-1]: O(1)
- Operations at the front: .insert(0, ...) and del lst[0]: O(N)
- Indexing: lst[i]: O(1)
Python tuple
- Same as list, but is immutable.
- More memory efficient, especially if creating/destroying often.
Hash table mapping (Python: dict)
- Underlying data structure: hash table
- Insertions: d[k] = v O(1)
- Deletions: del d[k] O(1)
- Lookups: d[k] O(1)
- Contains: k in d O(1)
- Size: len(d) O(1)
- There is no guaranteed ordering (though since Python 3.7, dicts do
  preserve insertion order).
- Greater memory use than lists (but still O(N))
Basically, all operations here are O(1). dicts trade extra
memory for the fastest possible lookups and modifications.
Hash tables have been called "one of the most important data
structures known". When studying big data algorithms, most somehow
use hash tables to turn an O(N^k) operation into
O(N). Their properties seem almost magical.
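A small, hedged example of what those O(1) operations buy you in practice:
counting occurrences in a single O(N) pass.
| counts = {}
for word in ["spam", "egg", "spam", "spam"]:
    counts[word] = counts.get(word, 0) + 1   # O(1) lookup plus O(1) insert
print(counts["spam"])    # 3
print("egg" in counts)   # True, an O(1) membership test
|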
Hash table set (Python: set)
- Same storage as dictionary, but no values.
- Insert/delete/contains also O(1).
- Optimized intersection and union operations:
  O(min(N, M)) and O(N + M) respectively (see the sketch below)
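A minimal illustration of those operations (the ranges are arbitrary):
| A = set(range(0, 1000))
B = set(range(500, 2000))
common = A & B        # intersection, O(min(len(A), len(B)))
either = A | B        # union, O(len(A) + len(B))
print(len(common), len(either))   # 500 2000
|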
Native arrays (Python: numpy arrays)
- Linear array of values of the same type.
- Inserting at the beginning is O(N).
- Resizing requires copying the data, so it is O(N) and not recommended.
- Fast vector operations: +, *, numpy.add, etc. (a small sketch follows).
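A small sketch of the vector-operation point (the arrays here are arbitrary):
| import numpy as np
a = np.arange(1000000, dtype=float)
b = np.ones(1000000)
c = a + 2.0 * b                   # vectorized: the loop runs in compiled code
c_slow = [a[i] + 2.0 * b[i] for i in range(len(a))]   # same result, far slower
|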
Other useful data structures
- Linked list: fast d.appendleft() and d.popleft(); indexing the
  middle is O(N).
  - Python: collections.deque
- Heap: a list kept in a clever partial order, with fast access to the
  smallest element.
  - Python: the heapq module
- Trie and DAWG
I'm not saying to write these yourself: use libraries (a short usage
sketch follows).
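Minimal usage sketches with the standard library (the values are arbitrary):
| from collections import deque
import heapq
d = deque([1, 2, 3])
d.appendleft(0)                 # O(1); a plain list would need O(N) here
d.popleft()                     # O(1)
heap = [5, 1, 4, 2]
heapq.heapify(heap)             # O(N): rearrange the list into heap order
smallest = heapq.heappop(heap)  # O(log N): always removes the smallest, here 1
heapq.heappush(heap, 3)         # O(log N)
|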
Other complexities
- Worst case performance
- Best case performance
- Amortized worst case performance (e.g. list.append is amortized O(1):
  an occasional internal resize costs O(N), but averaged over many appends
  each one is O(1))
Transitions in complexity
- Example: my Python code from the start
- My code was fast in most cases, but when β became
large it got slow. Same N.
- You could leave problem size the same, but vary other parameters...
- ... and everything slows down greatly.
- In your parameter space, you transitioned to a different complexity.
Algorithms vs optimization
- Algorithmic optimization provides world-changing improvements.
- Once you have the best algorithm, tricks to speed it up are
optimization.
- This is language and domain specific, not covered here.
- Most important part is good algorithms and clean code.
- My personal philosophy is "use the best possible algorithm, then
  optimization usually isn't needed". O(N) is fast enough.
"Big data algorithms"
- Book: "Mining of massive datasets", http://mmds.org/
- Use hash tables to transform O(N^x) operations into
  O(N) operations
- Example: locality-sensitive hashing and comparing books
- Finding similar books appears to be an O(N²)
  operation (you are comparing every pair)
- Hash functions which tend to put similar documents in the same bin.
- Combine several such hash functions to magnify the effect and reduce the
  number of pairs to check (a schematic sketch follows).
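A schematic sketch of the bucketing idea only; real locality-sensitive
hashing uses minhash signatures and banding (see the book above), and the
bucket_key function here is a made-up toy:
| from collections import defaultdict
from itertools import combinations

def bucket_key(doc):
    # Toy stand-in for a locality-sensitive hash: similar documents should
    # usually land in the same bucket. Here: the set of lowercased words.
    return frozenset(doc.lower().split())

docs = ["the cat sat", "The cat sat", "a completely different book"]
buckets = defaultdict(list)
for i, doc in enumerate(docs):           # O(N) pass to fill the buckets
    buckets[bucket_key(doc)].append(i)
candidate_pairs = set()
for members in buckets.values():         # compare only within each bucket
    candidate_pairs.update(combinations(members, 2))
print(candidate_pairs)                   # {(0, 1)}: one pair left to check
|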
What next?
- There is a standard CS course "Algorithms and data structures"
- Probably too abstract for most general scientific programmers.
- "Hard" problems
Conclusions
- Algorithms are the key to making good programs
- Consider complexity of functions you call and write
- Try to use something optimal, and search for a better option if what you
  have seems too slow.
Examples