SQL Database usage: by example

Outline

After this presentation, you will be able to

You won't:

Review of database concepts

Basic model of data

Simple example of rearrangement (aggregating by type):

type data
A 1
A 5
B 2
C 1
C 2
C 3

GROUP BY type

===>

type sum_data
A 6
B 2
C 6

Connecting to a database: Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
>>> import sqlite3
>>> conn = sqlite3.connect('/scratch/networks/data/mit-reality.sqlite')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ego_id, affil FROM subject LIMIT 5')
<sqlite3.Cursor object at 0x7f62a00b57a0>
>>> for row in cursor:
...     print row
...
(0, None)
(1, None)
(2, u'1styeargrad ')
(3, u'mlgrad')
(4, u'mlgrad')

Connecting to a database: command line or sqlitebrowser

1
$ sqlite3 /scratch/networks/data/mit-reality.sqlite
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.

select * from routes limit 5;
2|141|3|14|Välivainio - Lintula||3||3300cc|ffffff
3|234|3|23B|Ii - Oulu||3||000000|ffffff
4|414|10|41|Oulu - Muhos - Oulu||3||000000|ffffff
6|71|7|7N|Keskusta - Saarela||3||3300cc|ffffff
7|564|3|56B|Kempele - Tyrnävä||3||000000|ffffff

Exercise: Oulu public transit schedules

  1. Download the file: http://rkd.zgib.net/~rkdarst/oulu_transit.week.sqlite

  2. Open the database: sqlitebrowser oulu_transit.week.sqlite or sqlite3 oulu_transit.week.sqlite

  3. Use the .schema command to get the schema (definition) of all tables. This is everything needed to re-create the database.

    1
    .schema
    
  4. Select the first 10 lines of the stops table:

    1
    SELECT * FROM stops LIMIT 10;
    
  5. Count all lines in the stops and routes tables:

    1
    2
    SELECT count(*) FROM stops;
    SELECT count(*) FROM routes;
    

Other databases you may want to connect to

  • HSL data for one day: /local/cache/hsl_data/db-1day.sqlite on thor
  • Mobile phone data in /scratch/networks/darst/datasets/MobilePhoneData/
  • Oxford ego data in /scratch/networks/darst/oxford_egos.sqlite
  • ISI web of science in /data/isi/isi.sqlite (requires isi group.)

Basic SQL commands

Example of incremental query build-up:

1
2
3
4
5
6
7
8
9
sqlite> SELECT * FROM subject LIMIT 5;
sqlite> SELECT * FROM subject LEFT JOIN comm USING (ego_id)
        LIMIT 5;
sqlite> SELECT ego_id, max(ts), min(ts) FROM subject
        LEFT JOIN comm USING (ego_id)
        GROUP BY ego_id LIMIT 5;
sqlite> SELECT ego_id, max(ts), min(ts) FROM subject
        LEFT JOIN comm USING (ego_id)
        GROUP BY ego_id;

SELECT

Syntax: SELECT <columns or expressions> FROM <table name>;

LIMIT

Syntax: SELECT ... FROM ... LIMIT N;

WHERE

Syntax: SELECT ... FROM ... WHERE <condition>

ORDER BY

Syntax: SELECT ... FROM ... ORDER BY <expression>

Ordering of clauses

1
2
SELECT ... FROM ... JOIN ... WHERE ...
GROUP BY ... ORDER BY ... LIMIT ...;

Exercises 2: going a bit deeper

  1. Continue with Oulu transit data.
  2. Use .schema to understand the table and column names.
  3. Use SELECT ... LIMIT 5 to study the type of data in each table.
  4. Use SELECT count(*) and WHERE ego_id=N to get the total number of events of each ego.
  5. Select the first and last event in the comm table. To sort by time in reverse order, use ORDER BY <column name> DESC. To get only the first event in this order, use LIMIT 1.

Aggregate functions

1
SELECT type, sum(count) FROM table GROUP BY type;
type data
A 1
A 5
B 2
C 1
C 2
C 3
===>
type sum(data)
A 1+5
B 2
C 1+2+3

Example: Aggregating MIT reality mining

1
2
3
4
5
6
7
8
9
SELECT ego_id, min(ts), max(ts)
   FROM comm
   WHERE desc='Voice call'
   GROUP BY ego_id;

1|2005-01-27 19:05:37.000000|2005-01-27 22:36:59.000000
3|2004-08-03 15:50:19.000000|2005-05-04 21:16:58.000000
4|2004-08-01 00:05:52.000000|2004-12-24 05:54:03.000000
...

Exercise: Advanced queries on MIT reality mining

On the MIT reality mining dataset, compute the following. You will have to use the .schema command to figure out what tables and columns are relevant!

  1. Number of events per user (of both comm and bluetooth types). (Hint: GROUP BY ego_id and count(*))
  2. Total number of communications of each type (desc column)
  3. Number of users per affiliation. (Hint: subject table)

Join

Example:

ego_id ts
0 1.5
0 2.0
1 0.0
2 0.7
2 1.2
ego_id affil
0 grad
1 mlstaff
2 professor

JOIN on (ego_id)

===>

ego_id ts affil
0 1.5 grad
0 2.0 grad
1 0.0 mlstaff
2 0.7 professor
2 1.2 professor

Join syntax

Syntax 1: SELECT ... FROM <table name> JOIN <table name> USING ( <column name> )

Syntax 1: SELECT ... FROM <table name> JOIN <table name> ON ( <condition> )

Join examples

Explore these examples yourself

  1. MIT reality, adding affil column
1
2
3
4
5
sqlite> SELECT ego_id, affil, count(*) FROM subject LEFT JOIN comm
        USING (ego_id) GROUP BY ego_id LIMIT 10;
0||1
1||20
2|1styeargrad |1

Date and time functions

Date and time operations

Examples of operations you can do:

Exercise: day intervals

  1. Again use the MIT reality mining data, beginning from where you last left off. You will work to add an extra column, the number of days between the first and last event.

  2. Create a table with the following information for only events of description 'Voice call':

    ego_id, affil, #-events-total, first-event, last-event, #-of-days-of-events.

  3. Sort your rows by #-of-days-of-events.

Miscellaneous other functions and expressions

Views

Indexing

Using indexes

Advanced query types

Inserting and updating

Creating entire databases from scratch

Summary

The end

References