Data Management

Data Management with SQLLite

Reference : https://datacarpentry.org/sql-socialsci/aio.html

What is a relational database?

A relational database is a collection of data items organised as a set of tables. Relationships can be defined between the data in one table and the data in another or many other tables. The relational database system will provide mechanisms by which you can query the data in the tables, re-assemble the data in various ways without altering the data in the actual tables. This querying is usually done using SQL (Structured Query Language). SQL allows a great many queries to be constructed from the use of only a few keywords. You could have a relational database with only one table, but then you would’t have any relationships and it would be more like a spreadsheet. Databases are designed to allow efficient querying against very large tables, more than the 1M rows allowed in an Excel spreadsheet.

What is a table?

As were have noted above, a single table is very much like a spreadsheet. It has rows and it has columns. A row represents a single observation and the columns represents the various variables contained within that observation. Often one or more columns in a row will be designated as a ‘primary key’ This column or combination of columns can be used to uniquely identify a specific row in the table. The columns typically have a name associated with them indicating the variable name. A column always represents the same variable for each row contained in the table. Because of this the data in each column will always be of the same type, such as an Integer or Text, of values for all of the rows in the table. Datatypes are discussed in the next section.

What is a data type?

A data type is a description of the kind of data in a table column. Each database system recognises its own set of datatypes, although some are common to many. Typical examples will be Integer or Text.

The table below gives some examples.

Data type

Description

CHARACTER(n)

Character string. Fixed-length n

Text

Character string. Variable length

VARCHAR(n) or CHARACTER VARYING(n)

Character string. Variable length. Maximum length n

BINARY(n)

Binary string. Fixed-length n

BOOLEAN

Stores TRUE or FALSE values

VARBINARY(n) or BINARY VARYING(n)

Binary string. Variable length. Maximum length n

INTEGER(p)

Integer numerical (no decimal).

SMALLINT

Integer numerical (no decimal).

INTEGER

Integer numerical (no decimal).

BIGINT

Integer numerical (no decimal).

DECIMAL(p,s)

Exact numerical, precision p, scale s.

NUMERIC(p,s)

Exact numerical, precision p, scale s. (Same as DECIMAL)

FLOAT(p)

Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation.

REAL

Approximate numerical

FLOAT

Approximate numerical

DOUBLE PRECISION

Approximate numerical

DATE

Stores year, month, and day values

TIME

Stores hour, minute, and second values

TIMESTAMP

Stores year, month, day, hour, minute, and second values

INTERVAL

Composed of a number of integer fields, representing a period of time, depending on the type of interval

ARRAY

A set-length and ordered collection of elements

MULTISET

A variable-length and unordered collection of elements

XML

Stores XML data

But in practice you can usually restrict your usage to a few

Data type

Description

BOOLEAN

Stores TRUE or FALSE values

INTEGER

Integer numerical (no decimal).

FLOAT

Approximate numerical

DATE

Stores year, month, and day values

TIME

Stores hour, minute, and second values

TIMESTAMP

Stores year, month, day, hour, minute, and second values

In SQLite there is only a small number.

Data type

Description

NULL

The value is a NULL value

INTEGER

The value is a signed integer, stored in 1, 2, 3, 4, 6,

or 8 bytes depending on the magnitude of the value

REAL

The value is a floating point value, stored in 8-bytes

TEXT

The value is a text string

BLOB

The data is stored exactly as it was input, Used for binary

data such as images.

We won’t be using any BLOB data and it is debatable whether or not NULL should be considered a type at all.

There are some common datatypes which are missing from the SQLite list.

BOOL or BOOLEAN : This type typicaly accepts values of ‘True’ and ‘False’ In SQLite we would use the Integer type and assign vlaues of 1 to represent ‘True’ and 0 to represent ‘False’.

DATE, DATETIME, TIMESTAMP : SQLite does not have a datatype for storing dates and/or times. You can use TEXT, REAL, or INTEGER values for these and use the built-in Date And Time Functions to manipulate them. We will look at manipulating dates in Lesson 5.

Why do tables have primary key columns?

Whenever you create a table, you will have the option of designating one of the columns as the primary key column. The main property of the primary key column is that the values contained in it must uniquely identify that particular row. That is you cannot have duplicate primary keys. This can be an advantage which adding rows to the table as you will not be allowed to add the same row (or a row with the same primary key) twice.

The primary key column for a table is usually of type Integer although you could have Text. For example if you had a table of car information, then the “Reg_No” column could be made the primary key as it can be used to uniquely identify a particular row in the table.

A table doesn’t have to have a primary key although they are recommended for larger tables. A primary key can also be made up of more than one column, although this is less ususal.

What different types of keys are there?

In addition to the primary key, a table may have one or more Foreign keys. A foreign key does not have to be unique or identified as a foreign key when the table is created. A foreign key in one table will relate to the primary key in another table. This allows a relationship to be created between the two tables. If a table needs to be related to several other tables, then there will be a foreign key (column) for each of those tables.

How does the database represent missing data?

All relational database systems have the concept of a NULL value. NULL can be thought of as being of all data types or of no data type at all. It represents something which is simply not known.

When you create a database table, for each column you are allowed to indicate whether or not it can contain the NULL value. Like primary keys, this can be used as a form of data validation.

In many real life situations you will have to accept that the data isn’t perfect and will have to allow for NULL or missing values in your table.

In DB Browser we can indicate how we want NULL values to be displayed. We will use a RED background to the cell to make it stand out. In SQL queries you can specifically test for NULL values.

We will look at missing data in more detail in a later episode.

Key Points

  • A relational database is data organised as a collection of related tables

  • SQL (Structured Query Language) is used to extract data from the tables. Either a single table or data spread across two or more related tables.

  • A schema, which describes the data in a table, has to be created before data can be added

  • The schema can be used to provide some data validation on input

The Select Statement

Definition of SQL

SQL or Structured Query Language is an international standard for manipulating data in a relational database. Each Relational Database system like Oracle, MySQL or SQLite implements its own variation of the standard.

Fortunately for the types of commands and queries that we will want to write, all of the implementations are much in agreement. The SELECT queries we will be writing to access data in our SQLite database will execute un-altered in many of the other environments.

Essentially you only have to learn SQL once.

SQL and Relational database tables

The strength of SQL is that a single SQL statement or query can request data be returned from one or many of the tables in the database. You can essentially define the relationships between tables on-the-fly as part of your query statement. Relationships between tables are often included as part of the overall database design. In our situation we may be getting an assortment of tables from different sources so being able to imply the relationship as part of the query has definite advantages.

DDL and DML

DDL stands for Data Definition Language. It is the set of SQL commands used to create alter of delete database objects such as tables.

DML stands for Data Manipulation Language. For our purposes this is the SELECT command which is used to extract data items from one or more of the database tables.

Simple SQL queries using the Select statement

For the rest of this episode we will be looking at the SELECT statement.

To follow along, you should open the DB Browser application and connect to the SQL_SAFI database.

In SQL, querying data is performed by a SELECT statement. A select statement has 6 key components;

SELECT colnames
FROM tablename
WHERE conditions
GROUP BY colnames
HAVING conditions
ORDER BY colnames

In practice very few queries will have all of these clauses in them simplifying many queries. On the other hand, conditions in the WHERE clause can be arbitrarily complex and if you need to JOIN two or more tables together then more clauses (JOIN and ON) are needed.

All of the clause names above have been written in uppercase for clarity. SQL is not case sensitive. Neither do you need to write each clause on a new line, but it is often clearer to do so for all but the simplest of queries.

In this episode we will start with the very simple and work our way up to the more complex.

The simplest query is effectively one which returns the contents of the whole table

SELECT *
FROM Farms;

It is better practice and generally more efficient to explicitly list the column names that you want returned.

SELECT Country, A06_province, A07_district, A08_ward, A09_village
FROM Farms;

The ‘*’ character acts as a wildcard meaning all of the columns but you cannot use it as a general wildcard. So for example, the following is not valid.

SELECT A*
FROM Farms;

If you run it you will get an error. When an error does occur you will see an error message displayed in the bottom pane.

In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to say how many rows are wanted using the Limit clause. In the example below only the first ten rows of the result of the query will be returned. This is useful if you just want to get a feel for what the data looks like.

SELECT *
FROM Farms
LIMIT 10;

Exercise

Write a query which returns the first 5 rows from the Farms table with only the columns Id, and B16 to B20.

Solution

SELECT  Id
      , B16_years_liv
    , B17_parents_liv
    , B18_sp_parents_liv
    , B19_grand_liv
    , B20_sp_grand_liv
FROM Farms
LIMIT 5;

Because the query uses several columns (with longish names), for readability they have been set out on separate lines. SQL takes of white space to you are free to arrange the text of the query as you like.

The Where clause

Usually you will want to restrict the rows returned based on some criteria. i.e. certain values or ranges within one or more columns.

In this example we are only interested in rows where the value in the B16_years_liv column is greater than 25

SELECT  Id, B16_years_liv
FROM Farms
WHERE B16_years_liv > 25
;

In addition to using the ‘>’ we can use many other operators such as <, <=, =, >=, <>

SELECT  Id, B17_parents_liv
FROM Farms
WHERE B17_parents_liv = 'yes'
;

Using more complex logical expressions in the Where clause

We can also use the AND and OR keywords to build more complex selection criteria.

SELECT  Id
FROM Farms
WHERE    B17_parents_liv = 'yes' 
     AND B18_sp_parents_liv = 'yes' 
     AND B19_grand_liv = 'yes' 
     AND B20_sp_grand_liv = 'yes' 
;

Notice that the columns being used in the WHERE clause do not need to returned as part of the SELECT clause.

You can ensure the precedence of the operators by using brackets. Judicious use of brackets can also aid readability

SELECT  Id
FROM Farms
WHERE (B17_parents_liv = 'yes' OR B18_sp_parents_liv = 'yes') AND B16_years_liv > 60
;

Exercise

From the above query, breakdown the Where clause so that each component can be tested individually. Make a note of how many rows are returned in each case.

Solution

To test each of the or clauses

SELECT  Id
FROM Farms
WHERE B17_parents_liv = 'yes'
;
SELECT  Id
FROM Farms
WHERE B18_sp_parents_liv = 'yes'
;
SELECT  Id
FROM Farms
WHERE (B17_parents_liv = 'yes' OR B18_sp_parents_liv = 'yes') 
;
SELECT  Id
FROM Farms
WHERE B16_years_liv > 60
;

OR generally creates a less restrictive condition and AND makes a more restrictive condition.

The following query returns the rows where the value of B16_years_liv is in the range 51 to 59 inclusive.

SELECT Id, B16_years_liv
FROM Farms
WHERE B16_years_liv > 50 AND B16_years_liv < 60
;

The same results could be obtained by using the BETWEEN or IN operators

SELECT Id, B16_years_liv
FROM Farms
WHERE B16_years_liv BETWEEN 51 AND 59
;
SELECT Id, B16_years_liv
FROM Farms
WHERE B16_years_liv IN (51, 52, 53, 54, 55, 56, 57, 58, 59)
;

The list of values in brackets do not have to be contiguous or even in order.

Exercise

Write a query using the Farms table which returns the columns Id, A09_village, A11_years_farm, B16_years_liv. We are only interested in rows where the A09_village value is either ‘God’ or ‘Ruaca’. Additionally we only want A11_years_farm values in the range 20 to 30 exclusive and B16_years_liv values strictly greater than 40. There are many ways of doing this, but try to use an inequality, an IN clause and a BETWEEN clause.

Solution

SELECT Id, A09_village, A11_years_farm, B16_years_liv
FROM Farms
WHERE     A09_village IN ('God', 'Ruaca') 
      AND A11_years_farm BETWEEN 21 AND 29 
      AND B16_years_liv > 40
;

Sorting results

If you want the results of your query to appear in a specific order, you can use the ORDER BY clause

SELECT Id, A09_village, A11_years_farm, B16_years_liv
FROM Farms
WHERE A09_village = 'God'
ORDER BY A11_years_farm
;

By default the SQL assumes Ascending order. You can make this more explicit by using the ASC or DESC keywords.

SELECT Id, A09_village, A11_years_farm, B16_years_liv
FROM Farms
WHERE A09_village = 'God'
ORDER BY A11_years_farm DESC
;

You can also order by multiple columns

SELECT Id, A09_village, A11_years_farm, B16_years_liv
FROM Farms
WHERE A09_village = 'God'
ORDER BY A11_years_farm DESC , B16_years_liv ASC
;

Key Points

  • Strictly speaking SQL is a standard, not a particular implementation

  • SQL implementation are sufficiently close that you only have to learn SQL once

  • The DDL constructs are used to create tables and other database objects

  • The DML constructs, typically the SELECT statement is used to retrieve data from one or more tables

  • The SELECT statement allows you to ‘slice’ and ‘dice’ the columns and rows of the dataset so that the query only returns the data of interest

Missing Data

How does the database represents missing data

At the beginning of this lesson we noted that all database systems have the concept of a NULL value; Something which is missing and nothing is known about it.

In DB Browser we can choose how we want NULLs in a table to be displayed. When we had our initial look at DB Browser, we used the View | Preference option to change the background colour of cells in a table which has a NULL values as red. The example below, using the ‘Browse data’ tab, shows a section of the Farms table in the SQL_SAFI database showing column values which are NULL.

If you type ‘=NULL’ in the filter box for F14_items_owned, only the rows with NULL in F14_items_owned will be displayed.

You can get the same results using the following query;

SELECT *
FROM Farms
WHERE F14_items_owned IS NULL
;

Notice that we use IS and not =. This is because ‘NULL’ equals nothing and everything all at the same time!

This table was created from a csv file, part of which looks like this

The highlighted area shows part of the record with Id = 21, the second record returned by the query. It starts with the ‘F10_liv_owned’ column and ends with the ‘G01_no_meals’ column. The Arrow points to the two consecutive ‘,’s representing the lack of a value for the ‘F14_items_owned’ column. These values are missing from the data.

Reasons for Missing data

There can be many reasons why data is missing; Not collected, lost, Not applicable etc. In the case of our Farms table, many of the missing values have occurred as a result of the survey design.

If you run the following query :

SELECT E01_water_use, E_no_group_count, E_yes_group_count
FROM Farms
;

The first part of the results will look like this:

You may be able to spot from this the relationship between the values in the E01_water_use column and whether or not there is a NULL value in either the E_no_group_count or the E_yes_group_count column.

Only if the Farmer said that they did use water (E01_water_use = ‘yes’) they were asked how many plots they used water on and the value stored in E_yes_group_count otherwise this field was not even presented in the survey and so contains a NULL value. In this situation we expect NULL values and they will not cause any problems.

However the F14_items_owned column records the possessions of the Farmer. This question was always asked. It is not clear from the NULL values we find in this field whether or not it means ‘I have no possessions’ or ‘I do not wish to tell you what possessions I have’, in short, we know nothing about the items owned and therefore the value of NULL is appropriate.

Dealing with missing data

There are several statistical techniques that can be used to allow for NULL values, which one you might will depend on what has caused the NULL value to be recorded.

You may want to change the NULL value to something else. For example if we knew that the NULL values in the F14_items_owned column actually meant that the Farmer had no possessions then we might want to change the NULL values to ‘[]’ to represent and empty list. We can do that in SQL with an UPDATE query.

The update query is shown below. We are not going to run it as it would change our data. You need to be very sure of the effect you are going to have before you change data in this way.

UPDATE Farms
SET F14_items_owned = '[]'
WHERE F14_items_owned is NULL 
;

Rather than changing the data we may just want to miss it out of our analysis.

We can write a query which excludes the rows where F14_items_owned has a `NULL value with:

SELECT * from Farms
WHERE F14_items_owned IS NOT NULL
;

Key Points

  • You should expect missing data

  • You need to know how missing data is being represented in your dataset

  • Database systems always represent what they consider to be missing data as NULL

  • You can explicitly test for NULL values in your data

  • You may need other tests for different representations of NULL

Last updated