Data Management
Last updated
Last updated
Reference : https://datacarpentry.org/sql-socialsci/aio.html
A relational database is a collection of data items organised as a set of tables. Relationships can be defined between the data in one table and the data in another or many other tables. The relational database system will provide mechanisms by which you can query the data in the tables, re-assemble the data in various ways without altering the data in the actual tables. This querying is usually done using SQL (Structured Query Language). SQL allows a great many queries to be constructed from the use of only a few keywords. You could have a relational database with only one table, but then you would’t have any relationships and it would be more like a spreadsheet. Databases are designed to allow efficient querying against very large tables, more than the 1M rows allowed in an Excel spreadsheet.
As were have noted above, a single table is very much like a spreadsheet. It has rows and it has columns. A row represents a single observation and the columns represents the various variables contained within that observation. Often one or more columns in a row will be designated as a ‘primary key’ This column or combination of columns can be used to uniquely identify a specific row in the table. The columns typically have a name associated with them indicating the variable name. A column always represents the same variable for each row contained in the table. Because of this the data in each column will always be of the same type, such as an Integer or Text, of values for all of the rows in the table. Datatypes are discussed in the next section.
A data type is a description of the kind of data in a table column. Each database system recognises its own set of datatypes, although some are common to many. Typical examples will be Integer or Text.
The table below gives some examples.
Data type
Description
CHARACTER(n)
Character string. Fixed-length n
Text
Character string. Variable length
VARCHAR(n) or CHARACTER VARYING(n)
Character string. Variable length. Maximum length n
BINARY(n)
Binary string. Fixed-length n
BOOLEAN
Stores TRUE or FALSE values
VARBINARY(n) or BINARY VARYING(n)
Binary string. Variable length. Maximum length n
INTEGER(p)
Integer numerical (no decimal).
SMALLINT
Integer numerical (no decimal).
INTEGER
Integer numerical (no decimal).
BIGINT
Integer numerical (no decimal).
DECIMAL(p,s)
Exact numerical, precision p, scale s.
NUMERIC(p,s)
Exact numerical, precision p, scale s. (Same as DECIMAL)
FLOAT(p)
Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation.
REAL
Approximate numerical
FLOAT
Approximate numerical
DOUBLE PRECISION
Approximate numerical
DATE
Stores year, month, and day values
TIME
Stores hour, minute, and second values
TIMESTAMP
Stores year, month, day, hour, minute, and second values
INTERVAL
Composed of a number of integer fields, representing a period of time, depending on the type of interval
ARRAY
A set-length and ordered collection of elements
MULTISET
A variable-length and unordered collection of elements
XML
Stores XML data
But in practice you can usually restrict your usage to a few
Data type
Description
BOOLEAN
Stores TRUE or FALSE values
INTEGER
Integer numerical (no decimal).
FLOAT
Approximate numerical
DATE
Stores year, month, and day values
TIME
Stores hour, minute, and second values
TIMESTAMP
Stores year, month, day, hour, minute, and second values
In SQLite there is only a small number.
Data type
Description
NULL
The value is a NULL value
INTEGER
The value is a signed integer, stored in 1, 2, 3, 4, 6,
or 8 bytes depending on the magnitude of the value
REAL
The value is a floating point value, stored in 8-bytes
TEXT
The value is a text string
BLOB
The data is stored exactly as it was input, Used for binary
data such as images.
We won’t be using any BLOB data and it is debatable whether or not NULL should be considered a type at all.
There are some common datatypes which are missing from the SQLite list.
BOOL or BOOLEAN : This type typicaly accepts values of ‘True’ and ‘False’ In SQLite we would use the Integer type and assign vlaues of 1 to represent ‘True’ and 0 to represent ‘False’.
DATE, DATETIME, TIMESTAMP : SQLite does not have a datatype for storing dates and/or times. You can use TEXT, REAL, or INTEGER values for these and use the built-in Date And Time Functions to manipulate them. We will look at manipulating dates in Lesson 5.
Whenever you create a table, you will have the option of designating one of the columns as the primary key column. The main property of the primary key column is that the values contained in it must uniquely identify that particular row. That is you cannot have duplicate primary keys. This can be an advantage which adding rows to the table as you will not be allowed to add the same row (or a row with the same primary key) twice.
The primary key column for a table is usually of type Integer although you could have Text. For example if you had a table of car information, then the “Reg_No” column could be made the primary key as it can be used to uniquely identify a particular row in the table.
A table doesn’t have to have a primary key although they are recommended for larger tables. A primary key can also be made up of more than one column, although this is less ususal.
In addition to the primary key, a table may have one or more Foreign keys. A foreign key does not have to be unique or identified as a foreign key when the table is created. A foreign key in one table will relate to the primary key in another table. This allows a relationship to be created between the two tables. If a table needs to be related to several other tables, then there will be a foreign key (column) for each of those tables.
All relational database systems have the concept of a NULL value. NULL can be thought of as being of all data types or of no data type at all. It represents something which is simply not known.
When you create a database table, for each column you are allowed to indicate whether or not it can contain the NULL value. Like primary keys, this can be used as a form of data validation.
In many real life situations you will have to accept that the data isn’t perfect and will have to allow for NULL or missing values in your table.
In DB Browser we can indicate how we want NULL values to be displayed. We will use a RED background to the cell to make it stand out. In SQL queries you can specifically test for NULL values.
We will look at missing data in more detail in a later episode.
Key Points
A relational database is data organised as a collection of related tables
SQL (Structured Query Language) is used to extract data from the tables. Either a single table or data spread across two or more related tables.
A schema, which describes the data in a table, has to be created before data can be added
The schema can be used to provide some data validation on input
SQL or Structured Query Language is an international standard for manipulating data in a relational database. Each Relational Database system like Oracle, MySQL or SQLite implements its own variation of the standard.
Fortunately for the types of commands and queries that we will want to write, all of the implementations are much in agreement. The SELECT queries we will be writing to access data in our SQLite database will execute un-altered in many of the other environments.
Essentially you only have to learn SQL once.
The strength of SQL is that a single SQL statement or query can request data be returned from one or many of the tables in the database. You can essentially define the relationships between tables on-the-fly as part of your query statement. Relationships between tables are often included as part of the overall database design. In our situation we may be getting an assortment of tables from different sources so being able to imply the relationship as part of the query has definite advantages.
DDL stands for Data Definition Language. It is the set of SQL commands used to create alter of delete database objects such as tables.
DML stands for Data Manipulation Language. For our purposes this is the SELECT command which is used to extract data items from one or more of the database tables.
For the rest of this episode we will be looking at the SELECT statement.
To follow along, you should open the DB Browser application and connect to the SQL_SAFI database.
In SQL, querying data is performed by a SELECT statement. A select statement has 6 key components;
In practice very few queries will have all of these clauses in them simplifying many queries. On the other hand, conditions in the WHERE clause can be arbitrarily complex and if you need to JOIN two or more tables together then more clauses (JOIN and ON) are needed.
All of the clause names above have been written in uppercase for clarity. SQL is not case sensitive. Neither do you need to write each clause on a new line, but it is often clearer to do so for all but the simplest of queries.
In this episode we will start with the very simple and work our way up to the more complex.
The simplest query is effectively one which returns the contents of the whole table
It is better practice and generally more efficient to explicitly list the column names that you want returned.
The ‘*’ character acts as a wildcard meaning all of the columns but you cannot use it as a general wildcard. So for example, the following is not valid.
If you run it you will get an error. When an error does occur you will see an error message displayed in the bottom pane.
In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to say how many rows are wanted using the Limit
clause. In the example below only the first ten rows of the result of the query will be returned. This is useful if you just want to get a feel for what the data looks like.
ExerciseWrite a query which returns the first 5 rows from the Farms table with only the columns Id, and B16 to B20.
SolutionBecause the query uses several columns (with longish names), for readability they have been set out on separate lines. SQL takes of white space to you are free to arrange the text of the query as you like.
Where
clauseUsually you will want to restrict the rows returned based on some criteria. i.e. certain values or ranges within one or more columns.
In this example we are only interested in rows where the value in the B16_years_liv column is greater than 25
In addition to using the ‘>’ we can use many other operators such as <, <=, =, >=, <>
Where
clauseWe can also use the AND and OR keywords to build more complex selection criteria.
Notice that the columns being used in the WHERE
clause do not need to returned as part of the SELECT
clause.
You can ensure the precedence of the operators by using brackets. Judicious use of brackets can also aid readability
ExerciseFrom the above query, breakdown the
Where
clause so that each component can be tested individually. Make a note of how many rows are returned in each case.
SolutionTo test each of the
or
clauses
OR
generally creates a less restrictive condition andAND
makes a more restrictive condition.
The following query returns the rows where the value of B16_years_liv is in the range 51 to 59 inclusive.
The same results could be obtained by using the BETWEEN or IN operators
The list of values in brackets do not have to be contiguous or even in order.
Write a query using the Farms table which returns the columns Id, A09_village, A11_years_farm, B16_years_liv. We are only interested in rows where the A09_village value is either ‘God’ or ‘Ruaca’. Additionally we only want A11_years_farm values in the range 20 to 30 exclusive and B16_years_liv values strictly greater than 40. There are many ways of doing this, but try to use an inequality, an
IN
clause and aBETWEEN
clause.
If you want the results of your query to appear in a specific order, you can use the ORDER BY clause
By default the SQL assumes Ascending order. You can make this more explicit by using the ASC
or DESC
keywords.
You can also order by multiple columns
Key Points
Strictly speaking SQL is a standard, not a particular implementation
SQL implementation are sufficiently close that you only have to learn SQL once
The DDL constructs are used to create tables and other database objects
The DML constructs, typically the SELECT statement is used to retrieve data from one or more tables
The SELECT statement allows you to ‘slice’ and ‘dice’ the columns and rows of the dataset so that the query only returns the data of interest
At the beginning of this lesson we noted that all database systems have the concept of a NULL value; Something which is missing and nothing is known about it.
In DB Browser we can choose how we want NULLs in a table to be displayed. When we had our initial look at DB Browser, we used the View | Preference
option to change the background colour of cells in a table which has a NULL
values as red. The example below, using the ‘Browse data’ tab, shows a section of the Farms table in the SQL_SAFI database showing column values which are NULL
.
If you type ‘=NULL’ in the filter box for F14_items_owned
, only the rows with NULL in F14_items_owned
will be displayed.
You can get the same results using the following query;
Notice that we use IS
and not =
. This is because ‘NULL’ equals nothing and everything all at the same time!
This table was created from a csv file, part of which looks like this
The highlighted area shows part of the record with Id = 21, the second record returned by the query. It starts with the ‘F10_liv_owned’ column and ends with the ‘G01_no_meals’ column. The Arrow points to the two consecutive ‘,’s representing the lack of a value for the ‘F14_items_owned’ column. These values are missing from the data.
There can be many reasons why data is missing; Not collected, lost, Not applicable etc. In the case of our Farms table, many of the missing values have occurred as a result of the survey design.
If you run the following query :
The first part of the results will look like this:
You may be able to spot from this the relationship between the values in the E01_water_use
column and whether or not there is a NULL
value in either the E_no_group_count
or the E_yes_group_count
column.
Only if the Farmer said that they did use water (E01_water_use = ‘yes’) they were asked how many plots they used water on and the value stored in E_yes_group_count otherwise this field was not even presented in the survey and so contains a NULL
value. In this situation we expect NULL
values and they will not cause any problems.
However the F14_items_owned
column records the possessions of the Farmer. This question was always asked. It is not clear from the NULL
values we find in this field whether or not it means ‘I have no possessions’ or ‘I do not wish to tell you what possessions I have’, in short, we know nothing about the items owned and therefore the value of NULL
is appropriate.
There are several statistical techniques that can be used to allow for NULL
values, which one you might will depend on what has caused the NULL
value to be recorded.
You may want to change the NULL
value to something else. For example if we knew that the NULL
values in the F14_items_owned
column actually meant that the Farmer had no possessions then we might want to change the NULL
values to ‘[]’ to represent and empty list. We can do that in SQL with an UPDATE
query.
The update query is shown below. We are not going to run it as it would change our data. You need to be very sure of the effect you are going to have before you change data in this way.
Rather than changing the data we may just want to miss it out of our analysis.
We can write a query which excludes the rows where F14_items_owned
has a `NULL value with:
Key Points
You should expect missing data
You need to know how missing data is being represented in your dataset
Database systems always represent what they consider to be missing data as
NULL
You can explicitly test for
NULL
values in your dataYou may need other tests for different representations of
NULL