Chapter 21. Processing data files
Much of the data available is contained in text files. This data can range from unstructured text, such as a corpus of tweets or literary texts, to more structured data in which each row is a record and the fields are delimited by a special character, such as a comma, a tab, or a pipe (|). Text files can be huge; a data set can be spread over tens or even hundreds of files, and the data in it can be incomplete or horribly dirty. With all the variations, it's almost inevitable that you'll need to read and use data from text files. This chapter gives you strategies for using Python to do exactly that.
21.1. Welcome to ETL
The need to get data out of files, parse it, turn it into a useful format, and then do something with it has been around for as long as there have been data files. In fact, there is a standard term for the process: extract-transform-load (ETL). The extraction refers to the process of reading a data source and parsing it, if necessary. The transformation can be cleaning and normalizing the data, as well as combining, breaking up, or reorganizing the records it contains. The loading refers to storing the transformed data in a new place, either a different file or a database. This chapter deals with the basics of ETL in Python, starting with text-based data files and storing the transformed data in other files. I look at more structured data files in chapter 22 and storage in databases in chapter 23.
21.2. Reading text files
The first part of ETL, the "extract" portion, involves opening a file and reading its contents. This process seems like a simple one, but even at this point there can be issues, such as the file's size. If a file is too large to fit into memory and be manipulated, you need to structure your code to handle smaller segments of the file, possibly operating one line at a time.
21.2.1. Text encoding: ASCII, Unicode, and others
Another possible pitfall is in the encoding. This chapter deals with text files, and in fact, much of the data exchanged in the real world is in text files. But the exact nature of text can vary from application to application, from person to person, and of course from country to country.
Sometimes, text means something in the ASCII encoding, which has 128 characters, only 95 of which are printable. The good news about ASCII encoding is that it's the lowest common denominator of most data exchange. The bad news is that it doesn't begin to handle the complexities of the many alphabets and writing systems of the world. Reading files using ASCII encoding is almost certain to cause trouble and throw errors on character values that it doesn't understand, whether it's a German ü, a Portuguese ç, or something from almost any language other than English.
These errors arise because ASCII is based on 7-bit values, whereas the bytes in a typical file are 8 bits, allowing 256 possible values as opposed to the 128 of a 7-bit value. It's routine to use those additional values to store additional characters, anything from extra punctuation (such as the printer's en dash and em dash) to symbols (such as the trademark, copyright, and degree symbols) to accented versions of alphabetical characters. The problem has always been that if, in reading a text file, you encounter a character in the upper 128 values, outside the ASCII range, you have no way of knowing for sure how it was encoded. Is the character value of 214, say, a division symbol, an Ö, or something else? Short of having the code that created the file, you have no way to know.
Unicode and UTF-8
One way to mitigate this confusion is Unicode. The Unicode encoding called UTF-8 accepts the basic ASCII characters without any change but also allows an almost unlimited set of other characters and symbols according to the Unicode standard. Because of its flexibility, UTF-8 was used in more than 85% of web pages served at the time I wrote this chapter, which means that your best bet for reading text files is to assume UTF-8 encoding. If the files contain only ASCII characters, they'll still be read correctly, but you'll also be covered if other characters are encoded in UTF-8. The good news is that the Python 3 string data type was designed to handle Unicode by default.
Even with Unicode, there'll be occasions when your text contains values that can't be successfully decoded or encoded. Fortunately, the open function in Python accepts an optional errors parameter that tells it how to deal with encoding errors when reading or writing files. The default option is 'strict', which causes an error to be raised whenever an encoding error is encountered. Other useful options are 'ignore', which causes the character causing the error to be skipped; 'replace', which causes the character to be replaced by a marker character (a question mark when writing, or the Unicode replacement character when reading); 'backslashreplace', which replaces the character with a backslash escape sequence; and 'surrogateescape', which translates the offending character to a private Unicode code point on reading and back to the original sequence of bytes on writing. Your particular use case will determine how strict you need to be in handling or resolving encoding issues.
Look at a short example of a file containing an invalid UTF-8 character, and see how the different options handle that character. First, write the file, using bytes and binary mode:
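A minimal sketch might look like the following, where the file name test.txt is just an example; the write puts the bytes for "ABC" on disk, followed by three byte values (255, 192, 193) that aren't valid UTF-8:

>>> open('test.txt', 'wb').write(bytes([65, 66, 67, 255, 192, 193]))
6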
This code results in a file that contains "ABC" followed by three non-ASCII bytes, which may be rendered differently depending on the encoding assumed. If you use vim to look at the file, you see
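The exact rendering depends on your terminal and on vim's encoding settings; with a Latin-1 fallback, the three trailing bytes typically show up as accented characters, something like

ABCÿÀÁ
~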
Now that you have the file, try reading it with the default 'strict' errors option:
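Assuming the file written above and a UTF-8 encoding (passed explicitly here so the example doesn't depend on your platform's default), the read fails with something like this:

>>> open('test.txt', encoding='utf-8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte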
The fourth byte, which had a value of 255, isn't valid UTF-8 in that position, so the 'strict' errors setting raises an exception. Now see how the other error options handle the same file, keeping in mind that the last three bytes cause errors:
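Sketches of the same read with each of the other options (again with encoding='utf-8' made explicit):

>>> open('test.txt', encoding='utf-8', errors='ignore').read()
'ABC'
>>> open('test.txt', encoding='utf-8', errors='replace').read()
'ABC���'
>>> open('test.txt', encoding='utf-8', errors='backslashreplace').read()
'ABC\\xff\\xc0\\xc1'
>>> open('test.txt', encoding='utf-8', errors='surrogateescape').read()
'ABC\udcff\udcc0\udcc1'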
If you want any problem characters to disappear, 'ignore' is the option to use. The 'replace' option only marks the place occupied by the invalid character, and the other options in different ways attempt to preserve the invalid characters without interpretation.
21.2.2. Unstructured text
Unstructured text files are the easiest sort of data to read but the hardest to extract information from. Processing unstructured text can vary enormously, depending on both the nature of the text and what you want to do with it, so any comprehensive discussion of text processing is beyond the scope of this book. A short example, however, can illustrate some of the basic issues and set the stage for a discussion of structured text data files.
One of the simplest issues is deciding what forms a basic logical unit in the file. If you have a corpus of thousands of tweets, the text of Moby Dick, or a collection of news stories, you need to be able to break them up into cohesive units. In the case of tweets, each may fit onto a single line, and you can read and process each line of the file fairly simply.
In the case of Moby Dick or even a news story, the problem can be trickier. In many cases, you may not want to treat an entire novel or even a news item as a single unit. If that's the case, you need to decide what sort of unit you do want and then come up with a strategy to divide the file accordingly. Perhaps you want to consider the text paragraph by paragraph. In that case, you need to identify how paragraphs are separated in your file and create your code accordingly. If a paragraph is the same as a line in the text file, the job is easy. Often, however, the lines in a text file are wrapped shorter than a full paragraph, and you need to do a bit more work.
Now look at a couple of examples:
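For instance, a plain-text copy of Moby Dick might begin like this (abridged here), with the lines wrapped roughly as they would be on the page and a blank line between paragraphs:

Call me Ishmael.  Some years ago--never mind how long precisely--
having little or no money in my purse, and nothing particular
to interest me on shore, I thought I would sail about a little
and see the watery part of the world.  It is a way I have
of driving off the spleen and regulating the circulation.

There now is your insular city of the Manhattoes, belted round by
wharves as Indian isles by coral reefs--commerce surrounds it
with her surf.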
In the sample, which is indeed the beginning of Moby Dick, the lines are broken more or less as they might be on the page, and paragraphs are indicated by a single blank line. If you want to deal with each paragraph as a unit, you need to break the text on the blank lines. Fortunately, this task is easy if you use the string split() method. Each newline character in a string can be represented by "\n". Naturally, the last line of a paragraph's text ends with a newline, and if the next line is blank, it's immediately followed by a second newline for the blank line, so splitting on two consecutive newlines breaks the text into paragraphs:
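A sketch, assuming the text has been saved in a file named moby_01.txt:

>>> moby_text = open("moby_01.txt").read()      # read the whole file as one string
>>> moby_paragraphs = moby_text.split("\n\n")   # break it wherever a blank line appears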
Splitting the text into paragraphs is a very simple first step in handling unstructured text. You might also need to do more normalization of the text before processing. Suppose that you want to count the rate of occurrence of every word in a text file. If you just split the file on whitespace, you get a list of words in the file. Counting their occurrences accurately will be hard, however, because This, this, this., and this, are not the same. The way to make this code work is to normalize the text by removing the punctuation and making everything the same case before processing. For the example text above, the code for a normalized list of words might look like this:
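Continuing from the paragraphs above, a sketch of that normalization (only periods and commas are handled here; other punctuation would need the same treatment):

>>> moby = moby_paragraphs[0].lower()   # make everything the same case
>>> moby = moby.replace(".", "")        # remove periods
>>> moby = moby.replace(",", "")        # remove commas
>>> moby_words = moby.split()           # split on whitespace
>>> print(moby_words[:9])
['call', 'me', 'ishmael', 'some', 'years', 'ago--never', 'mind', 'how', 'long']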
QUICK CHECK: NORMALIZATION
Look closely at the list of words generated. Do you see any issues with the normalization so far? What other issues do you think you might encounter in a longer section of text? How do you think you might deal with those issues?
21.2.3. Delimited flat files
Although reading unstructured text files is easy, the downside is their very lack of structure. It's often much more useful to have some organization in the file to help with picking out individual values. The simplest way is to break the file into lines and have one element of information per line. You may have a list of the names of files to be processed, a list of people's names that need to be printed (on name tags, say), or maybe a series of temperature readings from a remote monitor. In such cases, the data parsing is very simple: You read in the line and convert it to the right type, if necessary. Then the file is ready to use.
Most of the time, however, things aren't quite so simple. Usually, you need to group multiple related bits of information, and you need your code to read them in together. The common way to do this is to put the related pieces of information on the same line, separated by a special character. That way, as you read each line of the file, you can use the special characters to split the line into its different fields and put the values of those fields in variables for later processing.
This file is a simple example of temperature data in delimited format:
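The rows might look something like this (these particular values are illustrative):

Illinois|1979/01/01|17.48|994
Illinois|1979/01/02|4.64|994
Illinois|1979/01/03|11.05|994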
This data is pipe-delimited, meaning that each field in the line is separated by the pipe (|) character, in this case giving you four fields: the state of the observations, the date of the observations, the average high temperature, and the number of stations reporting. Other common delimiters are the tab character and the comma. The comma is perhaps the most common, but the delimiter could be any character you don't expect to occur in the values. (More about that issue next.) Comma delimiters are so common that this format is often called CSV (comma-separated values), and files of this type often have a .csv extension as a hint of their format.
Whatever character is being used as the delimiter, if you know what character it is, you can write your own code in Python to break each line into its fields and return them as a list. In the previous case, you can use the string split() method to break a line into a list of values:
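A sketch, using one line from the sample above:

>>> line = "Illinois|1979/01/01|17.48|994"
>>> print(line.strip().split("|"))
['Illinois', '1979/01/01', '17.48', '994']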
Note that this technique is very easy to do but leaves all the values as strings, which might not be convenient for later processing.
TRY THIS: READ A FILE
Write the code to read a text file (assume temp_data_pipes_00a.txt, as shown in the example), split each line of the file into a list of values, and add that list to a single list of records.
What issues or problems did you encounter in implementing this code? How might you go about converting the last three fields to the correct date, real, and int types?
21.2.4. The csv module
If you need to do much processing of delimited data files, you should become familiar with the csv module and its options. When I've been asked to name my favorite module in the Python standard library, more than once I've cited the csv module, not because it's glamorous (it isn't), but because it has probably saved me more work and kept me from more self-inflicted bugs over my career than any other module.
The csv module is a perfect case of Python's "batteries included" philosophy. Although it's perfectly possible, and in many cases not even terribly hard, to roll your own code to read delimited files, it's even easier and much more reliable to use the Python module. The csv module has been tested and optimized, and it has features that you probably wouldn't bother to write if you had to do it yourself, but that are truly handy and time-saving when available.
Look at the previous data, and decide how you'd read it by using the csv module. The code to parse the data has to do two things: read each line and strip off the trailing newline character, and then break up the line on the pipe character and append that list of values to a list of lines. Your solution to the exercise might look something like this:
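One way to do it (assuming the file name from the exercise):

>>> results = []
>>> for line in open("temp_data_pipes_00a.txt"):
...     fields = line.strip().split("|")   # drop the newline, then break on the pipe
...     results.append(fields)
...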
To do the same thing with the csv module, the code might be something like this:
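A sketch with the csv module:

>>> import csv
>>> results = [fields for fields in
...            csv.reader(open("temp_data_pipes_00a.txt", newline=''),
...                       delimiter="|")]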
In this simple case, the gain over rolling your own code doesn't seem so great. Still, the code is two lines shorter and a bit clearer, and there's no need to worry about stripping off newline characters. The real advantages come when you want to deal with more challenging cases.
The data in the example is real, but it's actually been simplified and cleaned. The real data from the source is more complex. The real data has more fields, some fields are in quotes while others are not, and the first field is empty. The original is tab-delimited, but for the sake of illustration, I present it as comma-delimited here:
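An illustrative sample along those lines (the field names anticipate the discussion later in the chapter):

"Notes","State","State Code","Month Day, Year","Month Day, Year Code",Avg Daily Max Air Temperature (F),Record Count for Daily Max Air Temp,Min Temp for Daily Max Air Temp (F),Max Temp for Daily Max Air Temp (F),Avg Daily Max Heat Index (F),Record Count for Daily Max Heat Index,Min for Daily Max Heat Index (F),Max for Daily Max Heat Index (F),Daily Max Heat Index (F) % Coverage
,"Illinois","17","Jan 01, 1979","1979/01/01",17.48,994,6.00,30.50,Missing,0,Missing,Missing,0.00%
,"Illinois","17","Jan 02, 1979","1979/01/02",4.64,994,-6.40,15.80,Missing,0,Missing,Missing,0.00%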
Notice that some fields include commas. The convention in that case is to put quotes around a field to indicate that it's not supposed to be parsed for delimiters. It's quite common, as here, to quote only some fields, especially those in which a value might contain the delimiter character. It also happens, as here, that some fields are quoted even if they're not likely to contain the delimiting character.
In a case like this one, your home-grown code becomes cumbersome. Now you can no longer split the line on the delimiting character; you need to be sure that you look only at delimiters that aren't inside quoted strings. Also, you need to remove the quotes around quoted strings, which might occur in any position or not at all. With the csv module, you don't need to change your code at all. In fact, because the comma is the default delimiter, you don't even need to specify it:
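Assuming the messier data has been saved as temp_data_01.csv (an arbitrary name), a sketch might be:

>>> results = [fields for fields in
...            csv.reader(open("temp_data_01.csv", newline=''))]
>>> results[1]
['', 'Illinois', '17', 'Jan 01, 1979', '1979/01/01', '17.48', '994', '6.00', '30.50', 'Missing', '0', 'Missing', 'Missing', '0.00%']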
Notice that the extra quotes have been removed and that any field values containing commas have those commas preserved inside the field, all without any extra code on your part.
QUICK CHECK: HANDLING QUOTING
Consider how you'd approach the problems of handling quoted fields and embedded delimiter characters if you didn't have the csv library. Which would be easier to handle: the quoting or the embedded delimiters?
21.2.5. Reading a csv file as a list of dictionaries
In the preceding examples, you got a row of data back as a list of fields. This result works fine in many cases, but sometimes it may be handy to get the rows back as dictionaries where the field name is the key. For this use case, the csv library has a DictReader, which can take a list of fields as a parameter or can read them from the first line of the data. If you want to open the data with a DictReader, the code would look like this:
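A sketch using the same file, letting the DictReader pick up the field names from the header row:

>>> results = [fields for fields in
...            csv.DictReader(open("temp_data_01.csv", newline=''))]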
Note that in Python 3.7 and earlier, csv.DictReader returns OrderedDicts, whereas in Python 3.8 and later it returns regular dicts; in both cases the fields stay in their original order. Although the representation may look a little different, the fields still behave like dictionaries:
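For example, with the illustrative data above:

>>> results[0]['State']
'Illinois'
>>> results[0]['Avg Daily Max Air Temperature (F)']
'17.48'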
If the data is particularly complex, and specific fields need to be manipulated, a DictReader can make it much easier to be sure you're getting the right field; it also makes your code somewhat easier to understand. Conversely, if your data set is quite large, you need to keep in mind that DictReader can take on the order of twice as long to read the same amount of data.
21.3. Excel files
The other common file format that I discuss in this chapter is the Excel file, which is the format that Microsoft Excel uses to store spreadsheets. I include Excel files here because the way you end up treating them is very similar to the way you treat delimited files. In fact, because Excel can both read and write CSV files, the quickest and easiest way to extract data from an Excel spreadsheet file often is to open it in Excel and then save it as a CSV file. This procedure doesn't always make sense, however, particularly if you have a lot of files. In that case, even though you could theoretically automate the process of opening and saving each file in CSV format, it's probably faster to deal with the Excel files directly.
It's beyond the scope of this book to discuss spreadsheet files in depth, with their options for multiple sheets in the same file, macros, and various formatting options. Instead, in this section I look at an example of reading a simple one-sheet file just to extract the data from it.
As it happens, Python's standard library doesn't have a module to read or write Excel files. To read that format, you need to install an external module. Fortunately, several modules are available to do the job. For this example, you use one called OpenPyXL, which is available from the Python Package Index (PyPI). You can install it with the following command from a command line:
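The command is the usual pip invocation (depending on your setup, you may prefer python -m pip):

pip install openpyxl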
Reading the file is fairly simple, but itâs still more work than CSV files require. First, you need to load the workbook; next, you need to get the specific sheet; then you can iterate over the rows; and from there, you extract the values of the cells. Some sample code to read the spreadsheet looks like this:
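A sketch along these lines, assuming the data has been saved as a single-sheet workbook named temp_data_01.xlsx:

>>> from openpyxl import load_workbook
>>> wb = load_workbook('temp_data_01.xlsx')
>>> ws = wb.worksheets[0]                    # the first (and only) sheet
>>> results = []
>>> for row in ws.iter_rows():
...     results.append([cell.value for cell in row])   # keep just each cell's value
...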
This code gets you the same results as the much simpler code did for a csv file. It's not surprising that the code to read a spreadsheet is more complex, because spreadsheets are themselves much more complex objects. You should also be sure that you understand the way that data has been stored in the spreadsheet. If the spreadsheet contains formatting that has some significance, if labels need to be disregarded or handled differently, or if formulas and references need to be processed, you need to dig deeper into how those elements should be processed, and you need to write more-complex code.
Spreadsheets also often have other possible issues. At this writing, it's common for spreadsheets to be limited to around a million rows. Although that limit sounds large, more and more often you'll need to handle data sets that are larger. Also, spreadsheets sometimes automatically apply inconvenient formatting. One company I worked for had part numbers that consisted of a digit and at least one letter followed by some combination of digits and letters. It was possible to get a part number such as 1E20. Most spreadsheets automatically interpret 1E20 as scientific notation and save it as 1.00E+20 (1 times 10 to the 20th power) while leaving 1F20 as a string. For some reason, it's rather difficult to keep this from happening, and particularly with a large data set, the problem won't be detected until farther down the pipeline, if at all. For these reasons, I recommend using CSV or delimited files when at all possible. Users usually can save a spreadsheet as CSV, so there's usually no need to put up with the extra complexity and formatting hassles that spreadsheets involve.
21.4. Data cleaning
One common problem you'll encounter in processing text-based data files is dirty data. By dirty, I mean that there are all sorts of surprises in the data, such as null values, values that aren't legal for your encoding, or extra whitespace. The data may also be unsorted or in an order that makes processing difficult. The process of dealing with situations like these is called data cleaning.
21.4.1. Cleaning
As a very simple example of data cleaning, you might need to process a file that was exported from a spreadsheet or other financial program, in which the columns dealing with money may contain percentage and currency symbols (such as %, $, £, and €), as well as digit groupings marked with periods or commas. Data from other sources may have other surprises that make processing tricky if they're not caught in advance. Look again at the temperature data you saw previously. The first data line looks like this:
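Using the illustrative sample from earlier in the chapter:

,"Illinois","17","Jan 01, 1979","1979/01/01",17.48,994,6.00,30.50,Missing,0,Missing,Missing,0.00%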
Some columns, such as 'State' (field 2) and 'Notes' (field 1), are clearly text, and you wouldn't be likely to do much with them. There are also two date fields in different formats, and you might well want to do calculations with the dates, possibly to change the order of the data and to group rows by month or day, or possibly to calculate how far apart in time two rows are.
The rest of the fields seem to be different types of numbers; the temperatures are decimals, and the record counts columns are integers. Notice, however, that the heat index temperatures have a variation: When the value for the 'Max Temp for Daily Max Air Temp (F)' field is below 80, the values for the heat index fields aren't reported, but instead are listed as 'Missing', and the record count is 0. Also note that the 'Daily Max Heat Index (F) % Coverage' field is expressed as a percentage of the number of temperature records that also qualify to have a heat index. Both of these issues will be problematic if you want to do any math calculations on the values in those fields, because both 'Missing' and any number ending with % will be parsed as strings, not numbers.
Cleaning data like this can be done at different steps in the process. Quite often, I prefer to clean the data as it's being read from the file, so I might well replace the 'Missing' with a None value or an empty string as the lines are being processed. You could also leave the 'Missing' strings in place and write your code so that no math operations are performed on a value if it is 'Missing'.
TRY THIS: CLEANING DATA
How would you handle the fields with 'Missing' as possible values for math calculations? Can you write a snippet of code that averages one of those columns?
What would you do with the average column at the end so that you could also report the average coverage? In your opinion, would the solution to this problem be at all linked to the way that the 'Missing' entries were handled?
21.4.2. Sorting
As I mentioned earlier, it's often useful to have data in the text file sorted before processing. Sorting the data makes it easier to spot and handle duplicate values, and it can also help bring together related rows for quicker or easier processing. In one case, I received a 20 million-row file of attributes and values, in which an arbitrary number of rows needed to be matched with items from a master SKU list. Sorting the rows by the item ID made gathering each item's attributes much faster. How you do the sorting depends on the size of the data file relative to your available memory and on the complexity of the sort. If all the lines of the file can fit comfortably into available memory, the easiest thing may be to read all of the lines into a list and use the list's sort method:
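A sketch, with datafile.txt standing in for whatever file you need to sort:

>>> lines = open("datafile.txt").readlines()   # read every line into a list
>>> lines.sort()                               # sort the list in place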
You could also use the sorted() function, as in sorted_lines = sorted(lines). This function leaves your original list in its original order, which usually isn't necessary here. The drawback to using the sorted() function is that it creates a new copy of the list. This process takes slightly longer and consumes twice as much memory, which is likely to be the bigger concern.
If the data set is larger than memory and the sort is very simple (just by an easily grabbed field), it may be easier to use an external utility, such as the UNIX sort command, to preprocess the data:
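For example, with the standard UNIX sort utility (the file names are placeholders):

sort datafile.txt > datafile_sorted.txt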
In either case, sorting can be done in reverse order and can be keyed on values within the line rather than just the beginning of the line. For such occasions, you need to study the documentation of the sorting tool you choose to use. A simple example in Python is sorting lines of text case-insensitively. To do this, you give the sort method a key function that lowercases the element before making a comparison:
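One way to do that:

>>> lines.sort(key=str.lower)   # compare lowercased copies, but keep the original lines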
This example uses a lambda function to ignore the first five characters of each string:
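A sketch:

>>> lines.sort(key=lambda line: line[5:])   # compare everything after the first five characters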
Using key functions to determine the behavior of sorts in Python is very handy, but be aware that the key function is called a lot in the process of sorting, so a complex key function could mean a real performance slowdown, particularly with a large data set.
21.4.3. Data cleaning issues and pitfalls
It seems that there are as many types of dirty data as there are sources and use cases for that data. Your data will always have quirks that do everything from making processing less accurate to making it impossible to even load the data. As a result, I can't provide an exhaustive list of the problems you might encounter and how to deal with them, but I can give you some general hints.
Beware of whitespace and null characters. The problem with whitespace characters is that you can't see them, but that doesn't mean that they can't cause trouble. Extra whitespace at the beginning and end of data lines, extra whitespace around individual fields, and tabs instead of spaces (or vice versa) can all make your data loading and processing more troublesome, and these problems aren't always easily apparent. Similarly, text files with null characters (ASCII 0) may seem okay on inspection but break on loading and processing.
Beware punctuation. Punctuation can also be a problem. Extra commas or periods can mess up CSV files and the processing of numeric fields, and unescaped or unmatched quote characters can also confuse things.
Break down and debug the steps. It's easier to debug a problem if each step is separate, which means putting each operation on a separate line, being more verbose, and using more variables. But the work is worth it. For one thing, it makes any exceptions that are raised easier to understand, and it also makes debugging easier, whether with print statements, logging, or the Python debugger. It may also be helpful to save the data after each step and to cut the file size to just a few lines that cause the error.
21.5. Writing data files
The last part of the ETL process may involve saving the transformed data to a database (which I discuss in chapter 23), but often it involves writing the data to files. These files may be used as input for further analysis or by other applications, whether the consumers are people or programs. Usually, you have a particular file specification listing what fields of data should be included, what they should be named, what format and constraints there should be for each, and so on.
21.5.1. CSV and other delimited files
Probably the easiest thing of all is to write your data to CSV files. Because you've already loaded, parsed, cleaned, and transformed the data, you're unlikely to hit any unresolved issues with the data itself. And again, using the csv module from the Python standard library makes your work much easier.
Writing delimited files with the csv module is pretty much the reverse of the read process. Again, you need to specify the delimiter that you want to use, and again, the csv module takes care of any situations in which your delimiting character is included in a field:
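A sketch, writing rows that are lists of values (like the ones gathered earlier with csv.reader) to a new file; the file name is arbitrary:

>>> with open("temp_data_03.csv", "w", newline='') as outfile:
...     csv.writer(outfile).writerows(results)   # comma is the default delimiter
...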
This code results in the following file:
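With the illustrative rows from earlier, the output would look something like this; note that on the way out, only the values that actually contain a comma are quoted:

Notes,State,State Code,"Month Day, Year","Month Day, Year Code",Avg Daily Max Air Temperature (F),Record Count for Daily Max Air Temp,Min Temp for Daily Max Air Temp (F),Max Temp for Daily Max Air Temp (F),Avg Daily Max Heat Index (F),Record Count for Daily Max Heat Index,Min for Daily Max Heat Index (F),Max for Daily Max Heat Index (F),Daily Max Heat Index (F) % Coverage
,Illinois,17,"Jan 01, 1979",1979/01/01,17.48,994,6.00,30.50,Missing,0,Missing,Missing,0.00%
,Illinois,17,"Jan 02, 1979",1979/01/02,4.64,994,-6.40,15.80,Missing,0,Missing,Missing,0.00%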
Just as when reading from a CSV file, it's possible to write dictionaries instead of lists if you use a DictWriter. If you do use a DictWriter, be aware of a couple of points: You must specify the field names in a list when you create the writer, and you can use the DictWriter's writeheader method to write the header at the top of the file. So assume that you have the same data as previously, but in dictionary format:
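For instance, a trimmed-down version of the rows as dictionaries might look like this (only a few of the fields are shown):

>>> data = [{'State': 'Illinois', 'Month Day, Year Code': '1979/01/01',
...          'Avg Daily Max Air Temperature (F)': '17.48',
...          'Record Count for Daily Max Air Temp': '994'},
...         {'State': 'Illinois', 'Month Day, Year Code': '1979/01/02',
...          'Avg Daily Max Air Temperature (F)': '4.64',
...          'Record Count for Daily Max Air Temp': '994'}]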
You can use a DictWriter object from the csv module to write each row, a dictionary, to the correct fields in the CSV file:
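A sketch using those dictionaries; the field list passed to the DictWriter must match the dictionary keys:

>>> import csv
>>> fields = ['State', 'Month Day, Year Code',
...           'Avg Daily Max Air Temperature (F)',
...           'Record Count for Daily Max Air Temp']
>>> with open("temp_data_04.csv", "w", newline='') as outfile:
...     dict_writer = csv.DictWriter(outfile, fieldnames=fields)
...     dict_writer.writeheader()      # write the field names as the first row
...     dict_writer.writerows(data)    # write each dictionary to its matching columns
...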
21.5.2. Writing Excel files
Writing spreadsheet files is unsurprisingly similar to reading them. You need to create a workbook, or spreadsheet file; then you need to create a sheet or sheets; and finally, you write the data in the appropriate cells. You could create a new spreadsheet from your CSV data file like this:
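A sketch with openpyxl, reading the CSV written earlier and copying its rows into a new workbook (the file and sheet names are arbitrary):

>>> import csv
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> ws = wb.active
>>> ws.title = "temperature data"
>>> with open("temp_data_03.csv", newline='') as infile:
...     for row in csv.reader(infile):
...         ws.append(row)           # each list of values becomes one row of cells
...
>>> wb.save("temp_data_02.xlsx")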
It's also possible to add formatting to cells as you write them to the spreadsheet file. For more on how to add formatting, please refer to the openpyxl documentation.
21.5.3. Packaging data files
If you have several related data files, or if your files are large, it may make sense to package them in a compressed archive. Although various archive formats are in use, the zip file remains popular and almost universally accessible to users on almost every platform. For hints on how to create zip-file packages of your data files, please refer to chapter 20.
LAB 21: WEATHER OBSERVATIONS
The file of weather observations provided here is by month and then by county for the state of Illinois from 1979 to 2011. Write the code to process this file to extract the data for Chicago (Cook County) into a single CSV or spreadsheet file. This process includes replacing the 'Missing' strings with empty strings and translating the percentage to a decimal. You may also consider what fields are repetitive (and therefore can be omitted or stored elsewhere). The proof that you've got it right occurs when you load the file into a spreadsheet. You can download a solution with the book's source code.
Summary
ETL (extract-transform-load) is the process of getting data from one format, making sure that it's consistent, and then putting it in a format you can use. ETL is the basic step in most data processing.
Encoding can be problematic with text files, but Python lets you deal with some encoding problems when you load files.
Delimited or CSV files are common, and the best way to handle them is with the csv module.
Spreadsheet files can be more complex than CSV files but can be handled much the same way.
Currency symbols, punctuation, and null characters are among the most common data cleaning issues; be on the watch for them.
Presorting your data file can make other processing steps faster.