Case study

In this case study, you walk through using Python to fetch some data, clean it, and then graph it. This project may be a short one, but it combines several features of the language I've discussed, and it gives you a chance to see a project worked through from beginning to end. At almost every step, I briefly call out alternatives and enhancements that you can make.
Global temperature change is the topic of much discussion, but those discussions take place on a global scale. Suppose that you want to know what temperatures have been doing near where you are. One way of finding out is to get historical data for your location, process that data, and plot it to see exactly what's been happening.
GETTING THE CASE STUDY CODE
The following case study was done by using a Jupyter notebook, as explained in chapter 24. If you’re using Jupyter, you can find the notebook I used (with this text and code) in the source code downloads as Case Study.ipynb. You can also execute the code in a standard Python shell, and a version that supports that shell is in the source code as Case Study.py.
Fortunately, several sources of historical weather data are freely available. I’m going to walk you through using data from the Global Historical Climatology Network, which has data from around the world. You may find other sources, which may have different data formats, but the steps and the processes I discuss here should be generally applicable to any data set.
Downloading the data
The first step is to get the data. An archive of daily historical weather data at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ has a wide array of data. You need to figure out which files you want and exactly where they are; then you download them. When you have the data, you can move on to processing and ultimately displaying your results.
To download the files, which are accessible via HTTPS, you need the requests library. You can get requests with pip install requests at the command prompt. When you have requests, your first step is to fetch the readme.txt file, which can guide you as to the formats and location of the data files you want:
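Here's a minimal sketch of that fetch with requests; the BASE_URL and readme_txt names are my own choices, not anything required by the archive.

```python
import requests

# Base URL of the GHCN daily archive (from the link above)
BASE_URL = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/"

# Fetch the readme file and keep its text for inspection
r = requests.get(BASE_URL + "readme.txt")
readme_txt = r.text

print(readme_txt[:200])    # peek at the beginning of the file
```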
When you look at the readme file, you should see that it describes the files in the archive and the formats of the data they contain.
In particular, you're interested in section II, which lists the contents of the directory.
As you look at the files available, you see that ghcnd-inventory.txt has a listing of the recording periods for each station, which will help you find a good data set; and ghcnd-stations.txt lists the stations, which should help you find the station closest to your location, so you’ll grab those two files first:
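Fetching those two files works the same way as the readme; again, inventory_txt and stations_txt are simply the names I'm using for the downloaded text.

```python
# Fetch the station inventory and the station listing
r = requests.get(BASE_URL + "ghcnd-inventory.txt")
inventory_txt = r.text

r = requests.get(BASE_URL + "ghcnd-stations.txt")
stations_txt = r.text
```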
When you have those files, you can save them to your local disk so that you won’t need to download them again if you need to go back to the original data:
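A simple way to do that, assuming you're happy with plain text files named inventory.txt and stations.txt in the working directory:

```python
# Save local copies so the downloads don't need to be repeated
with open("inventory.txt", "w") as inventory_file:
    inventory_file.write(inventory_txt)

with open("stations.txt", "w") as stations_file:
    stations_file.write(stations_txt)
```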
Start by looking at the inventory file. Printing the first 137 characters gives you a sense of what each line looks like:
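Assuming the downloaded text is in inventory_txt as above:

```python
# Peek at the beginning of the inventory listing
print(inventory_txt[:137])
```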
The readme file defines each of these fields: the station ID, latitude, longitude, element, and the first and last years of records.
From this description, you can tell that the inventory list has most of the information you need to find the station you want to look at. You can use the latitude and longitude to find the stations closest to you; then you can use the FIRSTYEAR and LASTYEAR fields to find a station with records covering a long span of time.
The only question remaining is what the ELEMENT field is; for that, the file suggests that you look at section III. In section III (which I look at in more detail later), you find a description of the main elements.
For purposes of this example, you’re interested in the TMAX and TMIN elements, which are maximum and minimum temperatures in tenths of degrees Celsius.
Parsing the inventory data
The readme.txt file tells you what you’ve got in the inventory file so that you can parse the data into a more usable format. You could just store the parsed inventory data as a list of lists or list of tuples, but it takes only a little more effort to use namedtuple from the collections library to create a custom class with the attributes named:
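A sketch of that namedtuple definition; the class and field names here are my own choices.

```python
from collections import namedtuple

# One record per line of the inventory file
Inventory = namedtuple(
    "Inventory",
    ["station", "latitude", "longitude", "element", "start", "end"],
)
```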
Using the Inventory class you created is very straightforward; you simply create each instance from the appropriate values, which in this case come from a parsed row of inventory data.
The parsing involves two steps. First, you need to pick out slices of a line according to the field sizes specified. As you look at the field descriptions in the readme file, it's also clear that there's an extra space between fields, which you need to consider in coming up with any approach to parsing. In this case, because you're specifying each slice, the extra spaces are ignored. In addition, because the sizes of the STATION and ELEMENT fields exactly correspond to the values stored in them, you shouldn't need to worry about stripping excess spaces from them.
The second thing that would be nice to do is convert the latitude and longitude values to floats and the start and end years to ints. You could do this at a later stage of data cleaning, and in fact, if the data is inconsistent and doesn’t have values that convert correctly in every row, you might want to wait. But in this case, the data lets you handle these conversions in the parsing step, so do it now:
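Here's one way the parsing might look; the slice positions are taken from my reading of the field widths described in the readme, so double-check them against your copy of the file.

```python
# Parse each nonempty inventory line into an Inventory tuple,
# converting latitude/longitude to floats and the years to ints
inventory = [
    Inventory(
        line[0:11],            # station ID
        float(line[12:20]),    # latitude
        float(line[21:30]),    # longitude
        line[31:35],           # element, e.g. TMAX, TMIN, PRCP
        int(line[36:40]),      # first year of records
        int(line[41:45]),      # last year of records
    )
    for line in inventory_txt.split("\n")
    if line
]

inventory[:5]    # sanity-check the first few parsed entries
```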
Selecting a station based on latitude and longitude
Now that the inventory is loaded, you can use the latitude and longitude to find the stations closest to your location and then pick the one with the longest run of temperatures based on start and end years. Even in the first lines of the data, you can see two things to worry about:
There are various element types, but you’re concerned only with TMIN and TMAX, for minimum and maximum temperature.
None of the first inventory entries you see covers more than a few years. If you’re going to be looking for an historical perspective, you want to find a much longer run of temperature data.
To pick out what you need quickly, you can use a list comprehension to make a sublist of only the station inventory items in which the element is TMIN or TMAX. The other thing that you care about is getting a station with a long run of data, so while you're creating this sublist, also make sure that the start year is before 1920 and that the end year is at least 2015. That way, you're looking only at stations with at least 95 years' worth of data:
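A sketch of that filter, assuming the parsed entries are in the inventory list from the previous step:

```python
# Keep only TMIN/TMAX entries with records from before 1920 through at least 2015
inventory_temps = [
    x for x in inventory
    if x.element in ("TMIN", "TMAX") and x.start < 1920 and x.end >= 2015
]

inventory_temps[:5]    # the first five long-running temperature stations
```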
Looking at the first five records in your new list, you see that you’re in better shape. Now you have only temperature records, and the start and end years show that you have longer runs.
That leaves the problem of selecting the station nearest your location. To do that, compare the latitude and longitude of the station inventories with those of your location. There are various ways to get the latitude and longitude of any place, but probably the easiest way is to use an online mapping application or online search. (When I do that for the Chicago Loop, I get a latitude of 41.882 and a longitude of -87.629.)
Because you're interested in the stations closest to your location, you want to sort based on how close each station's latitude and longitude are to yours. Sorting a list is easy enough, and sorting by latitude and longitude isn't too hard. But how do you sort by the distance from your latitude and longitude?
The answer is to define a key function for your sort that takes the difference between your latitude and the station's latitude and the difference between your longitude and the station's longitude, and combines them into one number. The only other thing to remember is to take the absolute value of each difference before you combine them, so that a large negative difference can't cancel an equally large positive difference and fool your sort:
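One way to express that sort, using the Chicago Loop coordinates mentioned below (swap in your own latitude and longitude):

```python
# Target location (the Chicago Loop in this example)
latitude, longitude = 41.882, -87.629

# Sort so the stations whose coordinates are closest to the target come first;
# taking the absolute value keeps opposite-signed differences from canceling
inventory_temps.sort(
    key=lambda x: abs(latitude - x.latitude) + abs(longitude - x.longitude)
)

inventory_temps[:20]    # the 20 nearest long-running temperature stations
```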
Selecting a station and getting the station metadata
As you look at the top 20 entries in your newly sorted list, it seems that the first station, USC00110338, is a good fit. It’s got both TMIN and TMAX and one of the longer series, starting in 1893 and running up through 2017, for more than 120 years’ worth of data. So save that station into your station variable and quickly parse the station data you’ve already grabbed to pick up a little more information about the station.
Back in the readme file, you find a description of the fields in the station data.
Although you might care more about the metadata fields for more serious research, right now you want to match the start and end year from the inventory records to the rest of the station metadata in the station file.
You have several ways to sift through the stations file to find the one station that matches the station ID you selected. You could create a for loop to go through each line and break out when you find it; you could split the data into lines and then sort and use a binary search, and so on. Depending on the nature and amount of data you have, one approach or another might be appropriate. In this case, because you have the data loaded already, and it’s not too large, use a list comprehension to return a list with its single element being the station you’re looking for:
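Here's a sketch of that lookup; station_id and station are my own variable names.

```python
# The station chosen from the sorted inventory list
station_id = "USC00110338"

# Pull the matching line out of the stations listing; the result is a
# one-element list holding that station's metadata
station = [x for x in stations_txt.split("\n") if x.startswith(station_id)]
print(station)
```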
At this point, you’ve identified that you want weather data from the station at Aurora, Illinois, which is the nearest station to downtown Chicago with more than a century’s worth of temperature data.
Fetching and parsing the actual weather data
With the station identified, the next step is fetching the actual weather data for that station and parsing it. The process is quite similar to what you did in the preceding section.
Fetching the data
First, fetch the data file and save it, in case you need to go back to it:
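A sketch of that fetch; I'm assuming the daily files live in the archive's all/ subdirectory with a .dly extension, which is worth confirming against the readme.

```python
# Fetch the daily weather file for the chosen station and keep a local copy
r = requests.get(BASE_URL + "all/" + station_id + ".dly")
weather_txt = r.text

with open("weather_{}.txt".format(station_id), "w") as weather_file:
    weather_file.write(weather_txt)
```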
Parsing the weather data
Again, now that you have the data, you can see it's quite a bit more complex than the station and inventory data. Clearly, it's time to head back to the readme.txt file and section III, which describes a weather data file. The format has a lot of options, so filter them down to the ones that concern you, leaving out the other element types as well as the whole system of flags specifying the source, quality, and type of the values.
The key points you care about right now are that the station ID is the first 11 characters of a row, the year is the next 4, the month the next 2, and the element the next 4 after that. After that, there are 31 slots for daily data, with each slot consisting of 5 characters for the temperature, expressed in tenths of a degree Celsius, and 3 characters of flags. As I mentioned earlier, you can disregard the flags for this exercise. You can also see that missing temperature values are coded as -9999, including days that don't occur in the month; for a typical February, for example, the 29th, 30th, and 31st values would be -9999.
As you process your data in this exercise, you’re looking to get overall trends, so you don’t need to worry much about individual days. Instead, find average values for the month. You can save the maximum, minimum, and mean values for the entire month and use those.
This means that to process each line of weather data, you need to:
Split the line into its separate fields, and ignore or discard the flags for each daily value.
Remove the values with -9999, and convert the year and month into ints and the temperature values into floats, keeping in mind that the temperature readings are in tenths of a degree Celsius.
Calculate the average value, and pick out the high and low values.
To accomplish all these tasks, you can take a couple of approaches. You could do several passes over the data, splitting into fields, discarding the placeholders, converting strings to numbers, and finally calculating the summary values. Or you can write a function that performs all of these operations on a single line and do everything in one pass. Both approaches can be valid. In this case, take the latter approach and create a parse_line function to perform all of your data transformations:
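Here's one possible parse_line that follows the steps listed above; the slice positions and the 8-character width of each daily slot come from my reading of the readme, so verify them against your copy.

```python
def parse_line(line):
    """Parse one line of daily weather data into a tuple of
    (station, year, month, element, tmax, tmin, mean)."""
    if not line:
        return None
    # First 21 characters: station ID (11), year (4), month (2), element (4);
    # the rest is 31 daily slots of 5 value characters plus 3 flag characters
    record = (line[:11], int(line[11:15]), int(line[15:17]), line[17:21])
    temperature_string = line[21:]
    if len(temperature_string) < 248:
        return None
    # Pick out each 5-character value, skip the -9999 placeholders,
    # and convert from tenths of a degree Celsius to degrees
    values = [
        float(temperature_string[i:i + 5]) / 10
        for i in range(0, 248, 8)
        if not temperature_string[i:i + 5].startswith("-9999")
    ]
    if not values:
        return None
    # Summarize the month with its high, low, and mean values
    tmax = round(max(values), 1)
    tmin = round(min(values), 1)
    mean = round(sum(values) / len(values), 1)
    return record + (tmax, tmin, mean)
```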
If you test this function with the first line of your raw weather data, you can confirm that it returns the parsed header fields along with the summary temperature values.
So it looks like you have a function that will work to parse your data. If that function works, you can parse the weather data and either store it or continue with your processing:
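Parsing the whole file might then look like this, with weather_data as my name for the resulting list:

```python
# Parse every line of the raw weather data, dropping empty or short lines
weather_data = [w for w in (parse_line(x) for x in weather_txt.split("\n")) if w]

len(weather_data), weather_data[:5]    # record count and a small sample
```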
Now you have all the weather records, not just the temperature records, parsed and in your list.
Saving the weather data in a database (optional)
At this point, you can save all of the weather records (and the station records and inventory records as well, if you want) in a database. Doing so lets you come back in later sessions and use the same data without having to go through the hassle of fetching and parsing it again.
As an example, the following code shows how you could save the weather data in a sqlite3 database:
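A sketch of that, using the standard library's sqlite3 module; the database and table names are my own choices.

```python
import sqlite3

conn = sqlite3.connect("weather_data.db")
cursor = conn.cursor()

# One row per station/year/month/element, with the monthly summary values
cursor.execute(
    """CREATE TABLE IF NOT EXISTS weather (
           id TEXT, year INTEGER, month INTEGER, element TEXT,
           max REAL, min REAL, mean REAL
       )"""
)
cursor.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?, ?)", weather_data
)
conn.commit()
```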
When you have the data stored, you could retrieve it from the database with code like the following, which fetches only the TMAX records:
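For example, something along these lines, reusing the cursor from the previous step:

```python
# Fetch only the TMAX rows back out of the database
cursor.execute(
    "SELECT * FROM weather WHERE element = 'TMAX' ORDER BY year, month"
)
tmax_from_db = cursor.fetchall()
tmax_from_db[:5]
```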
Selecting and graphing data
Because you're concerned only with temperature, you need to select just the temperature records. You can do that quickly enough by using a couple of list comprehensions to pick out a list for TMAX and one for TMIN. Or you could use the features of pandas, which you'll be using to graph the data, to filter out the records you don't want. Because you're more concerned with pure Python than with pandas, take the first approach:
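A sketch of those two comprehensions, relying on the element being the fourth field of each parsed record:

```python
# Split the parsed records into maximum- and minimum-temperature lists
tmax_data = [x for x in weather_data if x[3] == "TMAX"]
tmin_data = [x for x in weather_data if x[3] == "TMIN"]
```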
Using pandas to graph your data
At this point, you have your data cleaned and ready to graph. To make the graphing easier, you can use pandas and matplotlib, as described in chapter 24. To do this, you need to have a Jupyter server running and have pandas and matplotlib installed. To install them from within your Jupyter notebook, use the following command:
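In a reasonably recent Jupyter/IPython setup, the %pip magic handles this; older installations may need !pip instead.

```python
# Install (or confirm) pandas and matplotlib from inside the notebook
%pip install pandas matplotlib
```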
When pandas and matplotlib are installed, you can load pandas and create data frames for your TMAX and TMIN data:
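One way to set that up; the column labels are my own, chosen to match the fields returned by parse_line.

```python
%matplotlib inline
import pandas as pd

# Build data frames from the TMAX and TMIN records, labeling the columns
columns = ["station", "year", "month", "element", "max", "min", "mean"]
tmax_df = pd.DataFrame(tmax_data, columns=columns)
tmin_df = pd.DataFrame(tmin_data, columns=columns)
```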
You could plot the monthly values, but 123 years times 12 months of data is almost 1,500 data points, and the cycle of seasons also makes picking out patterns difficult.
Instead, it probably makes more sense to average the high, low, and mean monthly values into yearly values and plot those values. You could do this in Python, but because you already have your data loaded in a pandas data frame, you can use that to group by year and get the mean values:
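A sketch of that grouping and plot for the minimum-temperature frame; the same pattern works for tmax_df.

```python
# Average the monthly summaries into one value per year and plot the result
tmin_df[["year", "min", "max", "mean"]].groupby("year").mean().plot(
    kind="line", figsize=(16, 4)
)
```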
This result has a fair amount of variation, but it does seem to indicate that the minimum temperature has been on the rise for the past 20 years.
Note that if you wanted to get the same graph without using a Jupyter notebook and matplotlib, you could still use pandas, but you'd write to a CSV or Microsoft Excel file, using the data frame's to_csv or to_excel method. Then you could load the resulting file into a spreadsheet and graph from there.