In this case study, you walk through using Python to fetch some data, clean it, and then graph it. This project may be a short one, but it combines several features of the language I’ve discussed, and it gives you a chance to a see a project worked through from beginning to end. At almost every step, I briefly call out alternatives and enhancements that you can make.
Global temperature change is the topic of much discussion, but those discussions are based on a global scale. Suppose that you want to know what the temperatures have been doing near where you are. One way of finding out is to get historical data for your location, process that data, and plot it to see exactly what’s been happening.
GETTING THE CASE STUDY CODE
The following case study was done by using a Jupyter notebook, as explained in chapter 24. If you’re using Jupyter, you can find the notebook I used (with this text and code) in the source code downloads as Case Study.ipynb. You can also execute the code in a standard Python shell, and a version that supports that shell is in the source code as Case Study.py.
Fortunately, several sources of historical weather data are freely available. I’m going to walk you through using data from the Global Historical Climatology Network, which has data from around the world. You may find other sources, which may have different data formats, but the steps and the processes I discuss here should be generally applicable to any data set.
Downloading the data
The first step will be to get the data. An archive of daily historical weather data at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ has a wide array of data. The first step is to figure out which files you want and exactly where they are; then you download them. When you have the data, you can move on to processing and ultimately displaying your results.
To download the files, which are accessible via HTTPS, you need the requests library. You can get requests with pip install requests at the command prompt. When you have requests, your first step is to fetch the readme.txt file, which can guide you as to the formats and location of the data files you want:
When you look at the readme file, you should see something like this:
print(readme)
README FILE FOR DAILY GLOBAL HISTORICAL CLIMATOLOGY NETWORK (GHCN-DAILY)
Version 3.22
----------------------------------------------------------------------------
How to cite:
Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier)
so it may be relevant to cite both the methods/overview journal article as well
as the specific version of the dataset used.
The journal article describing GHCN-Daily is:
Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An
overview
of the Global Historical Climatology Network-Daily Database. Journal of
Atmospheric
and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
To acknowledge the specific version of the dataset used, please cite:
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S.
Anthony, R. Ray,
R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology
Network -
Daily (GHCN-Daily), Version 3. [indicate subset used following decimal,
e.g. Version 3.12].
NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access
date].
copy
In particular, you’re interested in section II, which lists the contents:
II. CONTENTS OF ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily
all: Directory with ".dly" files for all of GHCN-Daily
gsn: Directory with ".dly" files for the GCOS Surface Network
(GSN)
hcn: Directory with ".dly" files for U.S. HCN
by_year: Directory with GHCN Daily files parsed into yearly
subsets with observation times where available. See the
/by_year/readme.txt and
/by_year/ghcn-daily-by_year-format.rtf
files for further information
grid: Directory with the GHCN-Daily gridded dataset known
as HadGHCND
papers: Directory with pdf versions of journal articles relevant
to the GHCN-Daily dataset
figures: Directory containing figures that summarize the inventory
of GHCN-Daily station records
ghcnd-all.tar.gz: TAR file of the GZIP-compressed files in the "all"
directory
ghcnd-gsn.tar.gz: TAR file of the GZIP-compressed "gsn" directory
ghcnd-hcn.tar.gz: TAR file of the GZIP-compressed "hcn" directory
ghcnd-countries.txt: List of country codes (FIPS) and names
ghcnd-inventory.txt: File listing the periods of record for each station and
element
ghcnd-stations.txt: List of stations and their metadata (e.g., coordinates)
ghcnd-states.txt: List of U.S. state and Canadian Province codes
used in ghcnd-stations.txt
ghcnd-version.txt: File that specifies the current version of GHCN Daily
readme.txt: This file
status.txt: Notes on the current status of GHCN-Daily
copy
As you look at the files available, you see that ghcnd-inventory.txt has a listing of the recording periods for each station, which will help you find a good data set; and ghcnd-stations.txt lists the stations, which should help you find the station closest to your location, so you’ll grab those two files first:
II. CONTENTS OF ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily
all: Directory with ".dly" files for all of GHCN-Daily
gsn: Directory with ".dly" files for the GCOS Surface
Network
(GSN)
hcn: Directory with ".dly" files for U.S. HCN
by_year: Directory with GHCN Daily files parsed into yearly
subsets with observation times where available. See
the
/by_year/readme.txt and
/by_year/ghcn-daily-by_year-format.rtf
files for further information
grid: Directory with the GHCN-Daily gridded dataset known
as HadGHCND
papers: Directory with pdf versions of journal articles relevant
to the GHCN-Daily dataset
figures: Directory containing figures that summarize the inventory
of GHCN-Daily station records
ghcnd-all.tar.gz: TAR file of the GZIP-compressed files in the "all"
directory
ghcnd-gsn.tar.gz: TAR file of the GZIP-compressed "gsn" directory
ghcnd-hcn.tar.gz: TAR file of the GZIP-compressed "hcn" directory
ghcnd-countries.txt: List of country codes (FIPS) and names
ghcnd-inventory.txt: File listing the periods of record for each station and
element
ghcnd-stations.txt: List of stations and their metadata (e.g., coordinates)
ghcnd-states.txt: List of U.S. state and Canadian Province codes
used in ghcnd-stations.txt
ghcnd-version.txt: File that specifies the current version of GHCN Daily
readme.txt: This file
status.txt: Notes on the current status of GHCN-Daily
# get inventory and stations files
r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-
inventory.txt')
inventory_txt = r.text
r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-
stations.txt')
stations_txt = r.text
copy
When you have those files, you can save them to your local disk so that you won’t need to download them again if you need to go back to the original data:
# save both the inventory and stations files to disk, in case we need themwithopen("inventory.txt", "w")as inventory_file: inventory_file.write(inventory_txt)withopen("stations.txt", "w")as stations_file: stations_file.write(stations_txt)
copy
Start by looking at the inventory file. Here’s what the first 137 characters show you:
print(inventory_txt[:137])ACW00011604 17.1167-61.7833 TMAX 19491949ACW00011604 17.1167-61.7833 TMIN 19491949ACW00011604 17.1167-61.7833 PRCP 19491949If we look at section VII of the readme.txt file we can see that the format of the inventory file is:VII. FORMAT OF "ghcnd-inventory.txt"------------------------------Variable Columns Type------------------------------ID 1-11 CharacterLATITUDE 13-20 RealLONGITUDE 22-30 RealELEMENT 32-35 CharacterFIRSTYEAR 37-40 IntegerLASTYEAR 42-45 Integer------------------------------
copy
These variables have the following definitions:
ID is the station identification code. Please see "ghcnd-stations.txt"for a complete list of stations and their metadata.LATITUDE is the latitude of the station (in decimal degrees).LONGITUDE is the longitude of the station (in decimal degrees).ELEMENT is the element type. See section III for a definition of elements.FIRSTYEAR is the first year of unflagged data for the given element.LASTYEAR is the last year of unflagged data for the given element.
copy
From this description, you can tell that the inventory list has most of the information you need to find the station you want to look at. You can use the latitude and longitude to find the stations closest to you; then you can use the FIRSTYEAR and LASTYEAR fields to find a station with records covering a long span of time.
The only question remaining is what the ELEMENT field is; for that, the file suggests that you look at section III. In section III (which I look at in more detail later), you find the following description of the main elements:
ELEMENT is the element type. There are five core elements as well as a number of addition elements. The five core elements are: PRCP =Precipitation (tenths of mm) SNOW =Snowfall (mm) SNWD = Snow depth (mm) TMAX = Maximum temperature (tenths of degrees C) TMIN = Minimum temperature (tenths of degrees C)
For purposes of this example, you’re interested in the TMAX and TMIN elements, which are maximum and minimum temperatures in tenths of degrees Celsius.
Parsing the inventory data
The readme.txt file tells you what you’ve got in the inventory file so that you can parse the data into a more usable format. You could just store the parsed inventory data as a list of lists or list of tuples, but it takes only a little more effort to use namedtuple from the collections library to create a custom class with the attributes named:
# parse to named tuples# use namedtuple to create a custom Inventory classfrom collections import namedtupleInventory =namedtuple("Inventory", ['station', 'latitude', 'longitude','element', 'start', 'end'])
Using the Inventory class you created is very straightforward; you simply create each instance from the appropriate values, which in this case are a parsed row of inventory data.
The parsing involves two steps. First, you need to pick out slices of a line according to the field sizes specified. As you look at the field descriptions in the readme file, it’s also clear that there’s an extra space between files, which you need to consider in coming up with any approach to parsing. In this case, because you’re specifying each slice, the extra spaces are ignored. In addition, because the sizes of the STATION and ELEMENT fields exactly correspond to the values stored in them, you shouldn’t need to worry about stripping excess spaces from them.
The second thing that would be nice to do is convert the latitude and longitude values to floats and the start and end years to ints. You could do this at a later stage of data cleaning, and in fact, if the data is inconsistent and doesn’t have values that convert correctly in every row, you might want to wait. But in this case, the data lets you handle these conversions in the parsing step, so do it now:
# parse inventory lines and convert some values to floats and intsinventory = [Inventory(x[0:11], float(x[12:20]), float(x[21:30]), x[31:35],int(x[36:40]), int(x[41:45]))for x in inventory_txt.split("\n")if x.startswith("US")]for line in inventory[:5]:print(line)Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333, element='TMAX', start=2008, end=2016)Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333, element='TMIN', start=2008, end=2016)Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333, element='PRCP', start=2008, end=2016)Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333, element='SNWD', start=2009, end=2016)Inventory(station='US10RMHS145', latitude=40.5268, longitude=-105.1113, element='PRCP', start=2004, end=2004)
Selecting a station based on latitude and longitude
Now that the inventory is loaded, you can use the latitude and longitude to find the stations closest to your location and then pick the one with the longest run of temperatures based on start and end years. At even the first line of the data, you can see two things to worry about:
There are various element types, but you’re concerned only with TMIN and TMAX, for minimum and maximum temperature.
None of the first inventory entries you see covers more than a few years. If you’re going to be looking for an historical perspective, you want to find a much longer run of temperature data.
To pick out what you need quickly, we can use a list comprehension to make a sublist of only the station inventory items in which the element is TMIN or TMAX. The other thing that you care about is getting a station with a long run of data, so while you’re creating this sublist, also make sure that the start year is before 1920 and that the end year is at least 2015. That way, you’re looking only at stations with at least 95 years’ worth of data:
inventory_temps = [x for x in inventory if x.element in ['TMIN','TMAX']and x.end >=2015and x.start <1920]inventory_temps[:5][Inventory(station='USC00010252', latitude=31.3072, longitude=-86.5225, element='TMAX', start=1912, end=2017),Inventory(station='USC00010252', latitude=31.3072, longitude=-86.5225, element='TMIN', start=1912, end=2017),Inventory(station='USC00010583', latitude=30.8839, longitude=-87.7853, element='TMAX', start=1915, end=2017),Inventory(station='USC00010583', latitude=30.8839, longitude=-87.7853, element='TMIN', start=1915, end=2017),Inventory(station='USC00012758', latitude=31.445, longitude=-86.9533, element='TMAX', start=1890, end=2017)]
Looking at the first five records in your new list, you see that you’re in better shape. Now you have only temperature records, and the start and end years show that you have longer runs.
That leaves the problem of selecting the station nearest your location. To do that, compare the latitude and longitude of the station inventories with those of your location. There are various ways to get the latitude and longitude of any place, but probably the easiest way is to use an online mapping application or online search. (When I do that for the Chicago Loop, I get a latitude of 41.882 and a longitude of -87.629.)
Because you’re interested in the stations closest to your location, that interest implies sorting based on how close the latitude and longitude of the stations are to those of your location. Sorting a list is easy enough, and sorting by latitude and longitude isn’t too hard. But how do you sort by the distance from your latitude and longitude?
The answer is to define a key function for your sort that gets the difference between your latitude and the station’s latitude, and the difference between your longitude and the station’s longitude, and combines them into one number. The only other thing to remember is that you’ll want to add the absolute value of the differences before you combine them to avoid having a high negative difference combined with an equally high positive difference that would fool your sort:
Selecting a station and getting the station metadata
As you look at the top 20 entries in your newly sorted list, it seems that the first station, USC00110338, is a good fit. It’s got both TMIN and TMAX and one of the longer series, starting in 1893 and running up through 2017, for more than 120 years’ worth of data. So save that station into your station variable and quickly parse the station data you’ve already grabbed to pick up a little more information about the station.
Back in the readme file, you find the following information about the station data:
IV. FORMAT OF "ghcnd-stations.txt"------------------------------Variable Columns Type------------------------------ID 1-11 CharacterLATITUDE 13-20 RealLONGITUDE 22-30 RealELEVATION 32-37 RealSTATE 39-40 CharacterNAME 42-71 CharacterGSN FLAG 73-75 CharacterHCN/CRN FLAG 77-79 CharacterWMO ID 81-85 Character------------------------------These variables have the following definitions:ID is the station identification code. Note that the first two characters denote the FIPS country code, the third characteris a network code that identifies the station numbering system used,and the remaining eight characters contain the actual station ID. See "ghcnd-countries.txt"for a complete list of country codes. See "ghcnd-states.txt"for a list of state/province/territory codes. The network code has the following five values:0 = unspecified (station identified by up to eight alphanumeric characters)1 = Community Collaborative Rain, Hail,andSnow (CoCoRaHS) based identification number. To ensure consistency withwith GHCN Daily,all numbers in the original CoCoRaHS IDs have been left-filled to make them all four digits long. In addition, the characters "-"and"_" have been removed to ensure that the IDs do not exceed 11 characters when preceded by "US1". For example, the CoCoRaHS ID"AZ-MR-156" becomes "US1AZMR0156"in GHCN-Daily C = U.S. Cooperative Network identification number (last six characters of the GHCN-Daily ID) E = Identification number used in the ECA&D non-blended dataset M = World Meteorological Organization ID (last five characters of the GHCN-Daily ID) N = Identification number used in data supplied by a National Meteorological or Hydrological Center R = U.S. Interagency Remote Automatic Weather Station (RAWS) identifier S = U.S. Natural Resources Conservation Service SNOwpackTELemtry (SNOTEL) station identifier W = WBAN identification number (last five characters of the GHCN-Daily ID)LATITUDE is latitude of the station (in decimal degrees).LONGITUDE is the longitude of the station (in decimal degrees).ELEVATION is the elevation of the station (in meters, missing =-999.9).STATE is the U.S. postal code for the state (for U.S. stations only).NAME is the name of the station.GSN FLAG is a flag that indicates whether the station is part of the GCOS Surface Network (GSN). The flag is assigned by cross-referencing the number in the WMOID field with the official list of GSN stations. There are two possible values: Blank = non-GSN station or WMO Station number not available GSN = GSN stationHCN/is a flag that indicates whether the station is part of the U.S.CRN FLAG Historical Climatology Network (HCN). There are three possible values: Blank = Not a member of the U.S. Historical Climatologyor U.S. Climate Reference Networks HCN = U.S. Historical Climatology Network station CRN = U.S. Climate Reference Network or U.S. Regional Climate Network StationWMO ID is the World Meteorological Organization (WMO) number for the station. If the station has no WMO number (or one has not yet been matched to this station), then the field is blank.
Although you might care more about the metadata fields for more serious research, right now you want to match the start and end year from the inventory records to the rest of the station metadata in the station file.
You have several ways to sift through the stations file to find the one station that matches the station ID you selected. You could create a for loop to go through each line and break out when you find it; you could split the data into lines and then sort and use a binary search, and so on. Depending on the nature and amount of data you have, one approach or another might be appropriate. In this case, because you have the data loaded already, and it’s not too large, use a list comprehension to return a list with its single element being the station you’re looking for:
At this point, you’ve identified that you want weather data from the station at Aurora, Illinois, which is the nearest station to downtown Chicago with more than a century’s worth of temperature data.
Fetching and parsing the actual weather data
With the station identified, the next step is fetching the actual weather data for that station and parsing it. The process is quite similar to what you did in the preceding section.
Fetching the data
First, fetch the data file and save it, in case you need to go back to it:
# fetch daily records for selected stationr = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/all/ {}.dly'.format(station.station_id))weather = r.text# save into a text file, so we won't need to fetch againwithopen('weather_{}.txt'.format(station), "w") as weather_file: weather_file.write(weather)# read from saved daily file if needed (only used if we want to start the process over without downloadng the file)withopen('weather_{}.txt'.format(station))as weather_file: weather = weather_file.read()print(weather[:540])USC00110338189301TMAX -116-446-1396-836-1006-836-726-836-336-1786-1506-1286-1726-2006-1896-1506-1066-616-946-336-336-336-33666-336-786-336446-89 I6 -22666USC00110338189301TMIN -506-1396-2506-1446-1786-2286-1446-2226-1786-2506-2006-2066-2676-2726-2946-2946-3116-2006-2336-1786-1566-896-2006-1946-1946-1786-2006-33 I6 -1566-1396-1676
Parsing the weather data
Again, now that you have the data, you can see it’s quite a bit more complex than the station and inventory data. Clearly, it’s time to head back to the readme.txt file and section III, which is the description of a weather data file. You have a lot of options, so filter them down to the ones that concern you, and leave out the other element types as well as the whole system of flags specifying the source, quality, and type of the values:
III. FORMAT OF DATA FILES (".dly" FILES)Each ".dly" file contains data for one station. The name of the filecorresponds to a station's identification code. For example,"USC00026481.dly"contains the data for the station with the identification code USC00026481).Each record in a file contains one month of daily data. The variables on eachline include the following:------------------------------Variable Columns Type------------------------------ID 1-11 CharacterYEAR 12-15 IntegerMONTH 16-17 IntegerELEMENT 18-21 CharacterVALUE1 22-26 IntegerMFLAG1 27-27 CharacterQFLAG1 28-28 CharacterSFLAG1 29-29 CharacterVALUE2 30-34 IntegerMFLAG2 35-35 CharacterQFLAG2 36-36 CharacterSFLAG2 37-37 Character.........VALUE31 262-266 IntegerMFLAG31 267-267 CharacterQFLAG31 268-268 CharacterSFLAG31 269-269 Character------------------------------These variables have the following definitions:ID is the station identification code. Please see "ghcnd-stations.txt"for a complete list of stations and their metadata.YEAR is the year of the record.MONTH is the month of the record.ELEMENT is the element type. There are five core elements as well as a number of addition elements. The five core elements are: PRCP = Precipitation (tenths of mm) SNOW = Snowfall (mm) SNWD = Snow depth (mm) TMAX = Maximum temperature (tenths of degrees C) TMIN = Minimum temperature (tenths of degrees C)...VALUE1 is the value on the first day of the month (missing =-9999).MFLAG1 is the measurement flag for the first day of the month.QFLAG1 is the quality flag for the first day of the month.SFLAG1 is the source flag for the first day of the month.VALUE2 is the value on the second day of the monthMFLAG2 is the measurement flag for the second day of the month.QFLAG2 is the quality flag for the second day of the month.SFLAG2 is the source flag for the second day of the month.... and so on through the 31st day of the month. Note: If the month has lessthan 31 days, then the remaining variables are set to missing (e.g., for April,VALUE31 =-9999, MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank).
The key points you care about right now are that the station ID is the 11 characters of a row, the year is the next 4, the month the next 2, and the element the next 4 after that. After that, there are 31 slots for daily data, with each slot consisting of 5 characters for the temperature, expressed in tenths of a degree Celsius, and 3 characters of flags. As I mentioned earlier, you can disregard the flags for this exercise. You can also see that missing values for the temperatures are coded with -9999 if that day isn’t in the month, so for a typical February, for example, the 29th, 30th, and 31st values would be -9999.
As you process your data in this exercise, you’re looking to get overall trends, so you don’t need to worry much about individual days. Instead, find average values for the month. You can save the maximum, minimum, and mean values for the entire month and use those.
This means that to process each line of weather data, you need to:
Split the line into its separate fields, and ignore or discard the flags for each daily value.
Remove the values with -9999, and convert the year and month into ints and the temperature values into floats, keeping in mind that the temperature readings are in tenths of degrees centigrade.
Calculate the average value, and pick out the high and low values.
To accomplish all these tasks, you can take a couple of approaches. You could do several passes over the data, splitting into fields, discarding the placeholders, converting strings to numbers, and finally calculating the summary values. Or you can write a function that performs all of these operations on a single line and do everything in one pass. Both approaches can be valid. In this case, take the latter approach and create a parse_line function to perform all of your data transformations:
defparse_line(line):""" parses line of weather data removes values of -9999 (missing value) """# return None if line is emptyifnot line:returnNone# split out first 4 fields and string containing temperature values record, temperature_string = (line[:11],int(line[11:15]),int(line[15:17]), line[17:21]), line[21:]# raise exception if the temperature string is too shortiflen(temperature_string)<248:raiseValueError("String not long enough - {} {}".format(temperature_string, str(line)))# use a list comprehension on the temperature_string to extract and convert the values = [float(temperature_string[i:i +5])/10for i inrange(0, 248, 8)ifnot temperature_string[i:i +5].startswith("-9999")]# get the number of values, the max and min, and calculate average count =len(values) tmax =round(max(values), 1) tmin =round(min(values), 1) mean =round(sum(values)/count, 1)# add the temperature summary values to the record fields extracted earlier andreturnreturn record + (tmax, tmin, mean, count)
If you test this function with the first line of your raw weather data, you get the following result:
So it looks like you have a function that will work to parse your data. If that function works, you can parse the weather data and either store it or continue with your processing:
# process all weather data# list comprehension, will not parse empty linesweather_data = [parse_line(x)for x in weather.split("\n")if x]len(weather_data)weather_data[:10][('USC00110338',1893,1,'TMAX',4.4,-20.0,-7.8,31), ('USC00110338',1893,1,'TMIN',-3.3,-31.1,-19.2,31), ('USC00110338',1893,1,'PRCP',8.9,0.0,1.1,31), ('USC00110338',1893,1,'SNOW',10.2,0.0,1.0,31), ('USC00110338',1893,1,'WT16',0.1,0.1,0.1,2), ('USC00110338',1893,1,'WT18',0.1,0.1,0.1,11), ('USC00110338',1893,2,'TMAX',5.6,-17.2,-0.9,27), ('USC00110338',1893,2,'TMIN',0.6,-26.1,-11.7,27), ('USC00110338',1893,2,'PRCP',15.0,0.0,2.0,28), ('USC00110338',1893,2,'SNOW',12.7,0.0,0.6,28)]
Now you have all the weather records, not just the temperature records, parsed and in your list.
Saving the weather data in a database (optional)
At this point, you can save all of the weather records (and the station records and inventory records as well, if you want) in a database. Doing so lets you come back in later sessions and use the same data without having to go to the hassle of fetching and parsing the data again.
As an example, the following code is how you could save the weather data in a sqlite3 database:
import sqlite3conn = sqlite3.connect("weather_data.db")cursor = conn.cursor()# create weather tablecreate_weather ="""CREATE TABLE "weather" ( "id" text NOT NULL, "year" integer NOT NULL, "month" integer NOT NULL, "element" text NOT NULL, "max" real, "min" real, "mean" real, "count" integer)"""cursor.execute(create_weather)conn.commit()# store parsed weather data in databasefor record in weather_data: cursor.execute("""insert into weather (id, year, month, element, max, min, mean, count) values (?,?,?,?,?,?,?,?) """, record)conn.commit()
When you have the data stored, you could retrieve it from the database with code like the following, which fetches only the TMAX records:
cursor.execute("""select * from weather where element='TMAX' order by year, month""")tmax_data = cursor.fetchall()tmax_data[:5][('USC00110338',1893,1,'TMAX',4.4,-20.0,-7.8,31), ('USC00110338',1893,2,'TMAX',5.6,-17.2,-0.9,27), ('USC00110338',1893,3,'TMAX',20.6,-7.2,5.6,30), ('USC00110338',1893,4,'TMAX',28.9,3.3,13.5,30), ('USC00110338',1893,5,'TMAX',30.6,7.2,19.2,31)]
Selecting and graphing data
Because you’re concerned only with temperature, you need to select just the temperature records. You can do that quickly enough by using a couple of list comprehensions to pick out a list for TMAX and one for TMIN. Or you could use the features of pandas, which you’ll be using for graphing the date, to filter out the records you don’t want. Because you’re more concerned with pure Python than with pandas, take the first approach:
tmax_data = [x for x in weather_data if x[3]=='TMAX']tmin_data = [x for x in weather_data if x[3]=='TMIN']tmin_data[:5][('USC00110338',1893,1,'TMIN',-3.3,-31.1,-19.2,31), ('USC00110338',1893,2,'TMIN',0.6,-26.1,-11.7,27), ('USC00110338',1893,3,'TMIN',3.3,-13.3,-4.6,31), ('USC00110338',1893,4,'TMIN',12.2,-5.6,2.2,30), ('USC00110338',1893,5,'TMIN',14.4,-0.6,5.7,31)]
Using pandas to graph your data
At this point, you have your data cleaned and ready to graph. To make the graphing easier, you can use pandas and matplotlib, as described in chapter 24. To do this, you need to have a Jupyter server running and have pandas and matplotlib installed. To make sure that they’re installed from within your Jupyter notebook, use the following command:
# Install pandas and matplotlib using pip! pip3.6 install pandas matplotlibimport pandas as pd%matplotlib inline
When pandas and matplotlib are installed, you can load pandas and create data frames for your TMAX and TMIN data:
You could plot the monthly values, but 123 years times 12 months of data is almost 1,500 data points, and the cycle of seasons also makes picking out patterns difficult.
Instead, it probably makes more sense to average the high, low, and mean monthly values into yearly values and plot those values. You could do this in Python, but because you already have your data loaded in a pandas data frame, you can use that to group by year and get the mean values:
# select Year, Min, Max, Mean columns, group by year, average and line plottmin_df[['Year','Min','Mean','Max']].groupby('Year').mean().plot( kind='line', figsize=(16, 4))
This result has a fair amount of variation, but it does seem to indicate that the minimum temperature has been on the rise for the past 20 years.
Note that if you wanted to get the same graph without using Jupyter notebook and matplotlib, you could use still use pandas, but you’d write to a CSV or Microsoft Excel file, using the data frame’s to_csv or to_excel method. Then you could load the resulting file into a spreadsheet and graph from there.