Case study

In this case study, you walk through using Python to fetch some data, clean it, and then graph it. This project may be a short one, but it combines several features of the language I’ve discussed, and it gives you a chance to a see a project worked through from beginning to end. At almost every step, I briefly call out alternatives and enhancements that you can make.

Global temperature change is the topic of much discussion, but those discussions are based on a global scale. Suppose that you want to know what the temperatures have been doing near where you are. One way of finding out is to get historical data for your location, process that data, and plot it to see exactly what’s been happening.

GETTING THE CASE STUDY CODE

The following case study was done by using a Jupyter notebook, as explained in chapter 24. If you’re using Jupyter, you can find the notebook I used (with this text and code) in the source code downloads as Case Study.ipynb. You can also execute the code in a standard Python shell, and a version that supports that shell is in the source code as Case Study.py.

Fortunately, several sources of historical weather data are freely available. I’m going to walk you through using data from the Global Historical Climatology Network, which has data from around the world. You may find other sources, which may have different data formats, but the steps and the processes I discuss here should be generally applicable to any data set.

Downloading the data

The first step will be to get the data. An archive of daily historical weather data at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ has a wide array of data. The first step is to figure out which files you want and exactly where they are; then you download them. When you have the data, you can move on to processing and ultimately displaying your results.

To download the files, which are accessible via HTTPS, you need the requests library. You can get requests with pip install requests at the command prompt. When you have requests, your first step is to fetch the readme.txt file, which can guide you as to the formats and location of the data files you want:

# import requests

import requests
# get readme.txt file

r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt')
readme = r.text.

copy

When you look at the readme file, you should see something like this:

print(readme)

README FILE FOR DAILY GLOBAL HISTORICAL CLIMATOLOGY NETWORK (GHCN-DAILY)
Version 3.22

----------------------------------------------------------------------------
How to cite:

Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier)
so it may be relevant to cite both the methods/overview journal article as well
as the specific version of the dataset used.

The journal article describing GHCN-Daily is:
Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012:  An
     overview
of the Global Historical Climatology Network-Daily Database.  Journal of
     Atmospheric
and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.

To acknowledge the specific version of the dataset used, please cite:
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S.
     Anthony, R. Ray,
R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology
     Network -
Daily (GHCN-Daily), Version 3. [indicate subset used following decimal,
e.g. Version 3.12].
NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access
     date].

copy

In particular, you’re interested in section II, which lists the contents:

II. CONTENTS OF ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily

all:                Directory with ".dly" files for all of GHCN-Daily
gsn:                Directory with ".dly" files for the GCOS Surface Network
                    (GSN)
hcn:                  Directory with ".dly" files for U.S. HCN
by_year:              Directory with GHCN Daily files parsed into yearly
                      subsets with observation times where available. See the
              /by_year/readme.txt and
              /by_year/ghcn-daily-by_year-format.rtf
              files for further information
grid:                 Directory with the GHCN-Daily gridded dataset known
                      as HadGHCND
papers:           Directory with pdf versions of journal articles relevant
                      to the GHCN-Daily dataset
figures:          Directory containing figures that summarize the inventory
                      of GHCN-Daily station records

ghcnd-all.tar.gz:  TAR file of the GZIP-compressed files in the "all"
     directory
ghcnd-gsn.tar.gz:  TAR file of the GZIP-compressed "gsn" directory
ghcnd-hcn.tar.gz:  TAR file of the GZIP-compressed "hcn" directory

ghcnd-countries.txt:  List of country codes (FIPS) and names
ghcnd-inventory.txt:  File listing the periods of record for each station and
                      element
ghcnd-stations.txt:   List of stations and their metadata (e.g., coordinates)
ghcnd-states.txt:     List of U.S. state and Canadian Province codes
                      used in ghcnd-stations.txt
ghcnd-version.txt:    File that specifies the current version of GHCN Daily

readme.txt:           This file
status.txt:           Notes on the current status of GHCN-Daily

copy

As you look at the files available, you see that ghcnd-inventory.txt has a listing of the recording periods for each station, which will help you find a good data set; and ghcnd-stations.txt lists the stations, which should help you find the station closest to your location, so you’ll grab those two files first:

II. CONTENTS OF ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily

all:                  Directory with ".dly" files for all of GHCN-Daily
gsn:                  Directory with ".dly" files for the GCOS Surface
     Network
                     (GSN)
hcn:                  Directory with ".dly" files for U.S. HCN
by_year:              Directory with GHCN Daily files parsed into yearly
                      subsets with observation times where available.  See
     the
              /by_year/readme.txt and
              /by_year/ghcn-daily-by_year-format.rtf
              files for further information
grid:                 Directory with the GHCN-Daily gridded dataset known
                      as HadGHCND
papers:           Directory with pdf versions of journal articles relevant
                      to the GHCN-Daily dataset
figures:          Directory containing figures that summarize the inventory
                      of GHCN-Daily station records

ghcnd-all.tar.gz:  TAR file of the GZIP-compressed files in the "all"
                   directory
ghcnd-gsn.tar.gz:  TAR file of the GZIP-compressed "gsn" directory
ghcnd-hcn.tar.gz:  TAR file of the GZIP-compressed "hcn" directory

ghcnd-countries.txt:  List of country codes (FIPS) and names
ghcnd-inventory.txt:  File listing the periods of record for each station and
                      element
ghcnd-stations.txt:   List of stations and their metadata (e.g., coordinates)
ghcnd-states.txt:     List of U.S. state and Canadian Province codes
                      used in ghcnd-stations.txt
ghcnd-version.txt:    File that specifies the current version of GHCN Daily

readme.txt:           This file
status.txt:           Notes on the current status of GHCN-Daily

# get inventory and stations files

r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-
     inventory.txt')
inventory_txt = r.text
r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-
     stations.txt')
stations_txt = r.text

copy

When you have those files, you can save them to your local disk so that you won’t need to download them again if you need to go back to the original data:

# save both the inventory and stations files to disk, in case we need them

with open("inventory.txt", "w") as inventory_file:
    inventory_file.write(inventory_txt)

with open("stations.txt", "w") as stations_file:
    stations_file.write(stations_txt)

copy

Start by looking at the inventory file. Here’s what the first 137 characters show you:

print(inventory_txt[:137])

ACW00011604  17.1167  -61.7833 TMAX 1949 1949
ACW00011604  17.1167  -61.7833 TMIN 1949 1949
ACW00011604  17.1167  -61.7833 PRCP 1949 1949
If we look at section VII of the readme.txt file we can see that the format
     of the inventory file is:
VII. FORMAT OF "ghcnd-inventory.txt"

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
LATITUDE     13-20   Real
LONGITUDE    22-30   Real
ELEMENT      32-35   Character
FIRSTYEAR    37-40   Integer
LASTYEAR     42-45   Integer
------------------------------

copy

These variables have the following definitions:

ID         is the station identification code. Please see "ghcnd-stations.txt"
           for a complete list of stations and their metadata.

LATITUDE   is the latitude of the station (in decimal degrees).

LONGITUDE  is the longitude of the station (in decimal degrees).

ELEMENT    is the element type. See section III for a definition of elements.

FIRSTYEAR  is the first year of unflagged data for the given element.

LASTYEAR   is the last year of unflagged data for the given element.

copy

From this description, you can tell that the inventory list has most of the information you need to find the station you want to look at. You can use the latitude and longitude to find the stations closest to you; then you can use the FIRSTYEAR and LASTYEAR fields to find a station with records covering a long span of time.

The only question remaining is what the ELEMENT field is; for that, the file suggests that you look at section III. In section III (which I look at in more detail later), you find the following description of the main elements:

ELEMENT    is the element type. There are five core elements as well as a
           number of addition elements.

       The five core elements are:

           PRCP = Precipitation (tenths of mm)
           SNOW = Snowfall (mm)
           SNWD = Snow depth (mm)
           TMAX = Maximum temperature (tenths of degrees C)
           TMIN = Minimum temperature (tenths of degrees C)

For purposes of this example, you’re interested in the TMAX and TMIN elements, which are maximum and minimum temperatures in tenths of degrees Celsius.

Parsing the inventory data

The readme.txt file tells you what you’ve got in the inventory file so that you can parse the data into a more usable format. You could just store the parsed inventory data as a list of lists or list of tuples, but it takes only a little more effort to use namedtuple from the collections library to create a custom class with the attributes named:

# parse to named tuples

# use namedtuple to create a custom Inventory class
from collections import namedtuple
Inventory = namedtuple("Inventory", ['station', 'latitude', 'longitude',
     'element', 'start', 'end'])

Using the Inventory class you created is very straightforward; you simply create each instance from the appropriate values, which in this case are a parsed row of inventory data.

The parsing involves two steps. First, you need to pick out slices of a line according to the field sizes specified. As you look at the field descriptions in the readme file, it’s also clear that there’s an extra space between files, which you need to consider in coming up with any approach to parsing. In this case, because you’re specifying each slice, the extra spaces are ignored. In addition, because the sizes of the STATION and ELEMENT fields exactly correspond to the values stored in them, you shouldn’t need to worry about stripping excess spaces from them.

The second thing that would be nice to do is convert the latitude and longitude values to floats and the start and end years to ints. You could do this at a later stage of data cleaning, and in fact, if the data is inconsistent and doesn’t have values that convert correctly in every row, you might want to wait. But in this case, the data lets you handle these conversions in the parsing step, so do it now:

# parse inventory lines and convert some values to floats and ints

inventory = [Inventory(x[0:11], float(x[12:20]), float(x[21:30]), x[31:35],
     int(x[36:40]), int(x[41:45]))
             for x in inventory_txt.split("\n") if x.startswith("US")]

for line in inventory[:5]:
    print(line)
Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333,
     element='TMAX', start=2008, end=2016)
Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333,
     element='TMIN', start=2008, end=2016)
Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333,
     element='PRCP', start=2008, end=2016)
Inventory(station='US009052008', latitude=43.7333, longitude=-96.6333,
     element='SNWD', start=2009, end=2016)
Inventory(station='US10RMHS145', latitude=40.5268, longitude=-105.1113,
     element='PRCP', start=2004, end=2004)

Selecting a station based on latitude and longitude

Now that the inventory is loaded, you can use the latitude and longitude to find the stations closest to your location and then pick the one with the longest run of temperatures based on start and end years. At even the first line of the data, you can see two things to worry about:

  • There are various element types, but you’re concerned only with TMIN and TMAX, for minimum and maximum temperature.

  • None of the first inventory entries you see covers more than a few years. If you’re going to be looking for an historical perspective, you want to find a much longer run of temperature data.

To pick out what you need quickly, we can use a list comprehension to make a sublist of only the station inventory items in which the element is TMIN or TMAX. The other thing that you care about is getting a station with a long run of data, so while you’re creating this sublist, also make sure that the start year is before 1920 and that the end year is at least 2015. That way, you’re looking only at stations with at least 95 years’ worth of data:

inventory_temps = [x for x in inventory if x.element in ['TMIN', 'TMAX']
                   and x.end >= 2015 and x.start < 1920]
inventory_temps[:5]

[Inventory(station='USC00010252', latitude=31.3072, longitude=-86.5225,
     element='TMAX', start=1912, end=2017),
 Inventory(station='USC00010252', latitude=31.3072, longitude=-86.5225,
     element='TMIN', start=1912, end=2017),
 Inventory(station='USC00010583', latitude=30.8839, longitude=-87.7853,
     element='TMAX', start=1915, end=2017),
 Inventory(station='USC00010583', latitude=30.8839, longitude=-87.7853,
     element='TMIN', start=1915, end=2017),
 Inventory(station='USC00012758', latitude=31.445, longitude=-86.9533,
     element='TMAX', start=1890, end=2017)]

Looking at the first five records in your new list, you see that you’re in better shape. Now you have only temperature records, and the start and end years show that you have longer runs.

That leaves the problem of selecting the station nearest your location. To do that, compare the latitude and longitude of the station inventories with those of your location. There are various ways to get the latitude and longitude of any place, but probably the easiest way is to use an online mapping application or online search. (When I do that for the Chicago Loop, I get a latitude of 41.882 and a longitude of -87.629.)

Because you’re interested in the stations closest to your location, that interest implies sorting based on how close the latitude and longitude of the stations are to those of your location. Sorting a list is easy enough, and sorting by latitude and longitude isn’t too hard. But how do you sort by the distance from your latitude and longitude?

The answer is to define a key function for your sort that gets the difference between your latitude and the station’s latitude, and the difference between your longitude and the station’s longitude, and combines them into one number. The only other thing to remember is that you’ll want to add the absolute value of the differences before you combine them to avoid having a high negative difference combined with an equally high positive difference that would fool your sort:

# Downtown Chicago, obtained via online map
latitude, longitude = 41.882, -87.629

inventory_temps.sort(key=lambda x:  abs(latitude-x.latitude) + abs(longitude-x.longitude))

inventory_temps[:20]
Out[24]:
[Inventory(station='USC00110338', latitude=41.7806, longitude=-88.3092,
     element='TMAX', start=1893, end=2017),
 Inventory(station='USC00110338', latitude=41.7806, longitude=-88.3092,
     element='TMIN', start=1893, end=2017),
 Inventory(station='USC00112736', latitude=42.0628, longitude=-88.2861,
     element='TMAX', start=1897, end=2017),
 Inventory(station='USC00112736', latitude=42.0628, longitude=-88.2861,
     element='TMIN', start=1897, end=2017),
 Inventory(station='USC00476922', latitude=42.7022, longitude=-87.7861,
     element='TMAX', start=1896, end=2017),
 Inventory(station='USC00476922', latitude=42.7022, longitude=-87.7861,
     element='TMIN', start=1896, end=2017),
 Inventory(station='USC00124837', latitude=41.6117, longitude=-86.7297,
     element='TMAX', start=1897, end=2017),
 Inventory(station='USC00124837', latitude=41.6117, longitude=-86.7297,
     element='TMIN', start=1897, end=2017),
 Inventory(station='USC00119021', latitude=40.7928, longitude=-87.7556,
     element='TMAX', start=1893, end=2017),
 Inventory(station='USC00119021', latitude=40.7928, longitude=-87.7556,
     element='TMIN', start=1894, end=2017),
 Inventory(station='USC00115825', latitude=41.3708, longitude=-88.4336,
     element='TMAX', start=1912, end=2017),
 Inventory(station='USC00115825', latitude=41.3708, longitude=-88.4336,
     element='TMIN', start=1912, end=2017),
 Inventory(station='USC00115326', latitude=42.2636, longitude=-88.6078,
     element='TMAX', start=1893, end=2017),
 Inventory(station='USC00115326', latitude=42.2636, longitude=-88.6078,
     element='TMIN', start=1893, end=2017),
 Inventory(station='USC00200710', latitude=42.1244, longitude=-86.4267,
     element='TMAX', start=1893, end=2017),
 Inventory(station='USC00200710', latitude=42.1244, longitude=-86.4267,
     element='TMIN', start=1893, end=2017),
 Inventory(station='USC00114198', latitude=40.4664, longitude=-87.685,
     element='TMAX', start=1902, end=2017),
 Inventory(station='USC00114198', latitude=40.4664, longitude=-87.685,
     element='TMIN', start=1902, end=2017),
 Inventory(station='USW00014848', latitude=41.7072, longitude=-86.3164,
     element='TMAX', start=1893, end=2017),
 Inventory(station='USW00014848', latitude=41.7072, longitude=-86.3164,
     element='TMIN', start=1893, end=2017)]

Selecting a station and getting the station metadata

As you look at the top 20 entries in your newly sorted list, it seems that the first station, USC00110338, is a good fit. It’s got both TMIN and TMAX and one of the longer series, starting in 1893 and running up through 2017, for more than 120 years’ worth of data. So save that station into your station variable and quickly parse the station data you’ve already grabbed to pick up a little more information about the station.

Back in the readme file, you find the following information about the station data:

IV. FORMAT OF "ghcnd-stations.txt"

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
LATITUDE     13-20   Real
LONGITUDE    22-30   Real
ELEVATION    32-37   Real
STATE        39-40   Character
NAME         42-71   Character
GSN FLAG     73-75   Character
HCN/CRN FLAG 77-79   Character
WMO ID       81-85   Character
------------------------------

These variables have the following definitions:

ID         is the station identification code.  Note that the first two
           characters denote the FIPS  country code, the third character
           is a network code that identifies the station numbering system
           used, and the remaining eight characters contain the actual
           station ID.

           See "ghcnd-countries.txt" for a complete list of country codes.
           See "ghcnd-states.txt" for a list of state/province/territory
           codes.

           The network code  has the following five values:

           0 = unspecified (station identified by up to eight
           alphanumeric characters)
       1 = Community Collaborative Rain, Hail,and Snow (CoCoRaHS)
           based identification number.  To ensure consistency with
           with GHCN Daily, all numbers in the original CoCoRaHS IDs
           have been left-filled to make them all four digits long.
           In addition, the characters "-" and "_" have been removed
           to ensure that the IDs do not exceed 11 characters when
           preceded by "US1". For example, the CoCoRaHS ID
           "AZ-MR-156" becomes "US1AZMR0156" in GHCN-Daily
       C = U.S. Cooperative Network identification number (last six
           characters of the GHCN-Daily ID)
       E = Identification number used in the ECA&D non-blended
           dataset
       M = World Meteorological Organization ID (last five
           characters of the GHCN-Daily ID)
       N = Identification number used in data supplied by a
           National Meteorological or Hydrological Center
       R = U.S. Interagency Remote Automatic Weather Station (RAWS)
           identifier
       S = U.S. Natural Resources Conservation Service SNOwpack
           TELemtry (SNOTEL) station identifier
           W = WBAN identification number (last five characters of the
           GHCN-Daily ID)

LATITUDE   is latitude of the station (in decimal degrees).

LONGITUDE  is the longitude of the station (in decimal degrees).

ELEVATION  is the elevation of the station (in meters, missing = -999.9).


STATE      is the U.S. postal code for the state (for U.S. stations only).

NAME       is the name of the station.
GSN FLAG   is a flag that indicates whether the station is part of the GCOS
           Surface Network (GSN). The flag is assigned by cross-referencing
           the number in the WMOID field with the official list of GSN
           stations. There are two possible values:

           Blank = non-GSN station or WMO Station number not available
           GSN   = GSN station

HCN/      is a flag that indicates whether the station is part of the U.S.
CRN FLAG  Historical Climatology Network (HCN).  There are three possible
          values:

           Blank = Not a member of the U.S. Historical Climatology
               or U.S. Climate Reference Networks
           HCN   = U.S. Historical Climatology Network station
           CRN   = U.S. Climate Reference Network or U.S. Regional Climate
               Network Station

WMO ID     is the World Meteorological Organization (WMO) number for the
           station.  If the station has no WMO number (or one has not yet
           been matched to this station), then the field is blank.

Although you might care more about the metadata fields for more serious research, right now you want to match the start and end year from the inventory records to the rest of the station metadata in the station file.

You have several ways to sift through the stations file to find the one station that matches the station ID you selected. You could create a for loop to go through each line and break out when you find it; you could split the data into lines and then sort and use a binary search, and so on. Depending on the nature and amount of data you have, one approach or another might be appropriate. In this case, because you have the data loaded already, and it’s not too large, use a list comprehension to return a list with its single element being the station you’re looking for:

station_id = 'USC00110338'

# parse stations
Station = namedtuple("Station", ['station_id', 'latitude', 'longitude',
     'elevation', 'state', 'name', 'start', 'end'])

stations = [(x[0:11], float(x[12:20]), float(x[21:30]), float(x[31:37]),
     x[38:40].strip(), x[41:71].strip())
            for x in stations_txt.split("\n") if x.startswith(station_id)]

station = Station(*stations[0] + (inventory_temps[0].start,
     inventory_temps[0].end))
print(station)
Station(station_id='USC00110338', latitude=41.7806, longitude=-88.3092,
     elevation=201.2, state='IL', name='AURORA', start=1893, end=2017)

At this point, you’ve identified that you want weather data from the station at Aurora, Illinois, which is the nearest station to downtown Chicago with more than a century’s worth of temperature data.

Fetching and parsing the actual weather data

With the station identified, the next step is fetching the actual weather data for that station and parsing it. The process is quite similar to what you did in the preceding section.

Fetching the data

First, fetch the data file and save it, in case you need to go back to it:

# fetch daily records for selected station

r = requests.get('https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/all/
     {}.dly'.format(station.station_id))
weather = r.text

# save into a text file, so we won't need to fetch again

with open('weather_{}.txt'.format(station), "w") as weather_file:
    weather_file.write(weather)

# read from saved daily file if needed (only used if we want to start the
     process over without downloadng the file)

with open('weather_{}.txt'.format(station)) as weather_file:
    weather = weather_file.read()

print(weather[:540])
USC00110338189301TMAX  -11  6  -44  6 -139  6  -83  6 -100  6  -83  6  -72  6
     -83  6  -33  6 -178  6 -150  6 -128  6 -172  6 -200  6 -189  6 -150  6 -
     106  6  -61  6  -94  6  -33  6  -33  6  -33  6  -33  6    6  6  -33  6
     -78  6  -33  6   44  6  -89 I6  -22  6    6  6
USC00110338189301TMIN  -50  6 -139  6 -250  6 -144  6 -178  6 -228  6 -144  6
     -222  6 -178  6 -250  6 -200  6 -206  6 -267  6 -272  6 -294  6 -294  6
     -311  6 -200  6 -233  6 -178  6 -156  6  -89  6 -200  6 -194  6 -194  6
     -178  6 -200  6  -33 I6 -156  6 -139  6 -167  6

Parsing the weather data

Again, now that you have the data, you can see it’s quite a bit more complex than the station and inventory data. Clearly, it’s time to head back to the readme.txt file and section III, which is the description of a weather data file. You have a lot of options, so filter them down to the ones that concern you, and leave out the other element types as well as the whole system of flags specifying the source, quality, and type of the values:

III. FORMAT OF DATA FILES (".dly" FILES)

Each ".dly" file contains data for one station.  The name of the file
corresponds to a station's identification code.  For example,
     "USC00026481.dly"
contains the data for the station with the identification code USC00026481).

Each record in a file contains one month of daily data.  The variables on
     each
line include the following:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------

These variables have the following definitions:

ID         is the station identification code. Please see "ghcnd-stations.txt"
           for a complete list of stations and their metadata.
YEAR       is the year of the record.

MONTH      is the month of the record.

ELEMENT    is the element type.   There are five core elements as well as a
           number of addition elements.

       The five core elements are:

           PRCP = Precipitation (tenths of mm)
           SNOW = Snowfall (mm)
           SNWD = Snow depth (mm)
           TMAX = Maximum temperature (tenths of degrees C)
           TMIN = Minimum temperature (tenths of degrees C)

...

VALUE1     is the value on the first day of the month (missing = -9999).

MFLAG1     is the measurement flag for the first day of the month.

QFLAG1     is the quality flag for the first day of the month.

SFLAG1     is the source flag for the first day of the month.

VALUE2     is the value on the second day of the month

MFLAG2     is the measurement flag for the second day of the month.

QFLAG2     is the quality flag for the second day of the month.

SFLAG2     is the source flag for the second day of the month.

... and so on through the 31st day of the month.  Note: If the month has less
than 31 days, then the remaining variables are set to missing (e.g., for April,
VALUE31 = -9999, MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank).

The key points you care about right now are that the station ID is the 11 characters of a row, the year is the next 4, the month the next 2, and the element the next 4 after that. After that, there are 31 slots for daily data, with each slot consisting of 5 characters for the temperature, expressed in tenths of a degree Celsius, and 3 characters of flags. As I mentioned earlier, you can disregard the flags for this exercise. You can also see that missing values for the temperatures are coded with -9999 if that day isn’t in the month, so for a typical February, for example, the 29th, 30th, and 31st values would be -9999.

As you process your data in this exercise, you’re looking to get overall trends, so you don’t need to worry much about individual days. Instead, find average values for the month. You can save the maximum, minimum, and mean values for the entire month and use those.

This means that to process each line of weather data, you need to:

  • Split the line into its separate fields, and ignore or discard the flags for each daily value.

  • Remove the values with -9999, and convert the year and month into ints and the temperature values into floats, keeping in mind that the temperature readings are in tenths of degrees centigrade.

  • Calculate the average value, and pick out the high and low values.

To accomplish all these tasks, you can take a couple of approaches. You could do several passes over the data, splitting into fields, discarding the placeholders, converting strings to numbers, and finally calculating the summary values. Or you can write a function that performs all of these operations on a single line and do everything in one pass. Both approaches can be valid. In this case, take the latter approach and create a parse_line function to perform all of your data transformations:

def parse_line(line):
    """ parses line of weather data
        removes values of -9999 (missing value)
    """

    # return None if line is empty
    if not line:
        return None

    # split out first 4 fields and string containing temperature values
    record, temperature_string = (line[:11], int(line[11:15]),
     int(line[15:17]), line[17:21]), line[21:]

    # raise exception if the temperature string is too short
    if len(temperature_string) < 248:
        raise ValueError("String not long enough - {}
     {}".format(temperature_string, str(line)))

    # use a list comprehension on the temperature_string to extract and
     convert the
    values = [float(temperature_string[i:i + 5])/10 for i in range(0, 248, 8)
              if not temperature_string[i:i + 5].startswith("-9999")]

    # get the number of values, the max and min, and calculate average
    count = len(values)
    tmax = round(max(values), 1)
    tmin = round(min(values), 1)
    mean = round(sum(values)/count, 1)

    # add the temperature summary values to the record fields extracted
     earlier and return
    return record + (tmax, tmin, mean, count)

If you test this function with the first line of your raw weather data, you get the following result:

parse_line(weather[:270])
Out[115]:
('USC00110338', 1893, 1, 'TMAX', 4.4, -20.0, -7.8, 31)

So it looks like you have a function that will work to parse your data. If that function works, you can parse the weather data and either store it or continue with your processing:

# process all weather data

# list comprehension, will not parse empty lines
weather_data = [parse_line(x) for x in weather.split("\n") if x]

len(weather_data)

weather_data[:10]

[('USC00110338', 1893, 1, 'TMAX', 4.4, -20.0, -7.8, 31),
 ('USC00110338', 1893, 1, 'TMIN', -3.3, -31.1, -19.2, 31),
 ('USC00110338', 1893, 1, 'PRCP', 8.9, 0.0, 1.1, 31),
 ('USC00110338', 1893, 1, 'SNOW', 10.2, 0.0, 1.0, 31),
 ('USC00110338', 1893, 1, 'WT16', 0.1, 0.1, 0.1, 2),
 ('USC00110338', 1893, 1, 'WT18', 0.1, 0.1, 0.1, 11),
 ('USC00110338', 1893, 2, 'TMAX', 5.6, -17.2, -0.9, 27),
 ('USC00110338', 1893, 2, 'TMIN', 0.6, -26.1, -11.7, 27),
 ('USC00110338', 1893, 2, 'PRCP', 15.0, 0.0, 2.0, 28),
 ('USC00110338', 1893, 2, 'SNOW', 12.7, 0.0, 0.6, 28)]

Now you have all the weather records, not just the temperature records, parsed and in your list.

Saving the weather data in a database (optional)

At this point, you can save all of the weather records (and the station records and inventory records as well, if you want) in a database. Doing so lets you come back in later sessions and use the same data without having to go to the hassle of fetching and parsing the data again.

As an example, the following code is how you could save the weather data in a sqlite3 database:

import sqlite3

conn = sqlite3.connect("weather_data.db")
cursor = conn.cursor()

# create weather table

create_weather = """CREATE TABLE "weather" (
    "id" text NOT NULL,
    "year" integer NOT NULL,
    "month" integer NOT NULL,
    "element" text NOT NULL,
    "max" real,
    "min" real,
    "mean" real,
    "count" integer)"""
cursor.execute(create_weather)
conn.commit()

# store parsed weather data in database

for record in weather_data:
    cursor.execute("""insert into weather (id, year, month, element, max,
     min, mean, count) values (?,?,?,?,?,?,?,?) """,
                      record)

conn.commit()

When you have the data stored, you could retrieve it from the database with code like the following, which fetches only the TMAX records:

cursor.execute("""select * from weather where element='TMAX' order by year,
     month""")
tmax_data = cursor.fetchall()
tmax_data[:5]

[('USC00110338', 1893, 1, 'TMAX', 4.4, -20.0, -7.8, 31),
 ('USC00110338', 1893, 2, 'TMAX', 5.6, -17.2, -0.9, 27),
 ('USC00110338', 1893, 3, 'TMAX', 20.6, -7.2, 5.6, 30),
 ('USC00110338', 1893, 4, 'TMAX', 28.9, 3.3, 13.5, 30),
 ('USC00110338', 1893, 5, 'TMAX', 30.6, 7.2, 19.2, 31)]

Selecting and graphing data

Because you’re concerned only with temperature, you need to select just the temperature records. You can do that quickly enough by using a couple of list comprehensions to pick out a list for TMAX and one for TMIN. Or you could use the features of pandas, which you’ll be using for graphing the date, to filter out the records you don’t want. Because you’re more concerned with pure Python than with pandas, take the first approach:

tmax_data = [x for x in weather_data if x[3] == 'TMAX']
tmin_data = [x for x in weather_data if x[3] == 'TMIN']
tmin_data[:5]

[('USC00110338', 1893, 1, 'TMIN', -3.3, -31.1, -19.2, 31),
 ('USC00110338', 1893, 2, 'TMIN', 0.6, -26.1, -11.7, 27),
 ('USC00110338', 1893, 3, 'TMIN', 3.3, -13.3, -4.6, 31),
 ('USC00110338', 1893, 4, 'TMIN', 12.2, -5.6, 2.2, 30),
 ('USC00110338', 1893, 5, 'TMIN', 14.4, -0.6, 5.7, 31)]

Using pandas to graph your data

At this point, you have your data cleaned and ready to graph. To make the graphing easier, you can use pandas and matplotlib, as described in chapter 24. To do this, you need to have a Jupyter server running and have pandas and matplotlib installed. To make sure that they’re installed from within your Jupyter notebook, use the following command:

# Install pandas and matplotlib using pip
! pip3.6 install pandas matplotlib

import pandas as pd
%matplotlib inline

When pandas and matplotlib are installed, you can load pandas and create data frames for your TMAX and TMIN data:

tmax_df = pd.DataFrame(tmax_data, columns=['Station', 'Year', 'Month',
     'Element', 'Max', 'Min', 'Mean', 'Days'])
tmin_df = pd.DataFrame(tmin_data, columns=['Station', 'Year', 'Month',
     'Element', 'Max', 'Min', 'Mean', 'Days'])

You could plot the monthly values, but 123 years times 12 months of data is almost 1,500 data points, and the cycle of seasons also makes picking out patterns difficult.

Instead, it probably makes more sense to average the high, low, and mean monthly values into yearly values and plot those values. You could do this in Python, but because you already have your data loaded in a pandas data frame, you can use that to group by year and get the mean values:

# select Year, Min, Max, Mean columns, group by year, average and line plot

tmin_df[['Year','Min', 'Mean', 'Max']].groupby('Year').mean().plot(
     kind='line', figsize=(16, 4))

This result has a fair amount of variation, but it does seem to indicate that the minimum temperature has been on the rise for the past 20 years.

Note that if you wanted to get the same graph without using Jupyter notebook and matplotlib, you could use still use pandas, but you’d write to a CSV or Microsoft Excel file, using the data frame’s to_csv or to_excel method. Then you could load the resulting file into a spreadsheet and graph from there.

Last updated