24. Exploring data
Chapter 24. Exploring data
Over the past few chapters, I’ve dealt with some aspects of using Python to get and clean data. Now it’s time to look at a few of the things that Python can help you do to manipulate and explore data.
24.1. Python tools for data exploration
In this chapter, we’ll look at some common Python tools for data exploration: Jupyter notebook, pandas, and matplotlib. I can only touch briefly on a few features of these tools, but the aim is to give you an idea of what is possible and some initial tools to use in exploring data with Python.
24.1.1. Python’s advantages for exploring data
Python has become one of the leading languages for data science and continues to grow in that area. As I’ve mentioned, however, Python isn’t always the fastest language in terms of raw performance. Conversely, some data-crunching libraries, such as NumPy, are largely written in C and heavily optimized to the point that speed isn’t an issue. In addition, considerations such as readability and accessibility often outweigh pure speed; minimizing the amount of developer time needed is often more important. Python is readable and accessible, and both on its own and in combination with tools developed in the Python community, it’s an enormously powerful tool for manipulating and exploring data.
24.1.2. Python can be better than a spreadsheet
Spreadsheets have been the tools of choice for ad-hoc data manipulation for decades. People who are skilled with spreadsheets can make them do truly impressive tricks: spreadsheets can combine different but related data sets, pivot tables, use lookup tables to link data sets, and much more. But although people everywhere get a vast amount of work done with them every day, spreadsheets do have limitations, and Python can help you go beyond those limitations.
One limitation that I’ve already alluded to is the fact that most spreadsheet software has a row limit—currently, about 1 million rows, which isn’t enough for many data sets. Another limitation is the central metaphor of the spreadsheet itself. Spreadsheets are two-dimensional grids, rows and columns, or at best stacks of grids, which limits the ways you can manipulate and think about complex data.
With Python, you can code your way around the limitations of spreadsheets and manipulate data the way you want. You can combine Python data structures such as lists, tuples, sets, and dictionaries in endlessly flexible ways, or you can create your own classes to package both data and behavior exactly the way you need.
24.2. Jupyter notebook
Probably one of the most compelling tools for exploring data with Python doesn’t augment what the language itself does, but changes the way you use the language to interact with your data. Jupyter notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Although several other languages are now supported, it originated in connection with IPython, an alternative shell for Python developed by the scientific community.
What makes Jupyter such a convenient and powerful tool is the fact that you interact with it in a web browser. It lets you combine text and code, as well as modify and execute your code interactively. You can not only run and modify code in chunks, but also save and share the notebooks with others.
The best way to get a feel for what Jupyter notebook can do is start playing with it. It’s fairly easy to run a Jupyter process locally on your machine, or you can access online versions. For some options, see the sidebar on ways to run Jupyter.
WAYS TO RUN JUPYTER
Jupyter online: Accessing online instances of Jupyter is one of the easiest ways to get started. Currently, Project Jupyter, the community behind Jupyter, hosts free notebooks at https://jupyter.org/try. You can also find demo notebooks and kernels for other languages. At this writing, you can also access free notebooks on Microsoft’s Azure platform at https://notebooks.azure.com, and many other ways are available.
Jupyter locally: Although using an online instance is quite convenient, it’s not very much work to set up your own instance of Jupyter on your computer. Usually for local versions, you point your browser to localhost:8888.
If you use Docker, you have several containers to choose among. To run the data science notebook container, use something like this:
If you’d rather run directly on your system, it’s easy to install and run Jupyter in a virtualenv.
macOS and Linux systems: First, open a command window, and enter the following commands:
Windows systems:
The last command should run the Jupyter notebook web app and open a browser window pointing at it.
24.2.1. Starting a kernel
When you have Jupyter installed, running, and open in your browser, you need to start a Python kernel. One nice thing about Jupyter is that it lets you run multiple kernels at the same time. You can run kernels for different versions of Python and for other languages such as R, Julia, and even Ruby.
Starting a kernel is easy. Just click the new button and select Python 3 (figure 24.1).
Figure 24.1. Starting a Python kernel
24.2.2. Executing code in a cell
When you have a kernel running, you can start entering and running Python code. Right away, you’ll notice a few differences from the ordinary Python command shell. You won’t get the >>> prompt that you see in the standard Python shell, and pressing Enter just adds new lines in the cell. To execute the code in a cell, illustrated in figure 24.2, choose Cell > Run Cells, click the Run button immediately to the left of the down arrow on the button bar, or use the key combination Alt-Enter. After you use Jupyter notebook a little bit, it’s quite likely that the Alt-Enter key combination will become quite natural to you.
Figure 24.2. Executing code in a notebook cell
You can test how it works by entering some code or an expression into the first cell of your new notebook and then pressing Alt-Enter.
As you can see, any output is shown immediately below the cell, and a new cell is created and ready for your next input. Also note that each cell that’s executed is numbered in the order in which it’s executed.
TRY THIS: USING JUPYTER NOTEBOOK
Enter some code in the notebook and experiment with running it. Check out the Edit, Cell, and Kernel menus to see what options are there. When you have a little code running, use the Kernel menu to restart the kernel, repeat your steps, and then use the Cell menu to rerun the code in all of the cells.
24.3. Python and pandas
In the course of exploring and manipulating data, you perform quite a few common operations, such as loading data into a list or dictionary, cleaning data, and filtering data. Most of these operations are repeated often, have to be done in standard patterns, and are simple and often tedious. If you think that this combination is a strong reason to automate those tasks you’re not alone. One of the now-standard tools for handling data in Python—pandas—was created to automate the boring heavy lifting of handling data sets.
24.3.1. Why you might want to use pandas
pandas was created to make manipulating and analyzing tablular or relational data easy by providing a standard framework for holding the data, with convenient tools for frequent operations. As a result, it’s almost more of an extension to Python than a library, and it changes the way you can interact with data. The plus side is that after you grok how pandas work, you can do some impressive things and save a lot of time. It does take time to learn how to get the most from pandas, however. As with many tools, if you use pandas for what it was designed for, it excels. The simple examples I show you in the following sections should give you a rough idea whether pandas is a tool that’s suited for your use cases.
24.3.2. Installing pandas
pandas is easy to install with pip. It’s often used along with matplotlib for plotting, so you can install both tools from the command line of your Jupyter virtual environment with this code:
From a cell in a Jupyter notebook, you can use
copy
If you use pandas, life will be easier if you use the following three lines:
The first line is a Jupyter “magic” function that enables matplotlib to plot data in the cell where your code is (which is very useful). The second line imports pandas with the alias of pd, which is both easier to type and common among pandas users; the last line also imports numpy. Although pandas depends quite a bit on numpy, you won’t use it explicitly in the following examples, but it’s reasonable to get into the habit of importing it anyway.
24.3.3. Data frames
One basic structure that you get with pandas is a data frame. A data frame is a two-dimensional grid, rather similar to a relational database table except in memory. Creating a data frame is easy; you give it some data. To keep things absolutely simple, give it a 3 × 3 grid of numbers as the first example. In Python, such a grid is a list of lists:
copy
Sadly, in Python the grid won’t look like a grid unless you make some additional effort. So see what you can do with the same grid as a pandas data frame:
That code is fairly straightforward; all you needed to do was turn your grid into a data frame. You’ve gained a more gridlike display, and now you have both row and column numbers. It’s often rather bothersome to keep track of what column number is what, of course, so give your columns names:
You may wonder whether naming the columns has any benefit, but the column names can be put to use with another pandas trick: the ability to select columns by name. If you want the contents only of column "two", for example, you can get it very simply:
Here, you’ve already saved time in comparison to Python. To get only column two of your grid, you’d need to use a list comprehension while also remembering to use a zero-based index (and you still wouldn’t get the nice output):
You can loop over data frame column values just as easily as the list you got by using a comprehension:
That’s not bad for a start, but by using a list of columns in double brackets, you can do better, getting a subset of the data frame that’s another data frame. Instead of getting the middle column, get the first and last columns of your data frame as another data frame:
copy
A data frame also has several methods that apply the same operation and argument to every item in the frame. If you want to add two to every item in the data frame’s edges, you could use the add() method:
Here again, it’s possible to get the same result by using list comprehensions and/or nested loops, but those techniques aren’t as convenient. It’s pretty easy to see how such functionality can make life easier, particularly for someone who’s more interested in the information that the data contains than in the process of manipulating it.
24.4. Data cleaning
In earlier chapters, I discussed a few ways to use Python to clean data. Now that I’ve added pandas to the mix, I’ll show you examples of how to use its functionality to clean data. As I present the following operations, I also refer to ways that the same operation might be done in plain Python, both to illustrate how using pandas is different and to show why pandas isn’t right for every use case (or user, for that matter).
24.4.1. Loading and saving data with pandas
pandas has an impressive collection of methods to load data from different sources. It supports several file formats (including fixed-width and delimited text files, spreadsheets, JSON, XML, and HTML), but it’s also possible to read from SQL databases, Google BiqQuery, HDF, and even clipboard data. You should be aware that many of these operations aren’t actually part of pandas itself; pandas relies on having other libraries installed to handle those operations, such as SQLAlchemy for reading from SQL databases. This distinction matters mostly if something goes wrong; quite often, the problem that needs to be fixed is outside pandas, and you’re left to deal with the underlying library.
Reading a JSON file with the read_json() method is simple:
This code gives you a data frame like this:
For another example of how simple reading data into pandas is, load some data from the CSV file of temperature data from chapter 21 and from the JSON file of Mars weather data used in chapter 22. In the first case, use the read_csv() method:
Clearly, loading the file in a single step is appealing, and you can see that pandas had no issues loading the file. You can also see that the empty first column has been translated into NaN (not a number). You do still have the same issue with 'Missing' for some values, and in fact it might make sense to have those 'Missing' values converted to NaN:
The addition of the na_values parameter controls what values will be translated to NaN on load. In this case, you added the string 'Missing' so that the row of the data frame was translated from
to
This technique can be particularly useful if you have one of those data files in which, for whatever reason, “no data” is indicated in a variety of ways: NA, N/A, ?, -, and so on. To handle a case like that, you can inspect the data to find out what’s used and then reload it, using the na_values parameter to standardize all those variations as NaN.
Saving data
If you want to save the contents of a data frame, a pandas data frame has a similarly broad collection of methods. If you take your simple grid data frame, you can write it in several ways. This line
writes a file that looks like this:
copy
Similarly, you can transform a data grid to a JSON object or write it to a file:
24.4.2. Data cleaning with a data frame
Converting a particular set of values to NaN on load is a very simple bit of data cleaning that pandas makes trivial. Going beyond that, data frames support several operations that can make data cleaning less of a chore. To see how this works, reopen the temperature CSV file, but this time, instead of using the headers to name the columns, use the range() function with the names parameter to give them numbers, which will make referring to them easier. You also may recall from an earlier example that the first field of every line—the "Notes" field—is empty and loaded with NaN values. Although you could ignore this column, it would be even easier if you didn’t have it. You can use the range() function again, this time starting from 1, to tell pandas to load all columns except the first one. But if you know that all of your values are from Illinois and you don’t care about the long-form date field, you could start from 4 to make things much more manageable:
Now you have a data frame that has only the columns you might want to work with. But you still have an issue: the last column, which lists the percentage of coverage for the heat index, is still a string ending with a percentage sign rather than an actual percentage. This problem is apparent if you look at the first row’s value for column 17:
copy
To fix this problem, you need to do two things: Remove the % from the end of the value and then cast the value from string to a number. Optionally, if you want to represent the resulting percentage as a fraction, you need to divide it by 100. The first bit is simple because pandas lets you use a single command to repeat an operation on a column:
This code takes the column and calls a string strip() operation on it to remove the trailing %. Now when you look at the first value in the column (or any of the other values), you see that the offending percentage sign is gone. It’s also worth noting that you could have used other operations, such as replace("%", ""), to achieve the same result.
The second operation is to convert the string to a numeric value. Again, pandas lets you perform this operation with one command:
Now the values in column 17 are numeric, and if you want to, you can use the div() method to finish the job of turning those values into fractions:
In fact, it would be possible to achieve the same result in a single line by chaining the three operations together:
This example is very simple, but it gives you an idea of the convenience that pandas can bring to cleaning your data. pandas has a wide variety of operations for transforming data, as well as the ability to use custom functions, so it would be hard to think of a scenario in which you couldn’t streamline data cleaning with pandas.
Although the number of options is almost overwhelming, a wide variety of tutorials and videos is available, and the documentation at http://pandas.pydata.org is excellent.
TRY THIS: CLEANING DATA WITH AND WITHOUT PANDAS
Experiment with the operations. When the final column has been converted to a fraction, can you think of a way to convert it back to a string with the trailing percentage sign?
By contrast, load the same data into a plain Python list by using the csv module, and apply the same changes by using plain Python.
24.5. Data aggregation and manipulation
The preceding examples probably gave you some idea of the many options pandas gives you for performing fairly complex operations on your data with only a few commands. As you might expect, this level of functionality is also available for aggregating data. In this section, I walk through a few simple examples of aggregating data to illustrate some of the many possibilities. Although many options are available, I focus on merging data frames, performing simple data aggregation, and grouping and filtering.
24.5.1. Merging data frames
Quite often in the course of handling data, you need to relate two data sets. Suppose that you have one file containing the number of sales calls made per month by members of a sales team, and in another file, you have the dollar amounts of the sales in each of their territories:
copy
Clearly, it would be very useful to link revenue and team-member activity. These two files are very simple, yet merging them with plain Python isn’t entirely trivial. pandas has a function to merge two data frames:
The merge function creates a new data frame by joining the two frames on the columns specified in the column field. The merge function works similarly to a relational-database join, giving you a table that combines the columns from the two files:
In this case, you have a one-to-one correspondence between the rows in the two fields, but the merge function can also do one-to-many and many-to-many joins, as well as right and left joins.
QUICK CHECK: MERGING DATA SETS
How would you go about merging to data sets like the ones in the Python example?
24.5.2. Selecting data
It can also be useful to select or filter the rows in a data frame based on some condition. In the example sales data, you may want to look only at territory 3, which is also easy:
In this example, you select only rows in which the territory is equal to 3 but using exactly that expression, revenue.Territory==3, as the index for the data frame. From the point of view of plain Python, such use is nonsense and illegal, but for a pandas data frame, it works and makes for a much more concise expression.
More complex expressions are also allowed, of course. If you want to select only rows in which the amount per call is greater than 500, you could use this expression instead:
copy
Even better, you could calculate and add that column to your data frame by using a similar operation:
Again, note that pandas’s built-in logic replaces a more cumbersome structure in plain Python.
QUICK CHECK: SELECTING IN PYTHON
What Python code structure would you use to select only rows meeting certain conditions?
24.5.3. Grouping and aggregation
As you might expect, pandas has plenty of tools to summarize and aggregate data as well. In particular, getting the sum, mean, median, minimum, and maximum values from a column uses clearly named column methods:
If, for example, you want to get all of the rows in which the amount per call is above the median, you can combine this trick with the selection operation:
In addition to being able to pick out summary values, it’s often useful to group the data based on other columns. In this simple example, you can use the groupby method to group your data. You may want to know the total calls and amounts by month or by territory, for example. In those cases, use those fields with the data frame’s groupby method:
In each case, you select the columns that you want to aggregate, group them by the values in one of those columns, and (in this case) sum the values for each group. You could also use any of the other methods mentioned earlier in this chapter.
Again, all these examples are simple, but they illustrate a few of the options you have for manipulating and selecting data with pandas. If these ideas resonate with your needs, you can learn more by studying the pandas documentation at http://pandas.pydata.org.
TRY THIS: GROUPING AND AGGREGATING
Experiment with pandas and the data in previous examples. Can you get the calls and amounts by both team member and month?
24.6. Plotting data
Another very attractive feature of pandas is the ability to plot the data in a data frame very easily. Although you have many options for plotting data in Python and Jupyter notebook, pandas can use matplotlib directly from a data frame. You may recall that when you started your Jupyter session, one of the first commands you gave was the Jupyter “magic” command to enable matplotlib for inline plotting:
Because you have the ability to plot, see how you might plot some data (figure 24.3). To continue with the sales example, if you want to plot the quarter’s mean sales by territory, you can get a graph right in your notebook just by adding .plot.bar():
Figure 24.3. Bar plot of a pandas data frame in Jupyter notebook
Other options are available. plot() alone or .plot.line() creates a line graph, .plot.pie() creates a pie chart, and so on.
Thanks to the combination of pandas and matplotlib, plotting data in a Jupyter notebook is quite easy. I should also note that although such plotting is easy, there are many things that it doesn’t do extremely well.
TRY THIS: PLOTTING
Plot a line graph of the monthly average amount per call.
24.7. Why you might not want to use pandas
The preceding examples illustrate only a tiny fraction of the tools pandas can offer you in cleaning, exploring, and manipulating data. As I mentioned at the beginning of this chapter, pandas is an excellent tool set that excels in what it was designed to do. That doesn’t mean, however, that pandas is the tool for all situations or for all people.
There are reasons why you might elect to use plain old Python (or some other tool) instead. For one thing, as I mention earlier, learning to fully use pandas is in some ways like learning another language, which may not be something you have the time or inclination for. Also, pandas may not be ideal in all production situations, particularly with very large data sets that don’t require much in the way of math operations or with data that isn’t easy to put into the formats that work best with pandas. Munging large collections of product information, for example, probably wouldn’t benefit so much from pandas; neither would basic processing of a stream of transactions.
The point is that you should choose your tools thoughtfully based on the problems at hand. In many cases, pandas will truly make your life easier as you work with data, but in other cases, plain old Python may be your best bet.
Summary
Python offers many benefits for data handling, including the ability to handle very large data sets and the flexibility to handle data in ways that match your needs.
Jupyter notebook is a useful way to access Python via a web browser, which also makes improved presentation easier.
pandas is a tool that makes many common data-handling operations much easier, including cleaning, combining, and summarizing data.
pandas also makes simple plotting much easier.
Last updated