The Quick Python Book, Part 4. Working with data

Chapter 24. Exploring data

Over the past few chapters, I’ve dealt with some aspects of using Python to get and clean data. Now it’s time to look at a few of the things that Python can help you do to manipulate and explore data.

24.1. Python tools for data exploration

In this chapter, we’ll look at some common Python tools for data exploration: Jupyter notebook, pandas, and matplotlib. I can only touch briefly on a few features of these tools, but the aim is to give you an idea of what is possible and some initial tools to use in exploring data with Python.

24.1.1. Python’s advantages for exploring data

Python has become one of the leading languages for data science and continues to grow in that area. As I’ve mentioned, however, Python isn’t always the fastest language in terms of raw performance. Fortunately, some data-crunching libraries, such as NumPy, are largely written in C and heavily optimized to the point that speed isn’t an issue. In addition, considerations such as readability and accessibility often outweigh pure speed; minimizing the amount of developer time needed is often more important. Python is readable and accessible, and both on its own and in combination with tools developed in the Python community, it’s an enormously powerful tool for manipulating and exploring data.

24.1.2. Python can be better than a spreadsheet

Spreadsheets have been the tools of choice for ad-hoc data manipulation for decades. People who are skilled with spreadsheets can make them do truly impressive tricks: spreadsheets can combine different but related data sets, create pivot tables, use lookup tables to link data sets, and much more. But although people everywhere get a vast amount of work done with them every day, spreadsheets do have limitations, and Python can help you go beyond those limitations.

One limitation that I’ve already alluded to is the fact that most spreadsheet software has a row limit—currently, about 1 million rows, which isn’t enough for many data sets. Another limitation is the central metaphor of the spreadsheet itself. Spreadsheets are two-dimensional grids, rows and columns, or at best stacks of grids, which limits the ways you can manipulate and think about complex data.

With Python, you can code your way around the limitations of spreadsheets and manipulate data the way you want. You can combine Python data structures such as lists, tuples, sets, and dictionaries in endlessly flexible ways, or you can create your own classes to package both data and behavior exactly the way you need.
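As a small illustration of that flexibility (the sales data here is made up), a dictionary of lists can group and summarize records in a way that a flat two-dimensional grid makes awkward:

```python
# Hypothetical sales records: (region, amount) pairs.
sales = [("north", 120), ("south", 90), ("north", 75), ("east", 60)]

# Group the amounts by region with a dictionary of lists...
by_region = {}
for region, amount in sales:
    by_region.setdefault(region, []).append(amount)

# ...then summarize each group with a dictionary comprehension.
totals = {region: sum(amounts) for region, amounts in by_region.items()}
print(totals)  # {'north': 195, 'south': 90, 'east': 60}
```

The same grouping in a spreadsheet would typically require a pivot table or helper columns; here it's two short, composable steps.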

24.2. Jupyter notebook

Probably one of the most compelling tools for exploring data with Python doesn’t augment what the language itself does, but changes the way you use the language to interact with your data. Jupyter notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Although several other languages are now supported, it originated in connection with IPython, an alternative shell for Python developed by the scientific community.

What makes Jupyter such a convenient and powerful tool is the fact that you interact with it in a web browser. It lets you combine text and code, as well as modify and execute your code interactively. You can not only run and modify code in chunks, but also save and share the notebooks with others.

The best way to get a feel for what Jupyter notebook can do is start playing with it. It’s fairly easy to run a Jupyter process locally on your machine, or you can access online versions. For some options, see the sidebar on ways to run Jupyter.

WAYS TO RUN JUPYTER

Jupyter locally: Although using an online instance is quite convenient, it’s not very much work to set up your own instance of Jupyter on your computer. Usually for local versions, you point your browser to localhost:8888.

If you use Docker, you have several containers to choose among. To run the data science notebook container, use something like this:

docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

If you’d rather run directly on your system, it’s easy to install and run Jupyter in a virtualenv.

macOS and Linux systems: First, open a command window, and enter the following commands:

 > python3 -m venv jupyter
 > cd jupyter
 > source bin/activate
 > pip install jupyter
 > jupyter-notebook

Windows systems:

 > python3 -m venv jupyter
 > cd jupyter
 > Scripts\activate
 > pip install jupyter
 > Scripts\jupyter-notebook

The last command should run the Jupyter notebook web app and open a browser window pointing at it.

24.2.1. Starting a kernel

When you have Jupyter installed, running, and open in your browser, you need to start a Python kernel. One nice thing about Jupyter is that it lets you run multiple kernels at the same time. You can run kernels for different versions of Python and for other languages such as R, Julia, and even Ruby.

Figure 24.1. Starting a Python kernel

24.2.2. Executing code in a cell

Figure 24.2. Executing code in a notebook cell

You can test how it works by entering some code or an expression into the first cell of your new notebook and then pressing Alt-Enter.

As you can see, any output is shown immediately below the cell, and a new cell is created and ready for your next input. Also note that each cell that’s executed is numbered in the order in which it’s executed.

TRY THIS: USING JUPYTER NOTEBOOK

Enter some code in the notebook and experiment with running it. Check out the Edit, Cell, and Kernel menus to see what options are there. When you have a little code running, use the Kernel menu to restart the kernel, repeat your steps, and then use the Cell menu to rerun the code in all of the cells.

24.3. Python and pandas

In the course of exploring and manipulating data, you perform quite a few common operations, such as loading data into a list or dictionary, cleaning data, and filtering data. Most of these operations are repeated often, have to be done in standard patterns, and are simple and often tedious. If you think that this combination is a strong reason to automate those tasks, you’re not alone. One of the now-standard tools for handling data in Python—pandas—was created to automate the boring heavy lifting of handling data sets.

24.3.1. Why you might want to use pandas

pandas was created to make manipulating and analyzing tabular or relational data easy by providing a standard framework for holding the data, with convenient tools for frequent operations. As a result, it’s almost more of an extension to Python than a library, and it changes the way you can interact with data. The plus side is that after you grok how pandas works, you can do some impressive things and save a lot of time. It does take time to learn how to get the most from pandas, however. As with many tools, if you use pandas for what it was designed for, it excels. The simple examples I show you in the following sections should give you a rough idea whether pandas is a tool that’s suited for your use cases.

24.3.2. Installing pandas

pandas is easy to install with pip. It’s often used along with matplotlib for plotting, so you can install both tools from the command line of your Jupyter virtual environment with this code:

pip install pandas matplotlib

From a cell in a Jupyter notebook, you can use

!pip install pandas matplotlib


If you use pandas, life will be easier if you use the following three lines:

%matplotlib inline
import pandas as pd
import numpy as np

The first line is a Jupyter “magic” function that enables matplotlib to plot data in the cell where your code is (which is very useful). The second line imports pandas with the alias of pd, which is both easier to type and common among pandas users; the last line imports numpy. Although pandas depends quite a bit on numpy, you won’t use it explicitly in the following examples; even so, it’s reasonable to get into the habit of importing it anyway.

24.3.3. Data frames

One basic structure that you get with pandas is a data frame. A data frame is a two-dimensional grid, rather similar to a relational database table except in memory. Creating a data frame is easy; you give it some data. To keep things absolutely simple, give it a 3 × 3 grid of numbers as the first example. In Python, such a grid is a list of lists:

grid = [[1,2,3], [4,5,6], [7,8,9]]
print(grid)

# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


Sadly, in Python the grid won’t look like a grid unless you make some additional effort. Now see what you can do with the same grid as a pandas data frame:

import pandas as pd
df = pd.DataFrame(grid)
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

That code is fairly straightforward; all you needed to do was turn your grid into a data frame. You’ve gained a more gridlike display, and now you have both row and column numbers. It’s often rather bothersome to keep track of what column number is what, of course, so give your columns names:

df = pd.DataFrame(grid, columns=["one", "two", "three"] )
print(df)

   one  two  three
0    1    2      3
1    4    5      6
2    7    8      9

You may wonder whether naming the columns has any benefit, but the column names can be put to use with another pandas trick: the ability to select columns by name. If you want the contents only of column "two", for example, you can get it very simply:

print(df["two"])

0    2
1    5
2    8
Name: two, dtype: int64

Here, you’ve already saved time in comparison to plain Python. To get only column two of your grid, you’d need to use a list comprehension while also remembering to use a zero-based index (and you still wouldn’t get the nice output):

print([x[1] for x in grid])
[2, 5, 8]
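Incidentally, what df["two"] gives you back is a pandas Series (the "Name: two, dtype: int64" footer in the output above is the giveaway), and a Series brings its own vectorized methods along; a quick sketch:

```python
import pandas as pd

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(grid, columns=["one", "two", "three"])

column = df["two"]      # a pandas Series, not a plain list
print(column.sum())     # 15
print(column.mean())    # 5.0
```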

You can loop over data frame column values just as easily as the list you got by using a comprehension:

for x in df["two"]:
    print(x)
2
5
8

That’s not bad for a start, but by using a list of columns in double brackets, you can do better, getting a subset of the data frame that’s another data frame. Instead of getting the middle column, get the first and last columns of your data frame as another data frame:

edges = df[["one", "three"]]
print(edges)
   one  three
0    1      3
1    4      6
2    7      9


A data frame also has several methods that apply the same operation and argument to every item in the frame. If you want to add 2 to every item in the data frame edges, you could use the add() method:

print(edges.add(2))
   one  three
0    3      5
1    6      8
2    9     11

Here again, it’s possible to get the same result by using list comprehensions and/or nested loops, but those techniques aren’t as convenient. It’s pretty easy to see how such functionality can make life easier, particularly for someone who’s more interested in the information that the data contains than in the process of manipulating it.
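For comparison, here is one way to get the same result in plain Python with a nested list comprehension over the original grid (columns 0 and 2); it works, but you lose the column labels and the tidy display:

```python
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Take the first and last columns and add 2 to each value.
edges_plus_two = [[row[0] + 2, row[2] + 2] for row in grid]
print(edges_plus_two)  # [[3, 5], [6, 8], [9, 11]]
```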

24.4. Data cleaning

In earlier chapters, I discussed a few ways to use Python to clean data. Now that I’ve added pandas to the mix, I’ll show you examples of how to use its functionality to clean data. As I present the following operations, I also refer to ways that the same operation might be done in plain Python, both to illustrate how using pandas is different and to show why pandas isn’t right for every use case (or user, for that matter).

24.4.1. Loading and saving data with pandas

pandas has an impressive collection of methods to load data from different sources. It supports several file formats (including fixed-width and delimited text files, spreadsheets, JSON, XML, and HTML), but it’s also possible to read from SQL databases, Google BigQuery, HDF, and even clipboard data. You should be aware that many of these operations aren’t actually part of pandas itself; pandas relies on having other libraries installed to handle those operations, such as SQLAlchemy for reading from SQL databases. This distinction matters mostly if something goes wrong; quite often, the problem that needs to be fixed is outside pandas, and you’re left to deal with the underlying library.

Reading a JSON file with the read_json() method is simple:

mars = pd.read_json("mars_data_01.json")

This code gives you a data frame like this:

report
abs_humidity                         None
atmo_opacity                        Sunny
ls                                    296
max_temp                               -1
max_temp_fahrenheit                  30.2
min_temp                              -72
min_temp_fahrenheit                 -97.6
pressure                              869
pressure_string                    Higher
season                           Month 10
sol                                  1576
sunrise              2017-01-11T12:31:00Z
sunset               2017-01-12T00:46:00Z
terrestrial_date               2017-01-11
wind_direction                         --
wind_speed                           None
Reading a delimited file with the read_csv() method is just as simple:

temp = pd.read_csv("temp_data_01.csv")

          4      5    6     7     8      9    10    11    12     13    14 \

0  1979/01/01  17.48  994   6.0  30.5   2.89  994 -13.6  15.8    NaN    0
1  1979/01/02   4.64  994  -6.4  15.8  -9.03  994 -23.6   6.6    NaN    0
2  1979/01/03  11.05  994  -0.7  24.7  -2.17  994 -18.3  12.9    NaN    0
3  1979/01/04   9.51  994   0.2  27.6  -0.43  994 -16.3  16.3    NaN    0
4  1979/05/15  68.42  994  61.0  75.1  51.30  994  43.3  57.0    NaN    0
5  1979/05/16  70.29  994  63.4  73.5  48.09  994  41.1  53.0    NaN    0
6  1979/05/17  75.34  994  64.0  80.5  50.84  994  44.3  55.7  82.60    2
7  1979/05/18  79.13  994  75.5  82.1  55.68  994  50.0  61.1  81.42  349
8  1979/05/19  74.94  994  66.9  83.1  58.59  994  50.9  63.2  82.87   78

     15    16      17
0   NaN   NaN  0.0000
1   NaN   NaN  0.0000
2   NaN   NaN  0.0000
3   NaN   NaN  0.0000
4   NaN   NaN  0.0000
5   NaN   NaN  0.0000
6  82.4  82.8  0.0020
7  80.2  83.4  0.3511
8  81.6  85.2  0.0785

Clearly, loading the file in a single step is appealing, and you can see that pandas had no issues loading the file. You can also see that the empty first column has been translated into NaN (not a number). You do still have the same issue with 'Missing' for some values, and in fact it might make sense to have those 'Missing' values converted to NaN:

temp = pd.read_csv("temp_data_01.csv", na_values=['Missing'])

The addition of the na_values parameter controls what values will be translated to NaN on load. In this case, you added the string 'Missing' so that the row of the data frame was translated from

NaN  Illinois  17  Jan 01, 1979  1979/01/01  17.48  994  6.0  30.5  2.89  994
     -13.6  15.8  Missing  0  Missing  Missing  0.00%

to

NaN  Illinois  17  Jan 01, 1979  1979/01/01  17.48  994  6.0  30.5  2.89  994
     -13.6  15.8  NaN  0  NaN  NaN  0.00%

This technique can be particularly useful if you have one of those data files in which, for whatever reason, “no data” is indicated in a variety of ways: NA, N/A, ?, -, and so on. To handle a case like that, you can inspect the data to find out what’s used and then reload it, using the na_values parameter to standardize all those variations as NaN.
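As a self-contained sketch of that technique (using made-up in-memory data rather than the temperature file), passing a list to na_values standardizes all of those markers in one pass:

```python
import io
import pandas as pd

# Hypothetical data in which "no data" is written three different ways.
raw = io.StringIO("day,temp\n1,17.4\n2,N/A\n3,?\n4,-\n5,4.6\n")

df = pd.read_csv(raw, na_values=["N/A", "?", "-"])
print(df["temp"].isna().sum())  # 3
```

All three markers now load as NaN, so downstream code only has one "missing" representation to deal with.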

Saving data

If you want to save the contents of a data frame, a pandas data frame has a similarly broad collection of methods. If you take your simple grid data frame, you can write it in several ways. This line

df.to_csv("df_out.csv", index=False)

writes a file that looks like this:

one,two,three
1,2,3
4,5,6
7,8,9


Similarly, you can transform a data frame to a JSON object or write it to a file:

df.to_json()

'{"one":{"0":1,"1":4,"2":7},"two":{"0":2,"1":5,"2":8},"three":{"0":3,"1":6,"2
     ":9}}'
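The shape of that JSON is controlled by to_json()'s orient parameter; orient="records", for example, emits one object per row, which is often what web front ends expect:

```python
import pandas as pd

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(grid, columns=["one", "two", "three"])

# One JSON object per row instead of one per column.
records = df.to_json(orient="records")
print(records)
```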

24.4.2. Data cleaning with a data frame

Converting a particular set of values to NaN on load is a very simple bit of data cleaning that pandas makes trivial. Going beyond that, data frames support several operations that can make data cleaning less of a chore. To see how this works, reopen the temperature CSV file, but this time, instead of using the headers to name the columns, use the range() function with the names parameter to give them numbers, which will make referring to them easier. You also may recall from an earlier example that the first field of every line—the "Notes" field—is empty and loaded with NaN values. Although you could ignore this column, it would be even easier if you didn’t have it. You can use the range() function again, this time starting from 1, to tell pandas to load all columns except the first one. But if you know that all of your values are from Illinois and you don’t care about the long-form date field, you could start from 4 to make things much more manageable:

temp = pd.read_csv("temp_data_01.csv", na_values=['Missing'], header=0,
     names=range(18), usecols=range(4,18))
print(temp)

           4      5    6     7     8      9    10    11    12     13   14  \
0  1979/01/01  17.48  994   6.0  30.5   2.89  994 -13.6  15.8    NaN    0
1  1979/01/02   4.64  994  -6.4  15.8  -9.03  994 -23.6   6.6    NaN    0
2  1979/01/03  11.05  994  -0.7  24.7  -2.17  994 -18.3  12.9    NaN    0
3  1979/01/04   9.51  994   0.2  27.6  -0.43  994 -16.3  16.3    NaN    0
4  1979/05/15  68.42  994  61.0  75.1  51.30  994  43.3  57.0    NaN    0
5  1979/05/16  70.29  994  63.4  73.5  48.09  994  41.1  53.0    NaN    0
6  1979/05/17  75.34  994  64.0  80.5  50.84  994  44.3  55.7  82.60    2
7  1979/05/18  79.13  994  75.5  82.1  55.68  994  50.0  61.1  81.42  349
8  1979/05/19  74.94  994  66.9  83.1  58.59  994  50.9  63.2  82.87   78

     15    16      17
0   NaN   NaN   0.00%
1   NaN   NaN   0.00%
2   NaN   NaN   0.00%
3   NaN   NaN   0.00%
4   NaN   NaN   0.00%
5   NaN   NaN   0.00%
6  82.4  82.8   0.20%
7  80.2  83.4  35.11%
8  81.6  85.2   7.85%

Now you have a data frame that has only the columns you might want to work with. But you still have an issue: the last column, which lists the percentage of coverage for the heat index, is still a string ending with a percentage sign rather than an actual percentage. This problem is apparent if you look at the first row’s value for column 17:

temp[17][0]

'0.00%'

To fix this problem, you need to do two things: Remove the % from the end of the value and then cast the value from string to a number. Optionally, if you want to represent the resulting percentage as a fraction, you need to divide it by 100. The first bit is simple because pandas lets you use a single command to repeat an operation on a column:

temp[17] = temp[17].str.strip("%")
temp[17][0]

'0.00'

This code takes the column and calls a string strip() operation on it to remove the trailing %. Now when you look at the first value in the column (or any of the other values), you see that the offending percentage sign is gone. It’s also worth noting that you could have used other operations, such as replace("%", ""), to achieve the same result.
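The replace() variation might look like the following sketch; the Series here is a small stand-in with the same trailing-percent problem:

```python
import pandas as pd

# A stand-in for column 17 of the temperature data
s = pd.Series(["0.00%", "35.11%", "7.85%"])

# str.replace substitutes every occurrence of "%" with an empty string;
# regex=False treats the pattern as a literal character, not a regex
cleaned = s.str.replace("%", "", regex=False)
print(cleaned.tolist())
```

strip("%") removes the character only at the ends of the string, whereas replace() removes it anywhere; for this data, the two give the same result.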

The second operation is to convert the string to a numeric value. Again, pandas lets you perform this operation with one command:

temp[17] = pd.to_numeric(temp[17])
temp[17][0]
0.0

Now the values in column 17 are numeric, and if you want to, you can use the div() method to finish the job of turning those values into fractions:

temp[17] = temp[17].div(100)
temp[17]

0    0.0000
1    0.0000
2    0.0000
3    0.0000
4    0.0000
5    0.0000
6    0.0020
7    0.3511
8    0.0785
Name: 17, dtype: float64

In fact, it would be possible to achieve the same result in a single line by chaining the three operations together:

temp[17] = pd.to_numeric(temp[17].str.strip("%")).div(100)

This example is very simple, but it gives you an idea of the convenience that pandas can bring to cleaning your data. pandas has a wide variety of operations for transforming data, as well as the ability to use custom functions, so it would be hard to think of a scenario in which you couldn’t streamline data cleaning with pandas.

TRY THIS: CLEANING DATA WITH AND WITHOUT PANDAS

Experiment with the operations. When the final column has been converted to a fraction, can you think of a way to convert it back to a string with the trailing percentage sign?

By contrast, load the same data into a plain Python list by using the csv module, and apply the same changes by using plain Python.

24.5. Data aggregation and manipulation

The preceding examples probably gave you some idea of the many options pandas gives you for performing fairly complex operations on your data with only a few commands. As you might expect, this level of functionality is also available for aggregating data. In this section, I walk through a few simple examples of aggregating data to illustrate some of the many possibilities. Although many options are available, I focus on merging data frames, performing simple data aggregation, and grouping and filtering.

24.5.1. Merging data frames

Quite often in the course of handling data, you need to relate two data sets. Suppose that you have one file containing the number of sales calls made per month by members of a sales team, and in another file, you have the dollar amounts of the sales in each of their territories:

calls = pd.read_csv("sales_calls.csv")
print(calls)

   Team member  Territory  Month  Calls
0        Jorge          3      1    107
1        Jorge          3      2     88
2        Jorge          3      3     84
3        Jorge          3      4    113
4          Ana          1      1     91
5          Ana          1      2    129
6          Ana          1      3     96
7          Ana          1      4    128
8          Ali          2      1    120
9          Ali          2      2     85
10         Ali          2      3     87
11         Ali          2      4     87

revenue = pd.read_csv("sales_revenue.csv")
print(revenue)

    Territory  Month  Amount
0           1      1   54228
1           1      2   61640
2           1      3   43491
3           1      4   52173
4           2      1   36061
5           2      2   44957
6           2      3   35058
7           2      4   33855
8           3      1   50876
9           3      2   57682
10          3      3   53689
11          3      4   49173

Clearly, it would be very useful to link revenue and team-member activity. These two files are very simple, yet merging them with plain Python isn’t entirely trivial. pandas has a function to merge two data frames:

calls_revenue = pd.merge(calls, revenue, on=['Territory', 'Month'])

The merge function creates a new data frame by joining the two frames on the columns specified in the on parameter. It works similarly to a relational-database join, giving you a table that combines the columns from the two frames:

print(calls_revenue)
   Team member  Territory  Month  Calls  Amount
0        Jorge          3      1    107   50876
1        Jorge          3      2     88   57682
2        Jorge          3      3     84   53689
3        Jorge          3      4    113   49173
4          Ana          1      1     91   54228
5          Ana          1      2    129   61640
6          Ana          1      3     96   43491
7          Ana          1      4    128   52173
8          Ali          2      1    120   36061
9          Ali          2      2     85   44957
10         Ali          2      3     87   35058
11         Ali          2      4     87   33855

In this case, you have a one-to-one correspondence between the rows in the two frames, but the merge function can also do one-to-many and many-to-many joins, as well as right and left joins.
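A left join, for example, is selected with merge's how parameter. The frames below are hypothetical stand-ins, trimmed down so the effect is easy to see:

```python
import pandas as pd

# Hypothetical data: territory 4 has a rep but no revenue row
reps = pd.DataFrame({"Territory": [1, 2, 3, 4],
                     "Rep": ["Ana", "Ali", "Jorge", "Kim"]})
rev = pd.DataFrame({"Territory": [1, 2, 3],
                    "Amount": [54228, 36061, 50876]})

# how="left" keeps every row of the left frame; rows with no match
# in the right frame get NaN in the columns that come from it
merged = pd.merge(reps, rev, on="Territory", how="left")
print(merged)
```

With the default how="inner", the unmatched territory 4 row would simply be dropped; the left join keeps it and flags the missing revenue as NaN.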

QUICK CHECK: MERGING DATA SETS

How would you go about merging two data sets like these by using plain Python?

24.5.2. Selecting data

It can also be useful to select or filter the rows in a data frame based on some condition. In the example sales data, you may want to look only at territory 3, which is also easy:

print(calls_revenue[calls_revenue.Territory==3])

  Team member  Territory  Month  Calls  Amount
0       Jorge          3      1    107   50876
1       Jorge          3      2     88   57682
2       Jorge          3      3     84   53689
3       Jorge          3      4    113   49173

In this example, you select only the rows in which the territory is equal to 3 by using exactly that expression, calls_revenue.Territory==3, as the index for the data frame. From the point of view of plain Python, such use is nonsense and illegal, but for a pandas data frame, it works and makes for a much more concise expression.
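What makes this work is that the comparison by itself produces a boolean Series, which the data frame then uses as a row mask. A cut-down stand-in for the sales data shows the two steps separately:

```python
import pandas as pd

# A small stand-in for calls_revenue
df = pd.DataFrame({"Territory": [3, 3, 1, 2],
                   "Calls": [107, 88, 91, 120]})

# The comparison alone yields a boolean Series, one value per row...
mask = df.Territory == 3
print(mask.tolist())

# ...and indexing the frame with that Series keeps only the True rows
print(df[mask])
```

Writing df[df.Territory == 3] just inlines those two steps into one expression.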

More complex expressions are also allowed, of course. If you want to select only rows in which the amount per call is greater than 500, you could use this expression instead:

print(calls_revenue[calls_revenue.Amount/calls_revenue.Calls>500])

  Team member  Territory  Month  Calls  Amount
1       Jorge          3      2     88   57682
2       Jorge          3      3     84   53689
4         Ana          1      1     91   54228
9         Ali          2      2     85   44957

Even better, you could calculate and add that column to your data frame by using a similar operation:

calls_revenue['Call_Amount'] = calls_revenue.Amount/calls_revenue.Calls
print(calls_revenue)

   Team member  Territory  Month  Calls  Amount  Call_Amount
0        Jorge          3      1    107   50876   475.476636
1        Jorge          3      2     88   57682   655.477273
2        Jorge          3      3     84   53689   639.154762
3        Jorge          3      4    113   49173   435.159292
4          Ana          1      1     91   54228   595.912088
5          Ana          1      2    129   61640   477.829457
6          Ana          1      3     96   43491   453.031250
7          Ana          1      4    128   52173   407.601562
8          Ali          2      1    120   36061   300.508333
9          Ali          2      2     85   44957   528.905882
10         Ali          2      3     87   35058   402.965517
11         Ali          2      4     87   33855   389.137931

Again, note that pandas’s built-in logic replaces a more cumbersome structure in plain Python.

QUICK CHECK: SELECTING IN PYTHON

What Python code structure would you use to select only rows meeting certain conditions?

24.5.3. Grouping and aggregation

As you might expect, pandas has plenty of tools to summarize and aggregate data as well. In particular, getting the sum, mean, median, minimum, and maximum values from a column uses clearly named column methods:

print(calls_revenue.Calls.sum())
print(calls_revenue.Calls.mean())
print(calls_revenue.Calls.median())
print(calls_revenue.Calls.max())
print(calls_revenue.Calls.min())
1215
101.25
93.5
129
84

If, for example, you want to get all of the rows in which the amount per call is at or above the median, you can combine this trick with the selection operation:

print(calls_revenue.Call_Amount.median())
print(calls_revenue[calls_revenue.Call_Amount >=
     calls_revenue.Call_Amount.median()])

464.2539427570093
  Team member  Territory  Month  Calls  Amount  Call_Amount
0       Jorge          3      1    107   50876   475.476636
1       Jorge          3      2     88   57682   655.477273
2       Jorge          3      3     84   53689   639.154762
4         Ana          1      1     91   54228   595.912088
5         Ana          1      2    129   61640   477.829457
9         Ali          2      2     85   44957   528.905882

In addition to being able to pick out summary values, it’s often useful to group the data based on other columns. In this simple example, you can use the groupby method to group your data. You may want to know the total calls and amounts by month or by territory, for example. In those cases, use those fields with the data frame’s groupby method:

print(calls_revenue[['Month', 'Calls', 'Amount']].groupby(['Month']).sum())

       Calls  Amount
Month
1        318  141165
2        302  164279
3        267  132238
4        328  135201

print(calls_revenue[['Territory', 'Calls',
     'Amount']].groupby(['Territory']).sum())

           Calls  Amount
Territory
1            444  211532
2            379  149931
3            392  211420

In each case, you select the columns that you want to aggregate, group them by the values in one of those columns, and (in this case) sum the values for each group. You could also use any of the other methods mentioned earlier in this chapter.
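Swapping in one of those other methods is a one-word change. With a cut-down stand-in for the sales data, the mean calls per territory looks like this:

```python
import pandas as pd

# A small stand-in for the sales data
df = pd.DataFrame({"Territory": [1, 1, 2, 2],
                   "Calls": [91, 129, 120, 85]})

# Any of the aggregation methods works on a grouped frame;
# here, the mean number of calls per territory
by_territory = df.groupby("Territory").mean()
print(by_territory)
```

median(), max(), and min() drop in the same way, so exploring a data set from several angles takes only a few keystrokes.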

TRY THIS: GROUPING AND AGGREGATING

Experiment with pandas and the data in previous examples. Can you get the calls and amounts by both team member and month?

24.6. Plotting data

Another very attractive feature of pandas is the ability to plot the data in a data frame very easily. Although you have many options for plotting data in Python and Jupyter notebook, pandas can use matplotlib directly from a data frame. You may recall that when you started your Jupyter session, one of the first commands you gave was the Jupyter “magic” command to enable matplotlib for inline plotting:

%matplotlib inline
calls_revenue[['Territory', 'Calls']].groupby(['Territory']).sum().plot.bar()

Figure 24.3. Bar plot of a pandas data frame in Jupyter notebook

Other options are available. plot() alone or .plot.line() creates a line graph, .plot.pie() creates a pie chart, and so on.

Thanks to the combination of pandas and matplotlib, plotting data in a Jupyter notebook is quite easy. I should also note that although such plotting is easy, it isn't highly polished or flexible; for finer control of a chart's appearance, you'll need to work with the plotting library directly.

TRY THIS: PLOTTING

Plot a line graph of the monthly average amount per call.

24.7. Why you might not want to use pandas

The preceding examples illustrate only a tiny fraction of the tools pandas can offer you in cleaning, exploring, and manipulating data. As I mentioned at the beginning of this chapter, pandas is an excellent tool set that excels in what it was designed to do. That doesn’t mean, however, that pandas is the tool for all situations or for all people.

There are reasons why you might elect to use plain old Python (or some other tool) instead. For one thing, as I mention earlier, learning to fully use pandas is in some ways like learning another language, which may not be something you have the time or inclination for. Also, pandas may not be ideal in all production situations, particularly with very large data sets that don’t require much in the way of math operations or with data that isn’t easy to put into the formats that work best with pandas. Munging large collections of product information, for example, probably wouldn’t benefit so much from pandas; neither would basic processing of a stream of transactions.

The point is that you should choose your tools thoughtfully based on the problems at hand. In many cases, pandas will truly make your life easier as you work with data, but in other cases, plain old Python may be your best bet.

Summary

  • Python offers many benefits for data handling, including the ability to handle very large data sets and the flexibility to handle data in ways that match your needs.

  • Jupyter notebook is a useful way to access Python via a web browser, which also makes improved presentation easier.

  • pandas is a tool that makes many common data-handling operations much easier, including cleaning, combining, and summarizing data.

  • pandas also makes simple plotting much easier.
