19 Essential Snippets in Pandas
Aug 26, 2016
Last updated
Aug 26, 2016
Last updated
After playing around with Pandas Python Data Analysis Library for about a month, Iโve compiled a pretty large list of useful snippets that I find myself reusing over and over again. These tips can save you some time sifting through the comprehensive Pandas docs.
For this article, we are starting with a DataFrame filled with Pizza orders. If youโre brand new to Pandas, hereโs a few translations and key terms.
DataFrame - Indexed rows and columns of data, like a spreadsheet or database table.
Series = Single column of data.
Axis - 0 == Rows, 1 == Columns
Shape - (number_of_rows, number_of_columns) in a DataFrame
There are a ton of options for the read_csv function that can simplify preprocessing of data. Nobody want to waste time cleaning data, so see if you can knock it out when import the initial file.
Need to parse dates? Just pass in the corresponding column name(s).
Only need a few specific columns?
The first thing you probably want to do is see what the data looks like. Here a few ways to check out Pandas data.
Hereโs the head of the pizza DataFrameโฆ
order_number
date
size
topping
price
discount
coupon
0
PZZA0001
08/21/16
Small
Anchovies
12.99
3.5
Yes
1
PZZA0000
09/26/16
Large
Pepperoni
14.50
0.0
No
2
PZZA0001
09/27/16
Extra Large
Bell Pepper
19.99
0.0
No
3
PZZA0002
09/28/16
Extra Large
Olives
20.99
5.0
Yes
4
PZZA0003
09/29/16
Extra Large
Pepperoni
21.99
0.0
No
The quick and easy way is to just define a new column on the dataframe. This will give us column with the number 23 on every row. Usually, you will be setting the new column with an array or Series that matches the number of rows in the data.
Need to build a new column based on values from other columns?
Need the column in a certain order? The first argument is the position of the column. This will put the column at the begining of the DataFrame.
By cell I mean a single row/column intersection, like those in an Excel spreadsheet. You would expect this to be simple, but the syntax is not very obvious. There are three methods in Pandas that almost do the same thing, .loc, iloc, .ix โ adding to the confusion for newcomers.
Typically, I use .ix
because it allows a mix of integers and strings. Enter the index of the row first, then the column.
You can also select the column first with dot notation, then the row index, which looks a little cleaner.
Either method will return the value of the cell.
>>> 'Bell Pepper'
Letโs the we need to analyze orders that have pineapple in the topping column.
Or that meet a certain price threshold
How about both at the same time? Just add the conditions to tuples and connect them with a bitwise operator.
Now we have all the pizzas with a Pineapple topping priced over 11.99.
order_number
date
size
topping
price
discount
coupon
6
PZZA0006
10/01/16
Medium
Pineapple
17.50
0.0
No
9
PZZA0009
10/04/16
Medium
Pineapple
12.99
2.0
Yes
Pretty self-explanatory, but very useful.
Anonymous lambda functions in Python are useful for these tasks. Letโs say we need to calculate taxes for every row in the DataFrame with a custom function. The pandas apply method allows us to pass a function that will run on every value in a column. In this example, we extract a new taxes feature by running a custom function on the price data.
order_number
price
taxes
0
PZZA0000
12.99
1.5588
1
PZZA0001
14.50
1.7400
2
PZZA0002
19.99
2.3988
3
PZZA0003
20.99
2.5188
4
PZZA0004
21.99
2.6388
The where function in numpy is useful when extracting features with conditional logic. Letโs imagine the pizza parlor is only profitable on sales above $15.00. We create a new column based on this insight like so:
If you have a DataFrame with the same type of data in every column, possibly a time series with financial data, you may need to find he mean horizontally.
or to find the standard deviation vertically
Converting the the values in a DataFrame to an array is simple
If you want to preserve the table presentation
You can concatenate rows or columns together, the only requirement is that the shape is the same on corresponding axis. To concat rows vertically:
Or to concat columns horizontally:
Merging in Pandas works just like SQL. If you you have two DataFrames that share a key, perhaps a pizza โorder_idโ, you can perform inner, outer, left, right joins just like you would in SQL.
First, make sure the data is in datetime format. Then use dt
method to extract the data you need.
Count the total number of NaNs present:
List the NaN count for each column:
Most machine learning algorithms do not like NaN values, so youโll probably need to convert them. If the topping column is missing some values, we can fill them a default value.
or we can drop any row missing data across the entire DataFrame:
Grouping columns is a great way to extract features from data. This is especially useful when you have data that can be counted or quantified in some way. For example, you might have group pizzas by topping, then calculate the mean for price in each group.
or maybe you want to see the count of a certain value
Letโs say we want to create 3 separate bins for different price ranges. This is especially useful for simplifying noisy data.
order_number
price
price_point
0
PZZA0000
12.99
Normal
1
PZZA0001
14.50
Normal
2
PZZA0002
19.99
Expensive
3
PZZA0003
20.99
Expensive
4
PZZA0004
21.99
Expensive
Letโs say we want to categorize toppings by โvegetableโ or โmeatโ. Dealing with nominal values like these can be handled with a for loop. (Note: you can also use the apply function described earlier to perform this task. )
Sometimes you might have a massive file that will max out your RAM and crash your system. In that case, you might need to analyze the file in smaller chunks.
Reference : https://jeffdelaney.me