Pandas and Python: Top 10
MAR 7TH, 2013 1:43 PM
I recently discovered the high-performance Pandas library written in Python while performing data munging in a machine learning project. Using simple examples, I want to highlight my favorite (and sometimes hard to find) features.
Apart from serving as a quick reference, I hope this post will help new users quickly start extracting value from Pandas. For a good overview of Pandas and its advanced features, I highly recommend Wes McKinney’s Python for Data Analysis book and the documentation on the website.
Here is my top 10 list:
Example DataFrame
I will use a simple data frame for explanation.
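A minimal sketch of such a frame follows. The column names int_col, float_col and str_col come from the text; the values are illustrative assumptions, chosen so that one float and one string entry are missing.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values, not necessarily the original data).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

print(df)
```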
Indexing
Selecting a subset of columns
This is one of the simplest features, but it was surprisingly difficult to find. The ix method works elegantly for this purpose. Suppose you want to select only the columns int_col and str_col: you would use the advanced indexing ix method as shown below.
EDIT Suggestion by Dan in the comments below. Another (probably more elegant) syntax for indexing multiple columns is given below.
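Here is a hedged sketch of both styles of column selection. Note that the ix indexer was removed in pandas 1.0, so the label-based loc indexer is swapped in for it here; the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Label-based selection: all rows, two named columns (stands in for ix).
subset_loc = df.loc[:, ["int_col", "str_col"]]

# Shorter syntax: index the frame with a list of column names.
subset_list = df[["int_col", "str_col"]]

print(subset_list)
```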
Conditional indexing
One can select rows using boolean indexing.
EDIT Suggestion by Roby Levy in the comments below. One can combine multiple boolean conditions using the operators | (or), & (and), and ~ (not), grouping them with parentheses.
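A small sketch of boolean indexing, first with a single condition and then with combined ones. The thresholds and frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Single condition; comparisons against NaN evaluate to False.
large = df[df["float_col"] > 0.15]

# Each condition needs its own parentheses before combining.
both = df[(df["float_col"] > 0.1) & (df["int_col"] > 2)]
either = df[(df["float_col"] > 0.1) | (df["int_col"] > 2)]
negated = df[~(df["float_col"] > 0.1)]

print(both)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.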
Renaming columns
Use the rename method to rename columns. It copies the data to another DataFrame.
Set the inplace=True flag in case you want to modify the existing DataFrame.
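A short sketch of both variants. The new column name some_other_name is a hypothetical choice, and the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Returns a new DataFrame; df itself keeps its original column names.
renamed = df.rename(columns={"int_col": "some_other_name"})

# With inplace=True the frame is modified directly (done on a copy here
# so that df stays available for comparison).
df_inplace = df.copy()
df_inplace.rename(columns={"int_col": "some_other_name"}, inplace=True)

print(renamed.columns.tolist())
```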
Handling missing values
Handling of missing values can be performed beautifully using pandas.
Drop missing values
The dropna method can be used to drop rows or columns with missing data (NaN). By default, it drops all rows with any missing entry.
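As a quick sketch, assuming a frame with one missing float and one missing string (illustrative values):

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Default: drop every row that has at least one missing entry.
rows_dropped = df.dropna()

# axis=1 drops columns with missing entries instead of rows.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
```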
Fill missing values
The fillna method on the other hand can be used to fill missing data (NaN). The example below shows a simple replacement using the mean of the available values.
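A minimal sketch of mean replacement on float_col, using illustrative values. The result is assigned back to the column rather than filled in place.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# mean() skips NaN by default, so this is the mean of the available values.
mean = df["float_col"].mean()

filled = df.copy()
filled["float_col"] = filled["float_col"].fillna(mean)

print(filled)
```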
Map, Apply
Forget writing for loops while using pandas. One can do beautiful vectorized computation by applying functions over rows and columns using the map, apply and applymap methods.
map
The map operation operates over each element of a Series.
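A small sketch of map on two columns. The lambdas and frame values are illustrative assumptions; missing strings are dropped first so the function only sees real values.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Apply a function to every element of a numeric Series.
doubled = df["int_col"].map(lambda x: x * 2)

# Apply a function to every non-missing element of a string Series.
labels = df["str_col"].dropna().map(lambda s: "map_" + s)

print(labels)
```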
apply
The apply method is a pretty flexible function which, as the name suggests, applies a function along any axis of the DataFrame. The examples show the application of the sum function over columns. (Thanks to Mindey in the comments below for suggesting np.sum instead of np.sqrt in the example.)
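A hedged sketch of apply with np.sum along each axis, restricted to the numeric columns of an assumed example frame. Depending on the pandas version, passing np.sum may emit a warning about how NumPy callables are dispatched.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

numeric = df[["int_col", "float_col"]]

# axis=0: the function receives one column at a time.
col_sums = numeric.apply(np.sum, axis=0)

# axis=1: the function receives one row at a time.
row_sums = numeric.apply(np.sum, axis=1)

print(col_sums)
```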
applymap
The applymap operation can be used to apply a function to each element of the DataFrame.
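A sketch of element-wise application follows. Recent pandas releases rename applymap to DataFrame.map, so the code picks whichever of the two exists; the formatting function and frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

numeric = df[["int_col", "float_col"]]

# Use DataFrame.map where available, otherwise fall back to applymap.
elementwise = numeric.map if hasattr(numeric, "map") else numeric.applymap

# Format every element as a string with two decimal places.
formatted = elementwise(lambda x: "%.2f" % x)

print(formatted)
```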
Vectorized mathematical and string operations
HT: @janschulz
One can perform vectorized calculations using simple operators and numpy functions.
Also, vectorized string operations are easy to use.
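A brief sketch of both kinds of vectorized operation, using an assumed example frame. Missing values simply propagate through the results.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Arithmetic operators work column-wise without explicit loops.
total = df["int_col"] + df["float_col"]

# NumPy functions accept a Series and return a Series.
roots = np.sqrt(df["int_col"].abs())

# String methods are available through the .str accessor.
upper = df["str_col"].str.upper()

print(upper)
```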
GroupBy
The groupby method lets you perform SQL-like grouping operations. The example below shows a grouping operation performed with the entries of the str_col column as keys. It is used to calculate the mean of float_col for each key. For more details, please refer to the split-apply-combine description on the pandas website.
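A minimal sketch of that grouping, assuming illustrative frame values. Rows whose key is missing are left out of the groups by default.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Group rows by str_col, then average float_col within each group.
means = df.groupby("str_col")["float_col"].mean()

print(means)
```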
New Columns = f(Existing Columns)
Generating new columns from existing columns in a data frame is an integral part of my workflow. This was one of the hardest parts for me to figure out. I hope these examples will save time and effort for other people.
I will try to illustrate it in a piecemeal manner – multiple columns as a function of a single column, single column as a function of multiple columns, and finally multiple columns as a function of multiple columns.
multiple columns as a function of a single column
I often have to generate multiple columns of a DataFrame as a function of a single column. This comes in handy especially when methods return tuples. Related Stack Overflow question
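One way to sketch this is to map a tuple-returning function over the column and unzip the result. The helper two_three and the new column names are hypothetical, and the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})


def two_three(x):
    """Hypothetical helper returning two values per input element."""
    return x * 2, x * 3


df4 = df.copy()

# map yields one tuple per row; zip(*...) regroups them into one
# sequence per output column.
df4["twice"], df4["thrice"] = zip(*df4["int_col"].map(two_three))

print(df4)
```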
single column as a function of multiple columns
It’s sometimes useful to generate a single DataFrame column as a function of multiple existing columns. Related Stack Overflow question
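For the case named in the heading (one new column computed from several existing ones), a sketch using apply along axis=1 could look like this. The helper sum_two_cols and the column name sum_col are hypothetical, and the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})


def sum_two_cols(row):
    """Hypothetical helper combining two fields of a single row."""
    return row["int_col"] + row["float_col"]


df5 = df.copy()

# axis=1 passes each row to the function as a Series.
df5["sum_col"] = df5.apply(sum_two_cols, axis=1)

print(df5)
```

For plain arithmetic like this, df["int_col"] + df["float_col"] gives the same result faster; the row-wise form is useful when the function cannot be expressed with column operators.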
multiple columns as a function of multiple columns
Finally, a way to generate a new DataFrame with multiple columns based on multiple columns in an existing DataFrame. Related Stack Overflow question
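One possible sketch returns a Series from the row-wise function, so that apply assembles a new DataFrame. The helper int_float_squares and the output column names are hypothetical, and the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})


def int_float_squares(row):
    """Hypothetical helper: the Series index becomes the new column names."""
    return pd.Series({
        "int_sq": row["int_col"] ** 2,
        "flt_sq": row["float_col"] ** 2,
    })


# Each returned Series becomes one row of the resulting DataFrame.
squares = df.apply(int_float_squares, axis=1)

print(squares)
```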
Stats
Pandas provides nifty methods to understand your data. I am highlighting the describe, covariance, and correlation methods that I use to quickly make sense of my data.
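The three methods can be tried together in a short sketch. The numeric columns are selected explicitly here, because newer pandas versions no longer silently skip non-numeric columns in cov and corr; the frame values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

numeric = df[["int_col", "float_col"]]

summary = numeric.describe()   # count, mean, std, min, quartiles, max
covariance = numeric.cov()     # pairwise covariance matrix
correlation = numeric.corr()   # pairwise Pearson correlation matrix

print(summary)
```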
describe
The describe method provides quick stats on all suitable columns.
covariance
The cov method provides the covariance between suitable columns.
correlation
The corr method provides the correlation between suitable columns.
Merge and Join
Pandas supports database-like joins, which make it easy to link data frames.
I will use a simple example to highlight the joins using the merge function.
The inner, outer, left and right joins are shown below. The data frames are joined using the str_col keys.
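A hedged sketch of the four join types on str_col. The second frame other and its some_val column are hypothetical, and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Hypothetical second frame sharing the str_col key.
other = pd.DataFrame({"str_col": ["a", "b"], "some_val": [1, 2]})

inner = pd.merge(df, other, on="str_col", how="inner")  # keys in both frames
outer = pd.merge(df, other, on="str_col", how="outer")  # keys in either frame
left = pd.merge(df, other, on="str_col", how="left")    # all rows of df
right = pd.merge(df, other, on="str_col", how="right")  # all rows of other

print(inner)
```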
Plot
I was thoroughly surprised by the plotting capabilities of the pandas library. There are several plotting methods available. I am highlighting a couple of simple plots that I use the most.
Let’s start with a simple data frame to plot.
Plot
A simple plot command goes a long way.
Histograms
I really like using histograms to get a quick idea about the distribution of the data.
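Both plot types from this section can be sketched on a frame of random numbers. The frame plot_df, its columns x and y, and the fixed seed are assumptions for illustration. The sketch needs matplotlib installed, and the Agg backend is selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import numpy as np
import pandas as pd

# Hypothetical frame to plot: two columns of normally distributed values.
rng = np.random.default_rng(0)
plot_df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["x", "y"])
plot_df["y"] = plot_df["y"] + 1  # shift y so the two columns differ

# Line plot: one line per column, plotted against the index.
line_ax = plot_df.plot()

# Histograms: one subplot per column.
hist_axes = plot_df.hist()
```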
Scikit-learn conversion
This took me a non-trivial amount of time to figure out, and I hope others can avoid this mistake. According to the pandas documentation, the ndarray obtained via the values attribute has object dtype if the DataFrame contains anything other than float and integer dtypes. Even if you slice the str columns away, the resulting array will still have object dtype and might not play well with other libraries such as scikit-learn, which expect a float dtype. Explicitly converting the type works well in this scenario.
EDIT HT: Wes Turner via comments. The sklearn-pandas library looks great for bridging pandas and scikit-learn.
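A small sketch of the pitfall and the explicit conversion, using an assumed example frame whose first two columns are numeric:

```python
import numpy as np
import pandas as pd

# Assumed example frame (illustrative values).
df = pd.DataFrame({
    "int_col": [1, 2, 6, 8, -1],
    "float_col": [0.1, 0.2, 0.2, 10.1, np.nan],
    "str_col": ["a", "b", None, "c", "a"],
})

# Mixed column types force a single object-dtype array.
values = df.values
print(values.dtype)

# Slicing away the string column does not change the dtype.
numeric_slice = values[:, :2]
print(numeric_slice.dtype)

# Explicit conversion gives a float array that numeric libraries expect.
as_float = numeric_slice.astype(float)
print(as_float.dtype)
```

Selecting the numeric columns before calling values, for example df[["int_col", "float_col"]].values, avoids the object array in the first place.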
Summary
I hope these examples will help new users quickly extract a lot of value out of pandas and serve as a useful quick reference for the pandas pros.
Happy munging!
Posted by Manish Amde Mar 7th, 2013 1:43 pm introduction, machine learning, pandas, python, tutorial
Reference : http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/