Have you ever complained about the table output looks boring when you do .head() in Jupyter notebooks? Is there a way not to display indexes (especially when there is already an ID column)? There’re ways to fix these issues.
import pandas as pd defcolor_negative_red(val): color ='red'if val <0else'black'return'color: %s'% colordf = pd.DataFrame(dict(col_1=[1.53,-2.5,3.53], col_2=[-4.1,5.9,0]) )df.style.applymap(color_negative_red)
The reader may have experienced the following issues when using .head(n) to check the dataframe:
(1) There’re too many columns / rows in the dataframe and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.
One can set
import pandas as pd pd.options.display.max_columns =50# None -> No Restrictions pd.options.display.max_rows =200# None -> Be careful with this pd.options.display.max_colwidth =100pd.options.display.precision =3
to solve these issues.
3. Group by with multiple aggregations
In SQL we can do aggregations like
SELECT A, B,max(A),avg(A),sum(B),min(B),count(*)FROM table GROUP BY A, B
In Pandas it can be done with .groupby() and .agg():
import pandas as pd import numpy as np df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite', 'sprite', 'coke', 'coke'], B=['alpha','gamma', 'alpha', 'beta', 'gamma', 'beta', 'beta'], col_1=[1,2,3,4,5,6,7], col_2=[1,6,2,4,7,9,3]))tbl = df.groupby(['A','B']).agg({'col_1': ['max', np.mean], 'col_2': ['sum','min','count']})# 'count' will always be the count for number of rows in each group.
And the result will look like this:
Both the rows and columns are multi-indexed. A quick solution to change it to a dataframe without multi-indices is
If you would like to have the column renaming process automated, you can do tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the indices in each level and combine them.
4. Column slicing
Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.
df.iloc[:,2:5].head()# select the 2nd to the 4th column df.loc[:,'column_x':].head()# select all columns starting from 'column_x'
5. Add row ID / random row ID to each group
To add a row ID / random row ID for each group by A, B, one can first append an ID / random ID to all rows:
import numpy as np # df: target dataframe np.random.seed(0) # set random seed df['random_ID_all']= np.random.permutation(df.shape[0])df['ID_all']= [i for i inrange(1, df.shape[0]+1)]
To add a random ID to each group (by A, B), one can then do
Sometimes after we performed group by, we’d like to aggregate the values in the target column as a list of unique values instead of max, min, …etc. This is how it’s done.
df = pd.DataFrame(dict(A=['A','A','A','A','A','B','B','B','B'], B=[1,1,1,2,2,1,1,1,2], C=['CA','NY','CA','FL','FL', 'WA','FL','NY','WA']))tbl = df[['A','B','C']].drop_duplicates()\.groupby(['A','B'])['C']\.apply(list)\.reset_index()# list to string (separated by commas) tbl['C']= tbl.apply(lambdax: (','.join([str(s) for s in x['C']])), axis =1)
If you’d like to save the result, don’t forget to change the separator to anything other than commas.
7. Add row total and column total to a numerical dataframe
This is another common data manipulation. All you need is .apply().
`.memory_usage(deep=True)` can be used on Pandas dataframes to see the amount of memory used (in bytes) for each column. It’s useful when building machine learning models which may require a lot memory in training.
9. Cumulative sum
From time to time, cumulative sum is required when you generate some statistical outcomes. Simply do `