Have you ever complained that the table output looks boring when you call .head() in a Jupyter notebook? Is there a way to hide the index (especially when there is already an ID column)? There are ways to fix these issues.
The reader may have experienced the following issues when using .head(n) to inspect a dataframe:
(1) There are too many columns / rows in the dataframe, and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.
One can set
import pandas as pd
pd.options.display.max_columns = 50   # None -> no limit (show all columns)
pd.options.display.max_rows = 200    # None -> no limit; be careful with large dataframes
pd.options.display.max_colwidth = 100 # max width (in characters) per column
pd.options.display.precision = 3      # number of decimal places for floats
to solve these issues.
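These options change the settings globally. If you only want them for one display, pandas also provides `pd.option_context`, which applies settings inside a `with` block and reverts them afterwards. A small sketch (the example dataframe is made up):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.234567, 2.345678]})

# pd.option_context applies display settings only inside the with-block,
# so the global options are left untouched afterwards.
before = pd.options.display.precision
with pd.option_context('display.precision', 2, 'display.max_rows', 10):
    print(df)                              # rendered with 2 decimal places
    inside = pd.options.display.precision  # temporarily 2
after = pd.options.display.precision       # reverted to the previous value
```

This is handy in notebooks where you want one verbose display without changing the defaults for every later cell.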
3. Group by with multiple aggregations
In SQL we can do aggregations like
SELECT A, B, max(col_1), avg(col_1), sum(col_2), min(col_2), count(*)
FROM table
GROUP BY A, B
In Pandas it can be done with .groupby() and .agg():
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',
'sprite', 'coke', 'coke'],
B=['alpha','gamma', 'alpha', 'beta',
'gamma', 'beta', 'beta'],
col_1=[1,2,3,4,5,6,7],
col_2=[1,6,2,4,7,9,3]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', 'mean'],
                                  'col_2': ['sum', 'min', 'count']})
# 'count' is always the number of rows in each group.
The result is a table whose rows are multi-indexed by the (A, B) pairs and whose columns are multi-indexed with the column name on the first level and the aggregation function on the second. A quick way to turn it back into a dataframe without multi-indices is to join the two column levels into single names and then call .reset_index().
If you would like to automate the column renaming, you can use tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the names on each level and combine them.
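A minimal sketch of that renaming, reusing the example dataframe from above (the `'_'` separator is just one convention):

```python
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke'],
                       B=['alpha', 'gamma', 'beta'],
                       col_1=[1, 2, 3],
                       col_2=[1, 6, 2]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', 'mean'],
                                  'col_2': ['sum', 'min', 'count']})

# Combine the two column levels into single names like 'col_1_max'
tbl.columns = (tbl.columns.get_level_values(0) + '_'
               + tbl.columns.get_level_values(1))
# Turn the (A, B) row multi-index back into ordinary columns
tbl = tbl.reset_index()
print(tbl.columns.tolist())
```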
4. Column slicing
Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.
df.iloc[:, 2:5].head()  # select the columns at positions 2-4 (end exclusive)
df.loc[:, 'column_x':].head()
# select all columns starting from 'column_x'
5. Add row ID / random row ID to each group
To add a row ID / random row ID within each group (grouped by A and B), one can first append an ID / random ID to all rows:
import numpy as np
# df: target dataframe
np.random.seed(0)  # set random seed for reproducibility
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = list(range(1, df.shape[0] + 1))
To add an ID / random ID within each group (by A, B), one can then rank or count these columns within each group:
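A minimal sketch of that step, assuming a cumcount/rank-based approach (the `ID` and `random_ID` column names are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'gamma', 'alpha'],
                       col_1=[1, 2, 3, 4, 5]))

np.random.seed(0)  # set random seed for reproducibility
df['random_ID_all'] = np.random.permutation(df.shape[0])

# Sequential row ID within each (A, B) group: 1, 2, 3, ...
df['ID'] = df.groupby(['A', 'B']).cumcount() + 1

# Random row ID within each group: rank the global random IDs inside the group
df['random_ID'] = (df.groupby(['A', 'B'])['random_ID_all']
                     .rank(method='first').astype(int))
print(df)
```

Because the global random IDs are a random permutation, their within-group ranks give each group an independent random ordering from 1 to the group size.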
6. Aggregate values into a list of unique values
Sometimes after a group by, we'd like to aggregate the values in the target column as a list of unique values instead of taking the max, min, etc. This is how it's done.
df = pd.DataFrame(dict(A=['A','A','A','A','A','B','B','B','B'],
B=[1,1,1,2,2,1,1,1,2],
C=['CA','NY','CA','FL','FL',
'WA','FL','NY','WA']))
tbl = df[['A', 'B', 'C']].drop_duplicates()\
.groupby(['A','B'])['C']\
.apply(list)\
.reset_index() # list to string (separated by commas)
tbl['C'] = tbl.apply(lambda x: (','.join([str(s) for s in x['C']])), axis = 1)
If you'd like to save the result to a file, don't forget to change the separator to anything other than commas.
7. Add row total and column total to a numerical dataframe
This is another common data manipulation. All you need is .apply().
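A minimal sketch with .apply() (the `row_total` / `col_total` names are illustrative, and the example dataframe is made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Row total: sum across the columns of each row (axis=1)
df['row_total'] = df.apply(lambda row: row.sum(), axis=1)

# Column total: sum down each column (axis=0) and append it as a new row;
# note this also totals the freshly added 'row_total' column
df.loc['col_total'] = df.apply(lambda col: col.sum(), axis=0)
print(df)
```

The same result can be had with `df.sum(axis=1)` / `df.sum(axis=0)`, but .apply() generalizes to any per-row or per-column function.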
8. Check memory usage
`.memory_usage(deep=True)` can be used on a Pandas dataframe to see the amount of memory used (in bytes) by each column. It's useful when building machine learning models, which may require a lot of memory in training.
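A small illustration (the example dataframe is made up); without `deep=True`, object columns report only the size of their 8-byte pointers, not the strings themselves:

```python
import pandas as pd

df = pd.DataFrame({'id': range(1000),
                   'label': ['some fairly long string'] * 1000})

# deep=True measures the actual Python string objects,
# not just the object pointers stored in the column
mem_object = df.memory_usage(deep=True)
print(mem_object)

# Converting repetitive text to 'category' typically cuts memory sharply
df['label'] = df['label'].astype('category')
mem_category = df.memory_usage(deep=True)
print(mem_category)
```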
9. Cumulative sum
From time to time, a cumulative sum is required when you generate some statistical outcomes. Simply call `.cumsum()` on the column (or the whole dataframe).
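A minimal sketch (the `value` / `cum_value` names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]})

# Running total down the column
df['cum_value'] = df['value'].cumsum()
print(df['cum_value'].tolist())  # [1, 3, 6, 10]
```

A per-group running total works the same way: `df.groupby('key')['value'].cumsum()`.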