Have you ever complained that the table output looks boring when you run .head() in a Jupyter notebook? Is there a way not to display the index (especially when there is already an ID column)? There are ways to fix these issues.
1. Highlight all negative values in a dataframe (example revised from )
import pandas as pd

def color_negative_red(val):
    # return a CSS color string: red for negative values, black otherwise
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

df = pd.DataFrame(dict(col_1=[1.53, -2.5, 3.53],
                       col_2=[-4.1, 5.9, 0]))
df.style.applymap(color_negative_red)  # note: renamed to Styler.map in recent pandas versions
2. Change the display options
The reader may have experienced the following issues when using .head(n) to check a dataframe:
(1) There are too many columns / rows in the dataframe, and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.
One can set
import pandas as pd
pd.options.display.max_columns = 50 # None -> No Restrictions
pd.options.display.max_rows = 200 # None -> Be careful with this
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3
to solve these issues.
3. Group by with multiple aggregations
In SQL, we can do aggregations like
SELECT A, B, max(col_1), avg(col_1), sum(col_2), min(col_2), count(*)
FROM table
GROUP BY A, B
In Pandas it can be done with .groupby() and .agg():
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',
'sprite', 'coke', 'coke'],
B=['alpha','gamma', 'alpha', 'beta',
'gamma', 'beta', 'beta'],
col_1=[1,2,3,4,5,6,7],
col_2=[1,6,2,4,7,9,3]))
tbl = df.groupby(['A','B']).agg({'col_1': ['max', np.mean],
'col_2': ['sum','min','count']})
# 'count' always returns the number of rows in each group.
The result is a table where both the rows and the columns are multi-indexed. A quick solution to change it into a dataframe without multi-indices is to reset the index and rename the columns by hand, as in the sketch below.
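A minimal sketch, assuming the aggregation from the example above (the flattened column names are my own choices):
tbl = tbl.reset_index()  # move the (A, B) group keys back into regular columns
tbl.columns = ['A', 'B', 'col_1_max', 'col_1_mean',
               'col_2_sum', 'col_2_min', 'col_2_count']  # flatten the two column levels by hand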
If you would like to automate the column renaming, you can use tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the names in each level and combine them, as sketched below.
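Alternatively, starting from the multi-indexed tbl, one automated version (the underscore-joining convention is my assumption):
# combine the two levels, e.g. ('col_1', 'max') -> 'col_1_max'
tbl.columns = [lvl0 + '_' + lvl1 if lvl1 else lvl0
               for lvl0, lvl1 in zip(tbl.columns.get_level_values(0),
                                     tbl.columns.get_level_values(1))]
tbl = tbl.reset_index()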
4. Column slicing
Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.
df.iloc[:, 2:5].head()  # select the columns at positions 2-4 (zero-indexed)
df.loc[:, 'column_x':].head()  # select all columns starting from 'column_x'
5. Add row ID / random row ID to each group
To add a row ID / random row ID to each group (grouped by A and B), one can first append an ID / random ID to all rows:
import numpy as np

# df: target dataframe
np.random.seed(0)  # set the random seed for reproducibility
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = [i for i in range(1, df.shape[0] + 1)]
To turn these into a row ID / random row ID within each group (by A, B), one can then do the following.
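A minimal sketch, assuming the ID_all / random_ID_all columns created above; ranking the IDs within each group turns them into consecutive per-group IDs:
df['ID'] = df.groupby(['A', 'B'])['ID_all'].rank().astype(int)  # row ID within each group
df['random_ID'] = df.groupby(['A', 'B'])['random_ID_all'].rank().astype(int)  # random row ID, 1..group size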
6. Aggregate a column into a list of unique values
Sometimes after we perform a group by, we'd like to aggregate the values in the target column as a list of unique values instead of a max, min, etc. This is how it's done.
df = pd.DataFrame(dict(A=['A','A','A','A','A','B','B','B','B'],
                       B=[1,1,1,2,2,1,1,1,2],
                       C=['CA','NY','CA','FL','FL',
                          'WA','FL','NY','WA']))
tbl = df[['A', 'B', 'C']].drop_duplicates()\
                         .groupby(['A', 'B'])['C']\
                         .apply(list)\
                         .reset_index()
# turn each list into a single comma-separated string
tbl['C'] = tbl.apply(lambda x: ','.join(str(s) for s in x['C']), axis=1)
If you'd like to save the result, don't forget to use a field separator other than a comma, since the joined values contain commas themselves.
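For example (the file name and the '|' separator are arbitrary choices):
tbl.to_csv('unique_values.csv', sep='|', index=False)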
7. Add row total and column total to a numerical dataframe
This is another common data manipulation. All you need is .apply(), as in the sketch below.
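A minimal sketch with made-up numbers (the dataframe and the 'col_total' / 'row_total' labels are assumptions):
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3],
                       B=[4, 5, 6]))
df['col_total'] = df.apply(lambda x: x.sum(), axis=1)  # total across columns, per row
df.loc['row_total'] = df.apply(lambda x: x.sum())      # total down rows, per column (includes col_total)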
8. Check memory usage
`.memory_usage(deep=True)` can be used on a Pandas dataframe to see the amount of memory used (in bytes) by each column. It's useful when building machine learning models, which may require a lot of memory in training.
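For example, on any dataframe df:
df.memory_usage(deep=True)  # per-column memory usage in bytes; deep=True also counts the contents of object columns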
9. Cumulative sum
From time to time, a cumulative sum is required when you generate statistical outcomes. Simply call `.cumsum()` on the target column.
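A minimal sketch (the column names are arbitrary):
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
df['cumsum_A'] = df['A'].cumsum()  # running total: 1, 3, 6, 10, 15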