Have you ever complained about the table output looks boring when you do .head() in Jupyter notebooks? Is there a way not to display indexes (especially when there is already an ID column)? There’re ways to fix these issues.
import pandas as pd defcolor_negative_red(val): color ='red'if val <0else'black'return'color: %s'% colordf = pd.DataFrame(dict(col_1=[1.53,-2.5,3.53],col_2=[-4.1,5.9,0]))df.style.applymap(color_negative_red)
D. More CSS styles. You can use CSS to change the appearance of the table.
2. Pandas options
The reader may have experienced the following issues when using .head(n) to check the dataframe:
(1) There’re too many columns / rows in the dataframe and some columns / rows in the middle are omitted.
(2) Columns containing long texts get truncated.
(3) Columns containing floats display too many / too few digits.
One can set
to solve these issues.
3. Group by with multiple aggregations
In SQL we can do aggregations like
In Pandas it can be done with .groupby() and .agg():
And the result will look like this:
Both the rows and columns are multi-indexed. A quick solution to change it to a dataframe without multi-indices is
If you would like to have the column renaming process automated, you can do tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the indices in each level and combine them.
4. Column slicing
Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.
5. Add row ID / random row ID to each group
To add a row ID / random row ID for each group by A, B, one can first append an ID / random ID to all rows:
To add a random ID to each group (by A, B), one can then do
to get
6. List all unique values in a group
Sometimes after we performed group by, we’d like to aggregate the values in the target column as a list of unique values instead of max, min, …etc. This is how it’s done.
If you’d like to save the result, don’t forget to change the separator to anything other than commas.
7. Add row total and column total to a numerical dataframe
This is another common data manipulation. All you need is .apply().
8. Check memory usage
`.memory_usage(deep=True)` can be used on Pandas dataframes to see the amount of memory used (in bytes) for each column. It’s useful when building machine learning models which may require a lot memory in training.
9. Cumulative sum
From time to time, cumulative sum is required when you generate some statistical outcomes. Simply do `
10. Crosstab
When you need to count the frequencies for groups formed by 3+ features, pd.crosstab() can make your life easier.
Thanks for reading! Comment below if you find bugs / better solutions.
import pandas as pd
pd.options.display.max_columns = 50 # None -> No Restrictions
pd.options.display.max_rows = 200 # None -> Be careful with this
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3
SELECT A, B, max(A), avg(A), sum(B), min(B), count(*)
FROM table
GROUP BY A, B
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',
'sprite', 'coke', 'coke'],
B=['alpha','gamma', 'alpha', 'beta',
'gamma', 'beta', 'beta'],
col_1=[1,2,3,4,5,6,7],
col_2=[1,6,2,4,7,9,3]))
tbl = df.groupby(['A','B']).agg({'col_1': ['max', np.mean],
'col_2': ['sum','min','count']})
# 'count' will always be the count for number of rows in each group.
df.iloc[:,2:5].head() # select the 2nd to the 4th column
df.loc[:,'column_x':].head()
# select all columns starting from 'column_x'
import numpy as np
# df: target dataframe np.random.seed(0) # set random seed
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = [i for i in range(1, df.shape[0]+1)]