This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to add to this documentation.
Adding interesting links and/or inline examples to this section is a great First Pull Request.
Simplified, condensed, new-user friendly, in-line examples have been inserted where possible to augment the Stack-Overflow and GitHub links. Many of the links contain expanded information, above what the in-line examples offer.
pandas (pd) and NumPy (np) are the only two abbreviated imported modules. The rest are kept explicitly imported for newer users.
These examples are written for Python 3. Minor tweaks might be necessary for earlier Python versions.
In [64]: df = pd.DataFrame({'row': [0, 1, 2],
   ....:                    'One_X': [1.1, 1.1, 1.1],
   ....:                    'One_Y': [1.2, 1.2, 1.2],
   ....:                    'Two_X': [1.11, 1.11, 1.11],
   ....:                    'Two_Y': [1.22, 1.22, 1.22]})
   ....:

In [65]: df
Out[65]:
   row  One_X  One_Y  Two_X  Two_Y
0    0    1.1    1.2   1.11   1.22
1    1    1.1    1.2   1.11   1.22
2    2    1.1    1.2   1.11   1.22

# As Labelled Index
In [66]: df = df.set_index('row')

In [67]: df
Out[67]:
     One_X  One_Y  Two_X  Two_Y
row
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# With Hierarchical Columns
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_'))
   ....:                                         for c in df.columns])
   ....:

In [69]: df
Out[69]:
     One        Two
       X    Y     X     Y
row
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Now stack & Reset
In [70]: df = df.stack(0).reset_index(1)

In [71]: df
Out[71]:
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# And fix the labels (Notice the label 'level_1' got added automatically)
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']

In [73]: df
Out[73]:
    Sample  All_X  All_Y
row
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22
In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C']
   ....:                                   for y in ['O', 'I']])
   ....:

In [75]: df = pd.DataFrame(np.random.randn(2, 6), index=['n', 'm'], columns=cols)

In [76]: df
Out[76]:
          A                   B                   C
          O         I         O         I         O         I
n  0.469112 -0.282863 -1.509059 -1.135632  1.212112 -0.173215
m  0.119209 -1.044236 -0.861849 -2.104569 -0.494929  1.071804

In [77]: df = df.div(df['C'], level=1)

In [78]: df
Out[78]:
          A                   B              C
          O         I         O         I    O    I
n  0.387021  1.633022 -1.244983  6.556214  1.0  1.0
m -0.240860 -0.974279  1.741358 -1.963577  1.0  1.0
In [85]: import itertools

In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
   ....:                                ['Comp', 'Math', 'Sci']))
   ....:

In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))

In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])

In [89]: cols = pd.MultiIndex.from_tuples(headr)  # Notice these are un-named

In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]

In [91]: df = pd.DataFrame(data, indx, cols)

In [92]: df
Out[92]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81

In [93]: All = slice(None)

In [94]: df.loc['Violet']
Out[94]:
       Exams     Labs
           I  II    I  II
Course
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81

In [95]: df.loc[(All, 'Math'), All]
Out[95]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80

In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
Out[96]:
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77

In [97]: df.loc[(All, 'Math'), ('Exams')]
Out[97]:
                 I  II
Student Course
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79

In [98]: df.loc[(All, 'Math'), (All, 'II')]
Out[98]:
               Exams Labs
                  II   II
Student Course
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80
In [99]: df.sort_values(by=('Labs', 'II'), ascending=False)
Out[99]:
               Exams     Labs
                   I  II    I  II
Student Course
Violet  Sci       78  81   81  81
        Math      77  79   81  80
        Comp      76  77   78  79
Quinn   Sci       75  78   78  78
        Math      74  76   78  77
        Comp      73  74   75  76
Ada     Sci       72  75   75  75
        Math      71  73   75  74
        Comp      70  71   72  73
Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to all the columns
In [104]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
   .....:                    'size': list('SSMMMLL'),
   .....:                    'weight': [8, 10, 11, 1, 20, 12, 12],
   .....:                    'adult': [False] * 5 + [True] * 2})
   .....:

In [105]: df
Out[105]:
  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

# List the size of the animals with the highest weight.
In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[106]:
animal
cat     L
dog     M
fish    M
dtype: object
In [107]: gb = df.groupby(['animal'])

In [108]: gb.get_group('cat')
Out[108]:
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True
In [131]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
   .....:                    'Value': [100, 150, 50, 50]})
   .....:

In [132]: df
Out[132]:
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [133]: df['Counts'] = df.groupby(['Color']).transform(len)

In [134]: df
Out[134]:
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1
Create a list of DataFrames, split using a delineation based on logic included in rows.
In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A',
   .....:                                   'A'],
   .....:                         'Data': np.random.randn(9)})
   .....:

In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
   .....:                            .rolling(window=3, min_periods=1).median())))[-1]
   .....:

In [148]: dfs[0]
Out[148]:
  Case      Data
0    A  0.276232
1    A -1.087401
2    A -0.673690
3    B  0.113648

In [149]: dfs[1]
Out[149]:
  Case      Data
4    A -1.478427
5    A  0.524988
6    B  0.404705

In [150]: dfs[2]
Out[150]:
  Case      Data
7    A  0.577046
8    A -1.715002
In [159]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
   .....:                         'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
   .....:                   index=['I', 'II', 'III'])
   .....:

In [160]: def SeriesFromSubList(aList):
   .....:     return pd.Series(aList)
   .....:

In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
   .....:                      for ind, row in df.iterrows()})
   .....:

In [162]: df_orgz
Out[162]:
         0    1    2     3
I   A    2    4    8  16.0
    B    a    b    c   NaN
II  A  100  200  NaN   NaN
    B   jj   kk  NaN   NaN
III A   10   20   30   NaN
    B  ccc  NaN  NaN   NaN
Depending on df construction, ignore_index may be needed
In [180]: df = df1.append(df2, ignore_index=True)

In [181]: df
Out[181]:
            A         B         C
0   -0.870117 -0.479265 -0.790855
1    0.144817  1.726395 -0.464535
2   -0.821906  1.597605  0.187307
3   -0.128342 -1.511638 -0.289858
4    0.399194 -1.430030 -0.639760
5    1.115116 -2.012600  1.810662
6   -0.870117 -0.479265 -0.790855
7    0.144817  1.726395 -0.464535
8   -0.821906  1.597605  0.187307
9   -0.128342 -1.511638 -0.289858
10   0.399194 -1.430030 -0.639760
11   1.115116 -2.012600  1.810662
Reading a file that is compressed but not by gzip/bz2 (the native compressed formats which read_csv understands). This example shows a WinZipped file, but it is a general application of opening the file within a context manager and using that handle to read. See here
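A minimal sketch of that pattern using the standard library's zipfile module; the archive name 'file.zip' and member name 'file.csv' are placeholders for illustration:

import zipfile

import pandas as pd

# Open the archive in a context manager and pass the member's file handle
# to read_csv. The file names here are hypothetical.
with zipfile.ZipFile('file.zip') as zf:
    with zf.open('file.csv') as handle:
        df = pd.read_csv(handle)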
Reading multiple files to create a single DataFrame
The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all of the individual frames into a list, and then combine the frames in the list using pd.concat():
In [189]: for i in range(3):
   .....:     data = pd.DataFrame(np.random.randn(10, 4))
   .....:     data.to_csv('file_{}.csv'.format(i))
   .....:

In [190]: files = ['file_0.csv', 'file_1.csv', 'file_2.csv']

In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
You can use the same approach to read all files matching a pattern. Here is an example using glob:
In [192]: import glob

In [193]: import os

In [194]: files = glob.glob('file_*.csv')

In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
Finally, this strategy will work with the other pd.read_*(...) functions described in the io docs.
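For instance, a sketch of the same list-comprehension-plus-concat pattern with pd.read_excel (the file names here are hypothetical):

files = ['file_0.xlsx', 'file_1.xlsx']
result = pd.concat([pd.read_excel(f) for f in files], ignore_index=True)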
Parsing date components in multi-columns
Parsing date components in multi-columns is faster with a format
In [196]: i = pd.date_range('20000101', periods=10000)

In [197]: df = pd.DataFrame({'year': i.year, 'month': i.month, 'day': i.day})

In [198]: df.head()
Out[198]:
   year  month  day
0  2000      1    1
1  2000      1    2
2  2000      1    3
3  2000      1    4
4  2000      1    5

In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
   .....: ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'],
   .....:                                           x['month'], x['day']), axis=1)
   .....: ds.head()
   .....: %timeit pd.to_datetime(ds)
   .....:
9.66 ms +- 154 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
2.85 ms +- 98 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
De-duplicating a large store by chunks, essentially a recursive reduction operation. Shows a function for taking in data from a csv file and creating a store by chunks, with date parsing as well. See here
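A rough sketch of the chunked-store part of that idea (not the full de-duplication recipe); the file names, the store key 'df', and the 'date' column are hypothetical placeholders:

import pandas as pd

# Build an HDFStore incrementally from a large CSV, parsing dates as we go.
store = pd.HDFStore('chunked.h5')
for chunk in pd.read_csv('data.csv', parse_dates=['date'], chunksize=10000):
    store.append('df', chunk, data_columns=['date'])
store.close()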
In [206]: df = pd.DataFrame(np.random.randn(8, 3))

In [207]: store = pd.HDFStore('test.h5')

In [208]: store.put('df', df)

# you can store an arbitrary Python object via pickle
In [209]: store.get_storer('df').attrs.my_attribute = {'A': 10}

In [210]: store.get_storer('df').attrs.my_attribute
Out[210]: {'A': 10}
Binary files
pandas readily accepts NumPy record arrays, if you need to read in a binary file consisting of an array of C structs. For example, given this C program in a file called main.c compiled with gcc main.c -std=gnu99 on a 64-bit machine,
#include <stdio.h>
#include <stdint.h>

typedef struct _Data
{
    int32_t count;
    double avg;
    float scale;
} Data;

int main(int argc, const char *argv[])
{
    size_t n = 10;
    Data d[n];

    for (int i = 0; i < n; ++i)
    {
        d[i].count = i;
        d[i].avg = i + 1.0;
        d[i].scale = (float) i + 2.0f;
    }

    FILE *file = fopen("binary.dat", "wb");
    fwrite(&d, sizeof(Data), n, file);
    fclose(file);

    return 0;
}
the following Python code will read the binary file 'binary.dat' into a pandas DataFrame, where each element of the struct corresponds to a column in the frame:
names = 'count', 'avg', 'scale'

# note that the offsets are larger than the size of the type because of
# struct padding
offsets = 0, 8, 16
formats = 'i4', 'f8', 'f4'

dt = np.dtype({'names': names, 'offsets': offsets, 'formats': formats},
              align=True)

df = pd.DataFrame(np.fromfile('binary.dat', dt))
Note
The offsets of the structure elements may be different depending on the architecture of the machine on which the file was created. Using a raw binary file format like this for general data storage is not recommended, as it is not cross platform. We recommend either HDF5 or parquet, both of which are supported by pandas' IO facilities.
Often it's useful to obtain the lower (or upper) triangular form of a correlation matrix calculated from DataFrame.corr(). This can be achieved by passing a boolean mask to where as follows:
In [211]: df = pd.DataFrame(np.random.random(size=(100, 5)))

In [212]: corr_mat = df.corr()

In [213]: mask = np.tril(np.ones_like(corr_mat, dtype=bool), k=-1)

In [214]: corr_mat.where(mask)
Out[214]:
          0         1         2         3   4
0       NaN       NaN       NaN       NaN NaN
1 -0.018923       NaN       NaN       NaN NaN
2 -0.076296 -0.012464       NaN       NaN NaN
3 -0.169941 -0.289416  0.076462       NaN NaN
4  0.064326  0.018759 -0.084140 -0.079859 NaN
The method argument within DataFrame.corr can accept a callable in addition to the named correlation types. Here we compute the distance correlation matrix for a DataFrame object.
In [215]: def distcorr(x, y):
   .....:     n = len(x)
   .....:     a = np.zeros(shape=(n, n))
   .....:     b = np.zeros(shape=(n, n))
   .....:     for i in range(n):
   .....:         for j in range(i + 1, n):
   .....:             a[i, j] = abs(x[i] - x[j])
   .....:             b[i, j] = abs(y[i] - y[j])
   .....:     a += a.T
   .....:     b += b.T
   .....:     a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
   .....:     b_bar = np.vstack([np.nanmean(b, axis=0)] * n)
   .....:     A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
   .....:     B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())
   .....:     cov_ab = np.sqrt(np.nansum(A * B)) / n
   .....:     std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
   .....:     std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)
   .....:     return cov_ab / std_a / std_b
   .....:

In [216]: df = pd.DataFrame(np.random.normal(size=(100, 3)))

In [217]: df.corr(method=distcorr)
Out[217]:
          0         1         2
0  1.000000  0.199653  0.214871
1  0.199653  1.000000  0.195116
2  0.214871  0.195116  1.000000
To create a DataFrame from every combination of some given values, like R's expand.grid() function, we can create a dict where the keys are column names and the values are lists of the data values:
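A minimal sketch of that approach, assuming a helper named expand_grid; the function name and the example column values are illustrative:

import itertools

import pandas as pd

def expand_grid(data_dict):
    # Cartesian product of the value lists; one column per dict key
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

df = expand_grid({'height': [60, 70],
                  'weight': [100, 140, 180],
                  'sex': ['Male', 'Female']})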