Visualizing Pandas GroupBy object

March 15, 2017

I am a beginner again, this time learning Python and Pandas, and I am enjoying it quite a lot. To learn, I write code in a Jupyter notebook, and this post was itself written as a notebook and converted to HTML with nbconvert. The quality of the conversion is rather poor, but it is probably the best one can do without adding custom CSS to this blog, which would require upgrading to WordPress.com Premium.

Development in Jupyter is similar to how KDB+ coding is mostly done. In KDB+ one sends commands to a KDB+ server from a client like Studio for KDB+ and gets instant feedback on the result. Pandas is not as expressive and concise as q, but the style is similar: a high-level API for vectorized data manipulation that avoids explicit iteration (loops).

One exception to the instant feedback rule in Jupyter and Pandas is the GroupBy object. To see what I mean let’s define a simple data frame from a dictionary of columns:

In [1]:
import pandas as pd
data = pd.DataFrame({'sym':['a','b','c'],
                     'price1':[100.0,150.0,130.0],
                     'price2':[110.0,150.0,120.0],
                     'vol1':[1000.0,1200.0,1300.0],
                     'vol2':[1500.0,1300.0,1100.0]})
data
Out[1]:
   price1  price2  sym    vol1    vol2
0   100.0   110.0    a  1000.0  1500.0
1   150.0   150.0    b  1200.0  1300.0
2   130.0   120.0    c  1300.0  1100.0

Grouping is more often done for rows (along axis 0), but this time we want to group columns (along axis=1). One group is made of the price1 and price2 columns, the second one groups vol1 and vol2, and the sym column forms its own one-element group. To do this we define a function that takes a column name and classifies it into one of three categories:

In [2]:
def classifier(column):
    if column.startswith('price'): return 'price'
    if column.startswith('vol'): return 'volume'
    return 'sym'
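
Before grouping, it can help to check what the classifier returns for each column name. The following is only a small sanity-check sketch; the `columns` list and the `labels` dictionary are illustrative names that are not part of the original notebook.

```python
def classifier(column):
    if column.startswith('price'): return 'price'
    if column.startswith('vol'): return 'volume'
    return 'sym'

# Column names of the example data frame, typed by hand for illustration
columns = ['sym', 'price1', 'price2', 'vol1', 'vol2']

# Map every column name to the group the classifier assigns it to
labels = {column: classifier(column) for column in columns}
print(labels)
```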

Now we can group the columns using the classifier:

In [3]:
data.groupby(classifier,axis=1)
Out[3]:
<pandas.core.groupby.DataFrameGroupBy object at 0x00000048DE8A1CF8>

As we can see, the GroupBy object is not printed nicely (at least in Pandas 0.19.2, which I am using).
Of course, there are many ways to print it. One way that I found intuitive and useful is to first convert the GroupBy object to a dictionary of dataframes keyed by the classifier value. This can be done with a dictionary comprehension like {grp:subdf for grp,subdf in df.groupby(classifier,axis=1)}. The dictionary obtained this way can be passed to Pandas’ concat function, which puts the dataframes together into a single dataframe with multi-level columns. The additional column level clearly shows the structure of the original GroupBy object.

In [4]:
def groupCols(df,classifier):
    return pd.concat({grp:subdf for grp,subdf in df.groupby(classifier,axis=1)},
                     axis=1)

groupCols(data,classifier)
Out[4]:
    price          sym  volume
   price1  price2  sym    vol1    vol2
0   100.0   110.0    a  1000.0  1500.0
1   150.0   150.0    b  1200.0  1300.0
2   130.0   120.0    c  1300.0  1100.0
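
A note for readers on recent Pandas versions: as far as I know, the axis=1 argument of groupby has been deprecated in newer releases, so groupCols may warn or fail there. The sketch below is one possible alternative that does not call groupby at all. It classifies the column names directly and selects each group with a boolean mask. The helper name `group_cols_by_label` is made up for this example, and the approach is a substitute for the GroupBy-based one, not the same mechanism.

```python
import pandas as pd

data = pd.DataFrame({'sym': ['a', 'b', 'c'],
                     'price1': [100.0, 150.0, 130.0],
                     'price2': [110.0, 150.0, 120.0],
                     'vol1': [1000.0, 1200.0, 1300.0],
                     'vol2': [1500.0, 1300.0, 1100.0]})

def classifier(column):
    if column.startswith('price'): return 'price'
    if column.startswith('vol'): return 'volume'
    return 'sym'

def group_cols_by_label(df, classifier):
    # Hypothetical helper: classify every column name up front
    labels = [classifier(column) for column in df.columns]
    pieces = {}
    for grp in sorted(set(labels)):
        # Boolean mask selecting the columns that belong to this group
        mask = [label == grp for label in labels]
        pieces[grp] = df.loc[:, mask]
    # As in groupCols, the dictionary keys become the outer column level
    return pd.concat(pieces, axis=1)

grouped = group_cols_by_label(data, classifier)
print(grouped)
```

Because the columns are selected rather than transposed, each column keeps its original dtype.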

This trick also works for classifying rows if one uses axis=0 instead of axis=1 in a function similar to groupCols above.
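
A minimal sketch of that row version follows. The names `group_rows` and `parity` are invented for this example, and the data frame is a cut-down copy of the one above. When groupby is given a function and no axis, the function is called on each index label, so here the rows are classified by whether their index is even or odd.

```python
import pandas as pd

data = pd.DataFrame({'sym': ['a', 'b', 'c'],
                     'price1': [100.0, 150.0, 130.0]})

def parity(label):
    # Classify a row by its (integer) index label
    return 'even' if label % 2 == 0 else 'odd'

def group_rows(df, classifier):
    # Grouping rows is the default, so no axis argument is needed;
    # concat along axis=0 stacks the groups and adds an outer index level
    return pd.concat({grp: subdf for grp, subdf in df.groupby(classifier)},
                     axis=0)

grouped = group_rows(data, parity)
print(grouped)
```

The result has a two-level row index, with the group name as the outer level and the original index as the inner one.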