Pandas Groupby and Agg

June 23, 2023

Understanding Groupby and Agg in Pandas for Group-wise Analysis

Hello, data enthusiasts! Today, we are delving into one of the most useful functionalities in Pandas: the groupby and agg methods. We will particularly look into counting IDs and calculating the percent of total for various groupings in our data.

Grouping in Pandas

Pandas' group_by method is highly powerful. It allows us to split our data into separate groups to perform computations for better analysis.

Let's consider a DataFrame 'df' with columns: 'id', 'course', 'building', and 'room'.

import pandas as pd

# Suppose df is your DataFrame
print(df.head())

You might see something like this:

|   | id | course | building | room |
|---|----|--------|----------|------|
| 0 |  1 | Math   |    A     |  101 |
| 1 |  2 | Math   |    B     |  102 |
| 2 |  3 | Bio    |    A     |  101 |
| 3 |  4 | Chem   |    B     |  102 |
| 4 |  5 | Bio    |    B     |  103 |

Now, say we want to find out the count of IDs for each course. Here's how we would do it:

df.groupby('course')['id'].count()

Now, suppose we want to do this by course and by building:

df.groupby(['course', 'building'])['id'].count()

And similarly, by course, building, and room:

df.groupby(['course', 'building', 'room'])['id'].count()

Aggregation in Pandas

The agg function in pandas provides the flexibility to perform multiple aggregations at once. We can also use it to perform different aggregations on different columns.

For instance, if you want to find out the count and percentage of each course:

course_group = df.groupby('course')['id'].agg(['count'])
course_group['percent'] = course_group['count'] / course_group['count'].sum() * 100

You can extend this concept to multiple levels as we did with the group_by method.

Embracing Automation with a Twist of Mirth

Alright folks, put on your humor hats and hold on tight, because it's time to dive into the wild, whimsical world of automation! Specifically, we're looking at Exploratory Data Analysis (EDA), where our motto is, "Show us every nook, cranny, twist, and turn of those tantalizing variables!"

Now you might be thinking, "Doesn't this sound like we're inviting chaos for a house party?" And you'd be right... if we didn't have our trusty sidekick, the 'iterate-over-columns' method, to bring some semblance of order!

You see, when you're attempting to view every possible iteration or combination of variables by grouping, things can escalate quicker than a rabbit's birthday party (did you know they reach maturity at around 4 months? Talk about a fast-paced life!). So, to avoid turning your analysis into a veritable bunny explosion, it's wise to specifically detail the columns over which you want to iterate.

This approach is like finding a comfortable middle ground between an all-out automation frenzy and the cozy familiarity of the hand-held approach. Think of it as the Goldilocks of data analysis techniques - not too much, not too little, but just right. This way, you get to enjoy the party without worrying about those pesky, uninvited variables crashing the scene!

This function generates a list of DataFrames, each containing a grouping and the corresponding ID count and percentage.

Automating the Grouping Process

But what if we have a large number of grouping variables and we want a report for every possible combination? The thought of doing this manually can be overwhelming. But, don't worry! Python has us covered. Let's define a function to automate this process:


from itertools import combinations

def automated_grouping(df, columns):
    reports = []
    for r in range(1, len(columns) + 1):
        for subset in combinations(columns, r):
            subset = list(subset)
            group = df.groupby(subset)['id'].agg(['count'])
            group['percent'] = group['count'] / group['count'].sum() * 100
            reports.append(group)
    return reports


# Columns to consider for grouping
cols = ['course', 'building', 'room']

# Generate reports
reports = automated_grouping(df, cols)

# Print each report
for report in reports:
    print(report)
    print


from itertools import combinations

def automated_grouping(df, columns):
    for r in range(1, len(columns) + 1):
        for subset in combinations(columns, r):
            subset = list(subset)
            group = df.groupby(subset)['id'].agg(['count'])
            group['percent'] = group['count'] / group['count'].sum() * 100
            report_name = "r_" + "_".join(subset)
            globals()[report_name] = group

cols = ['course', 'building', 'room']

automated_grouping(df, cols)

See the code in action at Google Colab.

Search This Blog

Data Analytics With Python