Fun Function Friday!!!
My Random Function: Count and Percent
The other day I was working on a project where I simply needed to count records and show each count as a percent of the total. First I started with the data.
The Data:
import pandas as pd
from itertools import combinations
data = {
    'id': list(range(1, 21)),
    'course': ['Math', 'Math', 'Bio', 'Chem', 'Bio', 'Math', 'Chem', 'Bio', 'Chem', 'Math', 'Bio', 'Chem', 'Math', 'Math', 'Bio', 'Chem', 'Bio', 'Math', 'Chem', 'Math'],
    'building': ['A', 'B', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
    'room': [101, 102, 101, 102, 103, 101, 104, 103, 102, 101, 105, 106, 107, 108, 109, 110, 111, 105, 106, 107]
}
df = pd.DataFrame(data)
Then I started down the path of aggregating.
df.groupby('course')['id'].count()
df_course_building = df.groupby(['course', 'building'])['id'].count().reset_index()
df_course_building_room = df.groupby(['course', 'building', 'room'])['id'].count().reset_index()
print(df_course_building)
print(df_course_building_room)
Lightbulb moment!
Very quickly, I realized I wanted to aggregate over every possible combination of 'course', 'building', and 'room'. I was working through all the groupings diligently, one by one, but I said there must be a better way!
Well, there is: the itertools module in Python's standard library. The nice thing about its combinations function is that it will generate every combination of a given length from a list. It took a while to get this function created, but here it is.
Final product
from itertools import combinations
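def automated_grouping(df, columns):
    # Try every subset size, from one column up to all of them
    for r in range(1, len(columns) + 1):
        for subset in combinations(columns, r):
            subset = list(subset)  # groupby expects a list of keys
            group = df.groupby(subset)['id'].agg(['count'])
            # Each count as a percentage of the overall total
            group['percent'] = group['count'] / group['count'].sum() * 100
            report_name = "r_" + "_".join(subset)
            # Store each report as a global variable, e.g. r_course_building
            globals()[report_name] = group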
cols = ['course', 'building', 'room']
automated_grouping(df, cols)
I had to dig this apart one line at a time...
The combinations function is part of the itertools module in Python, and it is used to generate all possible combinations of a given iterable (like a list or a string) for a given length. It returns an iterator that produces tuples containing the individual combinations.
For example, let's say you have a list of three elements and you want to find all the combinations of length 2:
from itertools import combinations
elements = ['a', 'b', 'c']
comb = combinations(elements, 2)
print(list(comb)) # Output: [('a', 'b'), ('a', 'c'), ('b', 'c')]
Here's what's happening:
- When r is 2, the function returns all possible pairs from the iterable.
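- The combinations are emitted in lexicographic order according to the order of the input iterable, so if the input is sorted, the output tuples come out sorted too.
- Order within a combination does not matter, so each selection appears only once. In the above example, you see ('a', 'b'), but not ('b', 'a'). Elements are treated as unique by position, not by value, so repeated values only show up if the input itself contains duplicates.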
This can be extremely useful in various mathematical, statistical, and data analysis scenarios where you must explore different combinations of elements from a given set.
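One nuance that is easy to miss: combinations treats elements as distinct by their position, not their value, and it follows the order of the input rather than sorting it. A small sketch with a made-up input list that contains a duplicate:

```python
from itertools import combinations

# 'a' appears twice; each occurrence counts as its own element.
pairs = list(combinations(['b', 'a', 'a'], 2))
print(pairs)  # [('b', 'a'), ('b', 'a'), ('a', 'a')]
```

For the column names used in this post that never matters, since every column name is unique.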
The line for r in range(1, len(columns) + 1):
matters in the function above because it generates combinations of every possible length, from 1 up to the number of elements in the columns list. Without the + 1, the full three-column grouping would be skipped.
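To see what the two nested loops actually walk through, here is a minimal sketch that only collects and prints the subsets, without touching any DataFrame. The list name subsets is just for illustration.

```python
from itertools import combinations

columns = ['course', 'building', 'room']
subsets = []
for r in range(1, len(columns) + 1):
    for subset in combinations(columns, r):
        subsets.append(subset)
        print(r, subset)

# 3 single columns + 3 pairs + 1 triple
print(len(subsets))  # 7
```

Three columns give seven groupings, which is exactly the set I had started writing out by hand.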
subset = list(subset)
converts the tuple returned by the combinations function into a list. This matters because pandas groupby reads a list as several grouping keys, while a tuple is interpreted as a single key.
The line that begins with group =
uses the pandas groupby method to group the DataFrame by the columns listed in subset, then counts the non-null 'id' values in each group. Passing ['count'] as a list to agg returns a DataFrame with a single column named 'count', which is what the next line builds on.
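Here is that one line in isolation, as a small sketch on a made-up four-row frame (the name toy and its values are mine, not from the project):

```python
import pandas as pd

# Tiny illustrative frame, not the data from this post
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'course': ['Math', 'Math', 'Bio', 'Bio'],
    'building': ['A', 'A', 'A', 'B'],
})

subset = ['course', 'building']
group = toy.groupby(subset)['id'].agg(['count'])

print(group)                # one row per (course, building) pair
print(list(group.columns))  # ['count']
```

The grouping columns end up in the index, and the only regular column is 'count'.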
The line that assigns a new 'percent' column to the "group" DataFrame divides each count by the sum of all counts in that report and multiplies by 100. This puts the raw counts into perspective, and the percentages within each report add up to 100.
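As a quick check of the arithmetic, here is a sketch using just the 'course' column from the data above (df_small is an illustrative name):

```python
import pandas as pd

courses = ['Math', 'Math', 'Bio', 'Chem', 'Bio', 'Math', 'Chem', 'Bio', 'Chem', 'Math',
           'Bio', 'Chem', 'Math', 'Math', 'Bio', 'Chem', 'Bio', 'Math', 'Chem', 'Math']
df_small = pd.DataFrame({'id': list(range(1, 21)), 'course': courses})

group = df_small.groupby('course')['id'].agg(['count'])
group['percent'] = group['count'] / group['count'].sum() * 100

# Math is 8 of 20 rows (40%); Bio and Chem are 6 of 20 each (30%)
print(group)
```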
The line that constructs the "report_name" variable builds a string by joining the column names in the current combination with underscores and adding an "r_" prefix. For ['course', 'building'] that gives r_course_building, so each report gets a meaningful and unique name.
Finally, the line that uses globals()[report_name] = group
stores the "group" DataFrame in the module's global namespace under the name built in "report_name". After the function runs, variables such as r_course, r_course_building, and r_course_building_room exist and can be inspected one at a time.
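If writing into globals() feels too magical, one possible variation is to return the reports in a dictionary keyed by the same names. This is a sketch, not what I used in the project: the function name grouping_reports, the id_col parameter, and the toy frame are all hypothetical.

```python
from itertools import combinations
import pandas as pd

def grouping_reports(df, columns, id_col='id'):
    """Hypothetical variant: same loop, but collect reports in a dict."""
    reports = {}
    for r in range(1, len(columns) + 1):
        for subset in combinations(columns, r):
            subset = list(subset)
            group = df.groupby(subset)[id_col].agg(['count'])
            group['percent'] = group['count'] / group['count'].sum() * 100
            reports["r_" + "_".join(subset)] = group
    return reports

# Tiny illustrative frame, not the data from this post
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'course': ['Math', 'Math', 'Bio', 'Bio'],
    'building': ['A', 'A', 'A', 'B'],
})

reports = grouping_reports(toy, ['course', 'building'])
print(sorted(reports))  # ['r_building', 'r_course', 'r_course_building']
print(reports['r_course'])
```

The trade-off is that you type reports['r_course'] instead of r_course, but nothing in the global namespace gets overwritten by accident.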
In summary, this function aggregates the data by every combination of a given set of columns: for n columns it produces 2^n - 1 reports, so the three columns here yield seven. By combining the pandas groupby method with itertools.combinations, it replaces a pile of hand-written groupby calls with one loop, and each report shows both the count and the percent of the total.