Filtering Data With Masking
Goal: Filter DataFrames
Filtering data is part of nearly every data analysis project. Most blog posts I see only explain how to filter a DataFrame by its row index. I rarely need to filter on the row index, and I don't want to reset the index for every filter. For example, say we only want the rows where the budget is over $10,000,000 and the director's name is James Cameron. That is a very specific example, but you get the idea.
There are two general steps (example on Google Colab):
- Define the filters/mask
- Reference the mask(s) between []
I always made the filtering process more difficult than it needed to be. After looking, and looking, and looking for ways to filter data in pandas, I found a method that meets my expectations: it must be easy to remember, discuss, and explain to non-programmers. Also, if another person is not all that familiar with Python, they can still update the filters as needed, add new ones, and continue on with the analysis.
First, we import our data:
import pandas as pd
import numpy as np
loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
Then, we create objects detailing each filter to apply. Below are four masks/filters.
mask1 = (m.budget > m.budget.median())
mask2 = (m.director_name == 'James Cameron')
mask3 = (m.num_critic_for_reviews > 240)
mask4 = (m.color == 'Color')
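Each mask is just a boolean Series with one True/False per row, so it can be checked before it is applied. Here is a minimal sketch, assuming a small made-up DataFrame (the `toy` rows and the `toy_mask` name are illustrative and not taken from movie.csv):

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the ones used above.
toy = pd.DataFrame({
    "director_name": ["James Cameron", "James Cameron", "Someone Else"],
    "budget": [200_000_000, 6_400_000, 50_000_000],
})

# A mask is a Series of True/False values, one per row.
toy_mask = (toy.budget > 10_000_000)

print(toy_mask.dtype)  # the mask holds booleans
print(toy_mask.sum())  # True counts as 1, so this is the number of rows that pass
```

Checking the count this way is a quick sanity test: if a mask passes zero rows, the filter value is probably misspelled or too strict.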
Then, we simply apply the masks/filters by placing them between the square brackets, joined with an ampersand (&). The ampersand means "and", so a row is kept only when every mask is True for it.
m[mask1 & mask2 & mask3 & mask4]
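The same pattern works with "or" and "not" as well. The sketch below assumes a small made-up DataFrame (the `toy` rows and mask names are illustrative, not from movie.csv) so it runs without downloading anything:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "director_name": ["James Cameron", "James Cameron", "Someone Else", "Someone Else"],
    "budget": [200_000_000, 6_400_000, 50_000_000, 1_000_000],
    "color": ["Color", "Color", "Black and White", "Color"],
})

big_budget = (toy.budget > 10_000_000)
cameron = (toy.director_name == "James Cameron")
in_color = (toy.color == "Color")

both = toy[big_budget & cameron]    # AND: every condition must hold
either = toy[big_budget | cameron]  # OR: at least one condition holds
not_color = toy[~in_color]          # NOT: flip a mask

# The original row labels are kept, so there is no need to reset the index.
print(both.index.tolist())
print(either.index.tolist())
print(not_color.index.tolist())
```

Two things to keep in mind: Python's plain `and`/`or` keywords raise an error on a Series, so use `&`, `|`, and `~` instead; and keep the parentheses around each comparison, because `&` and `|` bind more tightly than `>` or `==`.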
Bam, there we go, filtered data!
This method is easy for an average person to read, review, and modify. I can send the filters, via email, to an end user, and they can modify the code to meet the needs of the project. It is that easy!
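If the list of filters gets long, one optional variation is to keep the masks in a dictionary with plain-English names and combine them in one step. This is only a sketch under my own assumptions: the `toy` data and the `filters` dictionary are made up for illustration, and `functools.reduce` with `operator.and_` is just one way to chain `&` across all of them.

```python
from functools import reduce
import operator

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "director_name": ["James Cameron", "James Cameron", "Someone Else"],
    "budget": [200_000_000, 6_400_000, 50_000_000],
})

# Hypothetical named filters; add, remove, or edit entries as the project needs.
filters = {
    "big_budget": toy.budget > 10_000_000,
    "cameron": toy.director_name == "James Cameron",
}

# Equivalent to filters["big_budget"] & filters["cameron"] & ...
combined = reduce(operator.and_, filters.values())
result = toy[combined]

print(result)
```

The named keys make it easier to talk about a filter in an email ("drop the big_budget one"), at the cost of one extra line to combine them.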