Filtering Data With Masking


Goal: Filter DataFrames

Filtering data is required in nearly every data analysis project. Most of the blog posts I see only explain how to filter a DataFrame by its row index. I rarely need to filter on the row index, and I don't want to reset the index for every filter. For example, suppose we only want the rows where the budget is over $10,000,000 and the director's name is James Cameron. That is a very specific example, but you get the idea.

There are two general steps (example on Google Colab):
  1. Define the filters/masks
  2. Reference the mask(s) between []

I always made the filtering process more difficult than it needed to be. After looking, and looking, and looking for ways to filter data in pandas, I found a method that meets my expectations: it must be easy to remember, discuss, and explain to non-programmers. Also, another person who is not all that familiar with Python can update the filters as needed, add new ones, and continue on with the analysis.
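
To make the two steps concrete before touching real data, here is a minimal sketch on a tiny, made-up DataFrame. The titles, directors, and budgets are placeholder values invented for illustration; only the column names `director_name` and `budget` come from the masks used later in this post.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
movies = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C", "Film D"],
    "director_name": ["James Cameron", "James Cameron", "Someone Else", "James Cameron"],
    "budget": [200_000_000, 5_000_000, 50_000_000, 15_000_000],
})

# Step 1: define the masks. Each one is a boolean Series, one value per row.
big_budget = movies.budget > 10_000_000
by_cameron = movies.director_name == "James Cameron"

# Step 2: reference the masks between [] to keep rows where both are True.
result = movies[big_budget & by_cameron]
print(result)
```

Note that the row index is never touched: the surviving rows keep their original index labels.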

First, we import our data:
import pandas as pd
import numpy as np
loc = ''  # path or URL of the CSV file to load
m = pd.read_csv(loc)  # m is the DataFrame the masks below refer to

Then, we create objects detailing each filter to apply.  Below are four masks/filters.

mask1 = (m.budget > m.budget.median())
mask2 = (m.director_name == 'James Cameron')
mask3 = (m.num_critic_for_reviews > 240)
mask4 = (m.color == 'Color')
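
If you are curious what a mask actually is, the sketch below builds one on a small stand-in DataFrame (the numbers are invented for illustration) and inspects it. A mask is simply a boolean Series, so summing it counts how many rows pass the filter, which is a quick sanity check before applying it.

```python
import pandas as pd

# Stand-in data with invented budgets; the median of these values is 250.
demo = pd.DataFrame({"budget": [100, 300, 200, 500]})

# Same shape as mask1 above: compare each row against the column median.
mask = demo.budget > demo.budget.median()

print(mask.tolist())  # one True/False per row
print(mask.sum())     # number of rows that pass the filter
```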

Then, we simply apply the masks/filters by placing them between the square brackets, joined with an ampersand (&), so only the rows where every mask is True are kept.

m[mask1 & mask2 & mask3 & mask4]
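
For anyone modifying the filters, it may help to see the other ways masks combine. The sketch below uses a small invented DataFrame (not the movie data) to show `&` for "and", `|` for "or", and `~` for "not". It also shows that Python's plain `and` keyword does not work on masks: pandas raises a `ValueError` because a whole Series has no single truth value.

```python
import pandas as pd

# Invented rows, for illustration only.
demo = pd.DataFrame({
    "budget": [5, 50, 500],
    "color": ["Color", "Black and White", "Color"],
})

mask_a = demo.budget > 10
mask_b = demo.color == "Color"

both = demo[mask_a & mask_b]    # rows passing both masks
either = demo[mask_a | mask_b]  # rows passing at least one mask
not_color = demo[~mask_b]       # rows that fail mask_b

# The `and` keyword asks for one True/False answer from a whole Series,
# which pandas refuses to guess at.
try:
    demo[mask_a and mask_b]
except ValueError as err:
    and_error = err
    print("Use & instead of and:", err)
```

Keeping each condition wrapped in parentheses, as in the masks above, also avoids operator-precedence surprises if a comparison is written inline next to `&`.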

Bam, there we go, filtered data!

This method is easy for an average person to read, review, and modify.  I can send the filters, via email, to an end user, and they can modify the code to meet the needs of the project.  It is that easy!

