PD.CUT

 Goal: Create bins and count observations with PD.CUT


For example, I want to count and group when a variable falls between $0 and $1,000,000 and return Bin1, and if the same variable falls between $1,000,001 and $2,000,000, return bin2.  

Fortunately, Pandas has a handy little method called cut to determine if the value of a variable falls within the bin.  

There are two general ways I want to accomplish this goal:

  1. Hard code each bin size (or span - Ex. $0 to $1,000,000)
  2. statistically determine the bin population
  3. statistically determine the bin span
Next, I want to give some nice labels and then group the data by the new column. 

All of my code is in Google Colab for demonstration purposes. 

First, I want to import the data
import pandas as pd
import numpy as np
loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc) 

Second, I want to detail bin sizes and create labels. This is where the magic happens! We went to create bins and explicitly determined by hard coding values. My challenges with this process are the labels.  I wanted to dynamically find the max number and create the bin label, so I used str(m.budget.max())[0:2]. I could have put an "m" at the end of each level, but I renamed the index to "Bin in Millions."  That gives the reader the context for each bin. 

   (pd.cut(m.budget,bins=[0, 1000000,2000000,3000000,4000000,5000000,6000000,10000000,15000000,20000000,m.budget.max()], 
                             labels =['0-1','1-2','2-3','3-4','4-5','5-6','6-10','10-15','15-20',str(m.budget.max())[0:2]]) 
    .value_counts() 
    .reset_index() 
    .rename(columns={'index':'Bin in Millions', 'budget':'count'}) 
    .sort_values(by='count', ascending=False)
)

Determine bins span - each bin is the same width/span

Next, I wanted to investigate how to create bins of even sizes given the min and the max.  For example, if the variance of the bin equals 100, and I specify ten bins, there will be 10 bins, each 10 intervals wide.  Or, if the variance is 90, and I specify 10 bins, each bin will be 9 intervals wide.  All very useful tools! This one is so small, it is a one-liner! 


pd.cut(m.budget, bins=10,).value_counts().reset_index()

Determine bin population - each bin has the same number of movies

The next method is distributing the population equally over a set number of bins.  For example, there are 10,000 movies, and we want to ensure ten bins have roughly 1,000 movies. Pandas determines the min and max for each bin size to ensure each bin has roughly 1,000 movies.  We can accomplish this with quantile. Each bin increases by intervals of ten percent.  


(
    pd.cut(m.budget, bins=m.budget
           .quantile([.1,.2,.3,.4,.5,.6,.7,.8,.9]))
           .value_counts())
)

Now we have three different ways to create bins with Pandas - hard code, equidistant bins, and evenly distributed bins by sample population - Enjoy!

Comments

Popular posts from this blog

Recode with np.select

Python Topic Roadmap