Recode with np.select



Goal: Create a new column based on data from another column. Use Pandas.DataFrame and Numpy.select to create the new column. You can see the full program at Google Colaboratory









There are three basic steps to accomplish the goal.  

  1. Define the conditions
  2. Define the values
  3. Use np.select to apply the conditions and values.
First, create a list containing each "condition," then create a list containing each return value.  
The goal is to correlate each condition with each value.  What does that mean?  It means if the first condition in the "conditions" list evaluates as True, return the first value in the "values" list.  

An error will occur when the lists have different index counts.  For example, if there are five conditions and 4 values.    

This condition example below has two useful applications we would be remiss if we didn't take note:  

The first is the use of .isnull() == True.
The second is the use of .mean(). 

The .isnull() == True tells python if the value in the pd.DataFrame is NaN replaced with "Missing." 

On my first attempt, I did the following:

mm = m['budget'].mean()

this method did not comply with my chain everything goal.  So, I infused the .mean()method into the condition.  Also, mm was marginally slower when compared using %%timeit.

So what does all this do in human words? 

  • If m.budget is less than the mean calculations, return "Less than the mean."  
  • If m.budget is greater than the mean calculation, return "Greater than the mean."
  • if m.budget is NaN, retung "Missing."
Quick note, "m" represents the DataFrame, and budget represents a NumPy series - Int64.  
condition = (m['budget'].isnull() == True), 
	(m['budget'] < m['budget'].mean()), 
	(m['budget'] >= m['budget'].mean()) ]
values = ['Missing',
	'Less than the mean',
    'Greater than the mean']

Now, we want to put it all together with NumPy.select method.  Below is a chained code chuck. It connects methods back-to-back, like a chain, so the program does not require intermediate DataFrames between each method, transformation, or data processing step.

Number 2 (e.g., #2) is where we call the "condition" and apply the "values" when the condition is true.  For example, when the first condition is true (e.g., the value of budget is NaN), then the new variable "budgetGroup" contains "Missing," and when the budget is less than the mean, the value returned is "Less than the mean."

(m #1
    .assign(budgetGroup = np.select(condition,values)) #2
    .assign(The_Mean_Is = m['budget'].mean()) #3
    .groupby(['budgetGroup','The_Mean_Is']) #4
    .agg({'budgetGroup':['count'], #5
        'budget':['min','max','mean']}) #6
)

The code chunk above is an example of a chaining method - I use chaining in all my code to reduce the number of intermediate DataFrames and have everything clean and concise.  Having all the code contained in ( ) negates the opportunity for white space errors.

What does all this do? 

  • #2 and #3 create new variables.  
  • #4, the .groupby method groups the data by the newly created "budgetGroup" and "The_mean_is" variables. 
  • #6 counts the total observations associated with "budgetGroup" and  calculates the minimum, maximum, and mean for "budget."  
Budget as 400+ NaN observations/rows.  As a result, when I tried to count budget, NaN is nothing, so Python could not count NaNs.  That makes sense - nothing is nothing.  So, I counted "budgetGroup" where we accounted for NaN = "Missing."

Chaining
This is the power of chaining - we go from raw data, create two variables, group by, and define the calculations all without defining intermediate DataFrames - very efficient. 

Link to David Amos's blog post on chaining with Matt Harrison.


Comments

Popular posts from this blog

Blog Topics

Drawing Tables with ReportLab: A Comprehensive Example

DataFrame groupby agg style bar