Recode with np.select
Goal: Create a new column based on data from another column. Use Pandas.DataFrame and Numpy.select to create the new column. You can see the full program at Google Colaboratory
There are three basic steps to accomplish the goal.
- Define the conditions
- Define the values
- Use np.select to apply the conditions and values.
The goal is to correlate each condition with each value. What does that mean? It means if the first condition in the "conditions" list evaluates as True, return the first value in the "values" list.
The second is the use of .mean().
- If m.budget is less than the mean calculations, return "Less than the mean."
- If m.budget is greater than the mean calculation, return "Greater than the mean."
- if m.budget is NaN, retung "Missing."
condition = (m['budget'].isnull() == True),
(m['budget'] < m['budget'].mean()),
(m['budget'] >= m['budget'].mean()) ]
values = ['Missing',
'Less than the mean',
'Greater than the mean']
Now, we want to put it all together with NumPy.select method. Below is a chained code chuck. It connects methods back-to-back, like a chain, so the program does not require intermediate DataFrames between each method, transformation, or data processing step.
Number 2 (e.g., #2) is where we call the "condition" and apply the "values" when the condition is true. For example, when the first condition is true (e.g., the value of budget is NaN), then the new variable "budgetGroup" contains "Missing," and when the budget is less than the mean, the value returned is "Less than the mean."
(m #1
.assign(budgetGroup = np.select(condition,values)) #2
.assign(The_Mean_Is = m['budget'].mean()) #3
.groupby(['budgetGroup','The_Mean_Is']) #4
.agg({'budgetGroup':['count'], #5
'budget':['min','max','mean']}) #6
)
The code chunk above is an example of a chaining method - I use chaining in all my code to reduce the number of intermediate DataFrames and have everything clean and concise. Having all the code contained in ( ) negates the opportunity for white space errors.
What does all this do?
- #2 and #3 create new variables.
- #4, the .groupby method groups the data by the newly created "budgetGroup" and "The_mean_is" variables.
- #6 counts the total observations associated with "budgetGroup" and calculates the minimum, maximum, and mean for "budget."
Chaining
This is the power of chaining - we go from raw data, create two variables, group by, and define the calculations all without defining intermediate DataFrames - very efficient.
Link to David Amos's blog post on chaining with Matt Harrison.
Comments
Post a Comment