Understanding np.where and its Applications in Pandas


NumPy's np.where function is a versatile tool that can be used for conditional execution. It's especially handy when working with large datasets in pandas DataFrame or Series objects. In this post, we'll unravel the use of np.where in a real-life data manipulation task involving a movie dataset.

Here's a code snippet we'll be discussing (Google Colab Interactive Example):

************************Code Snippet*******************************;

import pandas as pd

import numpy as np

loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'

m = pd.read_csv(loc)

m1 = (

    m

    .assign(color = lambda x: x.color.fillna('Missing')) #Fill in missing values with "Missing"

    .assign(bw = lambda x: np.where(x.color != 'Color', 1, 0)) #Create a new column called "bw"

    .assign(bw_after_1939 = lambda x: np.where((x.title_year > 1939) & (x.color.str.startswith('B') | x.color.str.startswith('b')), 1, 0)) 

)

# Print some statistics

m1.pivot_table(index = 'color', values = 'bw', aggfunc = ['count','sum'])

m1.pivot_table(index = 'color', values = 'bw_after_1939', aggfunc = ['count','sum'])

*********************End of Code Snippet****************************;

About np.where

np.where is a function that returns elements chosen from two arrays depending on the condition. It's used when we want to select elements from two different arrays based on a condition. Here, it's used to generate a new array with values conditionally selected from two alternatives - 1 and 0.

Code Walkthrough

The code begins with loading a CSV file from a URL into a pandas DataFrame. It then uses the assign method to create or modify columns.

The color column is first filled with the string 'Missing' wherever NaN values are found.
A new column, bw, is created, where if the color of the movie isn't 'Color', it's assumed to be a black-and-white movie and marked with 1, otherwise 0. The Google Colab describes how this got me in trouble. 

Another column, bw_after_1939, is created to mark black-and-white movies made after 1939, considering that the movie color starts with 'B' or 'b'. If the condition is satisfied, the entry in the column is 1; otherwise, it's 0.

After the DataFrame is prepared, the code creates pivot tables summarizing the count and sum of movies for different colors based on the two 'bw' columns.

I prefer np.select when I need to resolve to more then a boolean value.  For example, if color = "color" and title_year > 1939, return "After 1939", else if color = "color" and title_year <1939, return "Before color technology was available," Else return "After 1939 and utilize color technology."  That is anther post on np.select.

Conclusion

Using np.where can simplify conditional logic and make your pandas code more efficient and readable. It's worth noting that it's not the only way to handle conditional logic in pandas. Alternatives include using apply with a function, or more pandas-native methods like where or mask - but I consider a "code smell" and should be not used unless there is no other option, if you see it in your code, remove it immediately! np.where can provide a performance benefit, especially with larger datasets - it is very quick. Happy data wrangling!

Comments

Popular posts from this blog

Blog Topics

Drawing Tables with ReportLab: A Comprehensive Example

DataFrame groupby agg style bar