drop_duplicates DataFrame

Finding unduplicated lists is task number one on day two, so enjoy this quick review of deduplicating a list based on a few parameters.

This post discusses uses for drop_duplicates. It is rarely used in isolation; there are usually preparatory steps before it.

The movies DataFrame has a list of movies with director_name and gross.

I first want all movies with a gross above $1 million, and of those, the top-grossing movie by each director. This is how I would approach the task with pandas and chaining.

First, import the packages and data (a Google Colab space with all the executable code):


import pandas as pd
import numpy as np

# Path to the local copy of the data (truncated here)
loc = r'C:\....movie.csv'
m = pd.read_csv(loc)

# Display settings: show all columns and at least 10 rows
pd.set_option('display.max_columns', None)
pd.options.display.min_rows = 10
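The CSV above lives on a local drive, so if you want to follow along without the file, a tiny hypothetical frame with the two relevant columns can stand in (the director names and gross figures below are invented):

```python
import pandas as pd

# Hypothetical stand-in for movie.csv; only the columns used later
m = pd.DataFrame({
    'director_name': ['Nolan', 'Nolan', 'Bigelow', 'Bigelow', 'Villeneuve'],
    'gross': [9_000_000, 2_000_000, 7_000_000, 3_000_000, 500_000],
})
print(m.shape)  # (5, 2)
```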

Second, apply the desired steps. At 1, filter out rows below the gross threshold. At 2, sort the values by director name and gross. At 3, drop the duplicates by director name; by default, drop_duplicates keeps the first observation. Alternatively, you could remove ascending=False at 2, add keep='last' at 3, and get the same results.


(
    m
    [m.gross >= 1_000_000]                                        #1
    .sort_values(by=['director_name', 'gross'], ascending=False)  #2
    .drop_duplicates(subset='director_name')                      #3
)
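To see that the keep='last' variant really is equivalent, here is a minimal sketch on a tiny hypothetical frame (director names and gross figures invented); both chains end up keeping each director's top-grossing movie above $1 million:

```python
import pandas as pd

# Hypothetical sample standing in for the movie data
m = pd.DataFrame({
    'director_name': ['Nolan', 'Nolan', 'Nolan', 'Bigelow', 'Bigelow'],
    'gross': [500_000, 2_000_000, 9_000_000, 3_000_000, 7_000_000],
})

# Variant A: sort descending, keep the first row per director
a = (
    m
    [m.gross >= 1_000_000]
    .sort_values(by=['director_name', 'gross'], ascending=False)
    .drop_duplicates(subset='director_name')
)

# Variant B: sort ascending (the default), keep the last row per director
b = (
    m
    [m.gross >= 1_000_000]
    .sort_values(by=['director_name', 'gross'])
    .drop_duplicates(subset='director_name', keep='last')
)

# Same rows either way, just reached from opposite ends of the sort
print(a.sort_index().equals(b.sort_index()))  # True
```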

That is a quick rundown of chaining, sort_values, and drop_duplicates. Enjoy!
