drop_duplicates DataFrame
Deduplicating a list is task number 1 on day 2, so enjoy this quick review of deduplicating a DataFrame based on a few parameters.
This post discusses uses for drop_duplicates. It is rarely used in isolation; there are usually a few preparatory steps before it.
The movies DataFrame has a list of movies with director_name and gross. I first want all movies with a gross of at least $1MM, and of those movies, the top-grossing movie by each director.
This is how I would approach the task with pandas and chaining. First, import the packages and data (see the Google Colab space with all the executable code):
import pandas as pd
import numpy as np
loc = r'C:\....movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns', None)  # show every column
pd.options.display.min_rows = 10            # show at least 10 rows
Second, apply the desired steps. At #1, filter out the unnecessary rows. At #2, sort the values by director name and by the movie's gross. At #3, drop the duplicates by director name; by default, drop_duplicates keeps the first observation.
However, you could remove the ascending=False at #2, add keep='last' at #3, and get the same result (only the row order differs).
(
    m
    [m.gross >= 1_000_000]  #1
    .sort_values(by=['director_name', 'gross'], ascending=False)  #2
    .drop_duplicates(subset='director_name')  #3
)
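To see that the two orderings really are equivalent, here is a minimal sketch using a small made-up DataFrame in place of the movie data (the director names and gross figures are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the movie data.
m = pd.DataFrame({
    'director_name': ['Nolan', 'Nolan', 'Bigelow', 'Bigelow', 'Lee'],
    'gross': [5_000_000, 9_000_000, 2_000_000, 800_000, 3_000_000],
})

# Variant A: sort descending, keep the first row per director.
top_a = (
    m
    [m.gross >= 1_000_000]
    .sort_values(by=['director_name', 'gross'], ascending=False)
    .drop_duplicates(subset='director_name')
)

# Variant B: sort ascending (the default), keep the last row per director.
top_b = (
    m
    [m.gross >= 1_000_000]
    .sort_values(by=['director_name', 'gross'])
    .drop_duplicates(subset='director_name', keep='last')
)

# Both keep each director's top-grossing movie of at least $1MM;
# only the row order differs, so compare after re-sorting.
a = top_a.sort_values('director_name').reset_index(drop=True)
b = top_b.sort_values('director_name').reset_index(drop=True)
print(a.equals(b))
```

Note that Bigelow's $800,000 movie never survives step #1, and within each remaining director group the highest gross ends up either first (variant A) or last (variant B), so the same rows remain.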
That is a quick rundown of chaining, sort_values, and drop_duplicates. Enjoy!