Pandas rank
The goal of this article is to dive into rank in Python's Pandas. We will rank the total gross of a movie and keep the top most-grossing movies. Google Colab Link
import pandas as pd
import numpy as np
loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns',None)
pd.options.display.min_rows = 10
Now that we have our data, we will walk through ranking movies by the variable gross.
Below is my first attempt. I wanted to chain everything together to be as efficient as possible - which I define as ease of reading and debugging, don't make a bunch of DataFrames between analysis steps, and accuracy, of course.
Below is my first attempt. I wanted the top ten grossing movies from the movies DataFrame. So, I created a variable called ranked at #1. I knew I wanted to query in the same chain, so I created a lambda function so the column/variable is usable immediately.
(
m
.assign(ranked = lambda df_: df_.gross.rank()) #1
#.query('ranked <= 10') #2
.sort_values(by='ranked', ascending=False) #3
.reset_index() #4
.loc[0:9,['director_name','movie_title','gross','ranked','index']] #5
)
At #2, you can see where I tried to use the .query method. However, the way ranked worked, it ranked gross ascending. Meaning, the smallest value was ranked 1. This is the opposite of what I wanted. I could mess around with -1 ideas with loc, but I didn't want to go down that path.
But, I knew we could get around the order issue. So, at #3 I sorted the values. This gave me the highest-grossing movies at the top of the data frame. Next, I used loc to keep only the top ten rows and the specific columns of interest. At #4, I resent the index and added 'index' to keep the original index. Adding index to loc gives me the original "ID," aka index. I wanted to keep this just in case we wanted to join back to the original DataFrame to pull in other columns.
This produced the exact DataFrame I needed.
But, it all felt awkward with the sort_values and loc method. I felt like there was a better method. So, I read the documentation. Fortunately, we can reverse the order of the rankings with asceding=False. Now, it ranks the top-grossing movie as 1.
(
m
.assign(ranked = m.gross.rank(ascending=False))
.query('ranked <= 10')
.loc[:,['director_name','movie_title','gross','ranked']]
)
Output:
Comments
Post a Comment