Pandas rank

 The goal of this article is to dive into rank in Python's Pandas. We will rank the total gross of a movie and keep the top most-grossing movies. Google Colab Link

import pandas as pd
import numpy as np
loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns',None)
pd.options.display.min_rows = 10

Now that we have our data, we will walk through ranking movies by the variable gross. 

Below is my first attempt.  I wanted to chain everything together to be as efficient as possible - which I define as ease of reading and debugging, don't make a bunch of DataFrames between analysis steps, and accuracy, of course. 

Below is my first attempt. I wanted the top ten grossing movies from the movies DataFrame.  So, I created a variable called ranked at #1.  I knew I wanted to query in the same chain, so I created a lambda function so the column/variable is usable immediately.  

(
m
    .assign(ranked = lambda df_: df_.gross.rank()) #1
    #.query('ranked <= 10')                        #2
    .sort_values(by='ranked', ascending=False)     #3
    .reset_index()                                 #4
    .loc[0:9,['director_name','movie_title','gross','ranked','index']]    #5

)

At #2, you can see where I tried to use the .query method.  However, the way ranked worked, it ranked gross ascending.  Meaning, the smallest value was ranked 1.  This is the opposite of what I wanted.  I could mess around with -1 ideas with loc, but I didn't want to go down that path. 

But, I knew we could get around the order issue.  So, at #3 I sorted the values.  This gave me the highest-grossing movies at the top of the data frame.  Next, I used loc to keep only the top ten rows and the specific columns of interest.  At #4, I resent the index and added 'index' to keep the original index. Adding index to loc gives me the original "ID," aka index.  I wanted to keep this just in case we wanted to join back to the original DataFrame to pull in other columns.

This produced the exact DataFrame I needed.  


But, it all felt awkward with the sort_values and loc method.  I felt like there was a better method. So, I read the documentation.  Fortunately, we can reverse the order of the rankings with asceding=False.  Now, it ranks the top-grossing movie as 1.  

(

m
    .assign(ranked = m.gross.rank(ascending=False))
    .query('ranked <= 10')
    .loc[:,['director_name','movie_title','gross','ranked']]
)

Output:


No, they are no longer sorted by ranked, but it gave us the same data.  

The next question is, which is more efficient.  Well, the %%timeit in Jupyter Notebook was marginally faster with the first method - 11.4 vs. 12.8 ms.  In my world, a difference measured in single-digit ms is not an advantage.  

Which do I prefer?  I liked the first method because it required me to learn a few tricks.  For analysis code that may be inherited, I like the second method for its simplicity. 


Comments

Popular posts from this blog

Recode with np.select

Python Topic Roadmap

PD.CUT