Pandas rank

December 27, 2022

The goal of this article is to dive into rank in Python's Pandas. We will rank the total gross of a movie and keep the top most-grossing movies. Google Colab Link

import pandas as pd
import numpy as np
loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns',None)
pd.options.display.min_rows = 10

Now that we have our data, we will walk through ranking movies by the variable gross.

Below is my first attempt. I wanted to chain everything together to be as efficient as possible - which I define as ease of reading and debugging, don't make a bunch of DataFrames between analysis steps, and accuracy, of course.

Below is my first attempt. I wanted the top ten grossing movies from the movies DataFrame. So, I created a variable called ranked at #1. I knew I wanted to query in the same chain, so I created a lambda function so the column/variable is usable immediately.

(
m
    .assign(ranked = lambda df_: df_.gross.rank()) #1
    #.query('ranked <= 10')                        #2
    .sort_values(by='ranked', ascending=False)     #3
    .reset_index()                                 #4
    .loc[0:9,['director_name','movie_title','gross','ranked','index']]    #5

)

At #2, you can see where I tried to use the .query method. However, the way ranked worked, it ranked gross ascending. Meaning, the smallest value was ranked 1. This is the opposite of what I wanted. I could mess around with -1 ideas with loc, but I didn't want to go down that path.

But, I knew we could get around the order issue. So, at #3 I sorted the values. This gave me the highest-grossing movies at the top of the data frame. Next, I used loc to keep only the top ten rows and the specific columns of interest. At #4, I resent the index and added 'index' to keep the original index. Adding index to loc gives me the original "ID," aka index. I wanted to keep this just in case we wanted to join back to the original DataFrame to pull in other columns.

This produced the exact DataFrame I needed.

But, it all felt awkward with the sort_values and loc method. I felt like there was a better method. So, I read the documentation. Fortunately, we can reverse the order of the rankings with asceding=False. Now, it ranks the top-grossing movie as 1.

(

m
    .assign(ranked = m.gross.rank(ascending=False))
    .query('ranked <= 10')
    .loc[:,['director_name','movie_title','gross','ranked']]
)

Output:

No, they are no longer sorted by ranked, but it gave us the same data.

The next question is, which is more efficient. Well, the %%timeit in Jupyter Notebook was marginally faster with the first method - 11.4 vs. 12.8 ms. In my world, a difference measured in single-digit ms is not an advantage.

Which do I prefer? I liked the first method because it required me to learn a few tricks. For analysis code that may be inherited, I like the second method for its simplicity.

Search This Blog

Data Analytics With Python

Pandas rank

Comments

Post a Comment

Popular posts from this blog

Blog Topics

fillna

Course Data