Advanced List Comp in a DataFrame with Movie Data

Understanding Pandas and Data Manipulation in Python

Exploring Movie Data with Python and Pandas

For a Python beginner, diving into data analysis can be a bit intimidating. Libraries like Pandas make it a lot more manageable and fun. In this blog post, we'll explore a movie dataset using Pandas and work through several data manipulation tasks, from counting genres to building new columns with lambda functions and list comprehensions.

Getting Started with the Dataset

First, let's import our required libraries, Pandas and NumPy. We'll then load our movie dataset from a CSV file:


import pandas as pd
import numpy as np

loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns', None)
pd.options.display.min_rows = 10
m.head()

This code block does several things:

  • Imports Pandas and NumPy: Essential libraries for data manipulation.
  • Reads the CSV file: The pd.read_csv function loads the data from the given URL into a DataFrame.
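  • Sets display options: display.max_columns = None shows every column instead of truncating wide output, and display.min_rows = 10 sets how many rows appear when a long DataFrame is truncated.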
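To experiment with the loading step without fetching the URL, the same calls can be run against a small in-memory CSV. This is only a sketch: the two sample rows and the names sample_csv and sample are made up for illustration, and option_context is used here as a temporary alternative to the global settings in the post.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real movie.csv
sample_csv = io.StringIO(
    "movie_title,genres,imdb_score\n"
    "Movie A,Action|Drama,7.1\n"
    "Movie B,Comedy,6.4\n"
)
sample = pd.read_csv(sample_csv)

# option_context applies the display settings only inside the with-block,
# so the global defaults are unchanged afterwards
with pd.option_context('display.max_columns', None, 'display.min_rows', 10):
    print(sample.head())

print(sample.shape)  # (2, 3)
```

Setting options globally, as the post does, is convenient in a notebook; the context manager may be preferable in a script where other output should keep the defaults.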

Analyzing Genres in the Dataset

Our next step is to analyze the movie genres. We want to identify the top 5 genres and create a single string representing them:


genres = m['genres'].str.split('|', expand=True).stack().value_counts()[0:5].index.tolist()
genres_string = ', '.join(genres)

This code does the following:

  • Splits the genres: The str.split('|', expand=True) method is used to split the genres in the 'genres' column at each '|'. The expand=True parameter transforms the split elements into separate DataFrame columns.
  • Counts the genres: stack() collapses these columns back into a single Series, and value_counts() counts the occurrences of each genre in descending order, so the slice [0:5] keeps the five most frequent genres.
  • Creates a genre string: The join method concatenates the top 5 genres into a single comma-separated string.
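The split, stack, and count steps are easier to follow on a handful of rows. The sketch below uses made-up genre strings (the toy DataFrame is not part of the dataset) and keeps the top two genres instead of five. It also shows explode() as one possible alternative to expand=True plus stack(); that variant is not used in the post.

```python
import pandas as pd

# Made-up genre strings in the same pipe-delimited format as the dataset
toy = pd.DataFrame({'genres': ['Action|Drama', 'Drama', 'Comedy|Drama', 'Action']})

counts = (toy['genres']
          .str.split('|', expand=True)  # one column per genre position
          .stack()                      # collapse the columns into a single Series
          .value_counts())              # count each genre, most frequent first (missing values ignored)

top_two = counts.head(2).index.tolist()
print(top_two)             # ['Drama', 'Action']
print(', '.join(top_two))  # Drama, Action

# Alternative sketch: split into lists, then give each genre its own row
alt = toy['genres'].str.split('|').explode().value_counts()
print(alt.index.tolist())  # ['Drama', 'Action', 'Comedy']
```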

Advanced Data Manipulation

Now, let's dive into more complex data manipulation tasks:


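col = m.columns

# Rank the top genres so that the most popular gets the highest score
# (with five genres: first in the list = 5, last in the list = 1)
ranked_genres = {genre: rank for rank, genre in enumerate(reversed(genres), start=1)}

# Build a regex alternation from the genres list ('|' means "or" in a regex)
genre_pattern = '|'.join(genres)

m_genres = (m[col]
            # True if the movie has at least one of the top genres
            .assign(bool_genre=lambda x: x['genres'].str.contains(genre_pattern, na=False))
            # Most popular top genre found in the movie, or None
            .assign(specific_genre_first=lambda x: x['genres'].apply(lambda genre_str: next((genre for genre in genres if genre in genre_str), None)))
            # Least popular top genre found in the movie, or None
            .assign(specific_genre_last=lambda x: x['genres'].apply(lambda genre_str: [genre for genre in genres if genre in genre_str][-1] if any(genre in genre_str for genre in genres) else None))
            # Number of top genres found in the movie
            .assign(genre_count=lambda x: x['genres'].apply(lambda genre_str: sum(genre in genre_str for genre in genres)))
            # Sum of the ranks of the top genres found in the movie
            .assign(genre_score=lambda x: x['genres'].apply(lambda genre_str: sum(ranked_genres[genre] for genre in genres if genre in genre_str)))
            # All top genres found, in the order they appear in the movie's string
            .assign(genre_list=lambda x: x['genres'].str.findall(genre_pattern))
            # Constant column holding the comma-separated top genres
            .assign(list = lambda x: genres_string)
            # Keep the column at position 9 (expected to be 'genres') and the last six new columns;
            # bool_genre sits at position -7, so it is not part of the final selection
            .iloc[:,[9,-6,-5,-4,-3,-2,-1]]
            )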

This section of code is rich in data manipulation techniques. Here's what it does, first in summary and then column by column:

  • Creating a ranking system: The dictionary comprehension walks the reversed genres list, so the most popular genre receives the highest score (5 for a five-genre list) and the fifth most popular receives 1.
  • Using lambda functions: These are anonymous functions used here to apply complex operations within the assign method.
  • Applying various operations: The new columns range from a check for whether any top genre appears in each movie (bool_genre), to a count of matching genres (genre_count), to a score based on the ranking system (genre_score).
  • Extracting genre information: The code extracts the first and the last match in the order of the popular genres list, which means the most popular and the least popular top genre that each movie contains.
  • Creating a list of genres: The findall method is used to create a list of all matching genres in the 'genres' column.
  • Checking if a Genre Exists in Each Movie (bool_genre):

    This operation creates a new column named bool_genre in the DataFrame. The lambda function applies the str.contains() method to the 'genres' column to check whether any of the genres from our list appear in each movie's genres. It matches against genre_pattern, the top genres joined by '|'; because str.contains() treats the pattern as a regular expression by default, '|' acts as "or". The na=False argument makes rows with a missing genres value return False instead of NaN. The result is a boolean value for each movie, indicating whether it contains at least one of the top genres.

  • Finding the First Matching Genre (specific_genre_first):

    A new column, specific_genre_first, is created using a lambda function. This function loops over our predefined list in popularity order and returns the first genre that appears in the movie's genres string. If none of the listed genres are found, it returns None. Note that "first" refers to the order of the top genres list, not the order inside the movie's own string, so the column holds the most popular top genre that the movie belongs to.

  • Identifying the Last Matching Genre (specific_genre_last):

    Similar to the previous operation, specific_genre_last is a column that captures the last genre from our top genres list found in each movie's genres string. If no listed genre is found, it returns None. Because the list is ordered by popularity, this is the least popular of the top genres that the movie matches; for a movie with only one matching genre, it equals specific_genre_first.

  • Counting the Number of Matching Genres (genre_count):

    The genre_count column is created to count how many of the top genres are present in each movie's genres string. The lambda function tests each top genre once with the in operator and sums the resulting True values, so the count ranges from 0 to the length of the list. This column helps in understanding the diversity of popular genres in each movie.

  • Calculating a Score Based on Genre Popularity (genre_score):

    This operation adds a genre_score column, which assigns a score to each movie based on the presence and rank of the top genres in its genres string. The score is calculated by summing the ranks of the matched genres, with the ranks determined by their order in the reversed top genres list. With five genres, for example, a movie that matches the most popular genre (rank 5) and the third most popular (rank 3) scores 8, and a movie with no matches scores 0. This scoring system quantifies the prominence of popular genres in each movie.

  • Extracting a List of Matched Genres (genre_list):

    The genre_list column is generated using the str.findall() method, which extracts every match of genre_pattern from each movie's genres string. The matches are returned in the order they appear in that string, not in popularity order, and a movie with no matches gets an empty list. This provides a quick view of which top genres each movie contains.

  • Repeating the String of Top Genres (list):

    The list column simply replicates the string of top genres (contained in genres_string) for each row in the DataFrame. This constant column is useful for reference or comparison purposes across the dataset.
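To see these columns side by side, here is a minimal sketch on four made-up rows. It is not the post's exact code: the toy DataFrame, the three-item genres list, and the matches helper are hypothetical, the columns are built in a single assign call for brevity, and re.escape is an added precaution in case a genre name contains a regex metacharacter.

```python
import re
import pandas as pd

# Hypothetical stand-ins: in the post, `genres` holds the five most common genres
genres = ['Drama', 'Comedy', 'Action']  # most popular first
toy = pd.DataFrame({'genres': ['Action|Drama', 'Comedy', 'Horror', 'Comedy|Action|Drama']})

# Most popular genre gets the highest score: Drama=3, Comedy=2, Action=1
ranked_genres = {genre: rank for rank, genre in enumerate(reversed(genres), start=1)}

# Escape each name before joining so it is matched literally (assumption, not in the post)
genre_pattern = '|'.join(re.escape(g) for g in genres)

def matches(genre_str):
    """Return the top genres found in genre_str, in popularity order."""
    return [g for g in genres if g in genre_str]

result = toy.assign(
    bool_genre=lambda x: x['genres'].str.contains(genre_pattern, na=False),
    specific_genre_first=lambda x: x['genres'].apply(lambda s: next(iter(matches(s)), None)),
    specific_genre_last=lambda x: x['genres'].apply(lambda s: matches(s)[-1] if matches(s) else None),
    genre_count=lambda x: x['genres'].apply(lambda s: len(matches(s))),
    genre_score=lambda x: x['genres'].apply(lambda s: sum(ranked_genres[g] for g in matches(s))),
    genre_list=lambda x: x['genres'].str.findall(genre_pattern),
)

print(result.to_string())
print(result['genre_score'].tolist())  # [4, 2, 0, 6]
```

In the first row, specific_genre_first is 'Drama' even though the string starts with 'Action', because the first and last columns follow popularity order, while genre_list keeps the order of the movie's own string.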

Conclusion

This post provided a glimpse into the power of Pandas for data manipulation in Python. By exploring a real-world dataset, we demonstrated how to perform basic operations like reading data and adjusting display settings, as well as more complex tasks like data transformation and analysis. We hope this serves as a useful guide for beginners in Python and Pandas. The full code is available in the accompanying Google Colab notebook.
