Pandas str

Exploring String Operations in Pandas with Movie Genres

String manipulation comes up constantly in data analysis, and Pandas provides vectorized string methods for it through the .str accessor. In this post, we walk through that accessor using a dataset of movie genres.

Dataset Overview

The dataset used in this example contains information about movies, including their genres. Each value in the 'genres' column is a single string with the genres separated by '|', and we'll use Pandas to manipulate and extract information from that column. Here's a preview of the dataset:

import pandas as pd
import numpy as np

loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)
pd.set_option('display.max_columns', None)
pd.options.display.min_rows = 10
m.head()
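
If the CSV is not reachable, a small stand-in frame is enough to follow along. This is only a sketch: the three rows below are hypothetical sample values, not rows copied from movie.csv, and the name toy is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'genres' column of movie.csv.
toy = pd.DataFrame({
    "genres": [
        "Action|Adventure|Fantasy|Sci-Fi",
        "Comedy",
        "Drama|Romance",
    ]
})

# .str applies a string method to every value in the column at once.
print(toy["genres"].str.len())
```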
    

String Operations

Let's explore various string operations using the .str accessor:

  • Extracting First and Last Genre: We can split the genre string and extract the first and last genre using .str.split('|').str[0] and .str.split('|').str[-1].
  • Counting Genres: The number of genres per movie can be found with .str.split('|').str.len().
  • Creating Genre Lists: Transforming the genre string into a list is as simple as .str.split('|').
  • Counting Characters: The total number of characters in the genres string is obtained using .str.len().
  • Extracting Specific Characters: Indexing and slicing through .str work on the whole genres string, so .str[0], .str[-1], and .str[0:3] return its first character, last character, and first three characters.
  • Counting Occurrences of 'a' and 'A': .str.count('a') counts only lowercase 'a' (case-sensitive), while .str.count('[Aa]') counts both cases.

Here's how these operations are implemented:

m_str_fun = (m[['genres']]
     .assign(
     # first_genre: the first genre in the '|'-separated string
     first_genre=lambda x: x['genres'].str.split('|').str[0],
     # last_genre: the last genre in the '|'-separated string
     last_genre=lambda x: x['genres'].str.split('|').str[-1],
     # genre_count_col: the number of genres in the string
     genre_count_col=lambda x: x['genres'].str.split('|').str.len(),
     # genre_list_col: the genres as a Python list
     genre_list_col=lambda x: x['genres'].str.split('|'),
     # count_of_characters: total characters in the genres string
     count_of_characters=lambda x: x['genres'].str.len(),
     # first_character: first character of the whole genres string
     first_character=lambda x: x['genres'].str[0],
     # last_character: last character of the whole genres string
     last_character=lambda x: x['genres'].str[-1],
     # first_three_characters: first 3 characters of the whole genres string
     first_three_characters=lambda x: x['genres'].str[0:3],
     # Count_of_a: occurrences of lowercase 'a' only (case-sensitive)
     Count_of_a=lambda x: x['genres'].str.count('a'),
     # Count_of_A_a: occurrences of either 'A' or 'a'
     Count_of_A_a=lambda x: x['genres'].str.count('[Aa]')
))
m_str_fun.head(10)
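
To see what these operations return, here is a minimal sketch on a small hand-made Series (hypothetical values, not rows from movie.csv). It also shows flags=re.IGNORECASE as one possible alternative to the '[Aa]' character class; that variant is not part of the original code.

```python
import re
import pandas as pd

# Hypothetical sample values standing in for the 'genres' column.
genres = pd.Series(["Action|Adventure|Fantasy|Sci-Fi", "Comedy", "Drama|Romance"])

parts = genres.str.split("|")   # each element becomes a Python list
first = parts.str[0]            # first genre in each list
last = parts.str[-1]            # last genre in each list
n_genres = parts.str.len()      # list length, i.e. number of genres

# .str.count takes a regular expression; 'a' matches lowercase only.
count_lower_a = genres.str.count("a")
# Same pattern, but ignoring case (equivalent to '[Aa]' here).
count_any_a = genres.str.count("a", flags=re.IGNORECASE)

print(pd.DataFrame({
    "first": first, "last": last, "n_genres": n_genres,
    "count_lower_a": count_lower_a, "count_any_a": count_any_a,
}))
```

Note that .str.len() means two different things here: on the split result it counts list elements, while on the original strings it counts characters.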
    

Case Transformations

The case of the text can be transformed using different methods:

  • All Upper Case: Convert all text to upper case.
  • All Lower Case: Convert all text to lower case.
  • Capitalize: Capitalize the first letter of each string.
  • Title Case: Capitalize the first letter of each word.
  • Swap Case: Swap the case of each letter in the string.

all_upper=lambda x: x['genres'].str.upper(),
all_lower=lambda x: x['genres'].str.lower(),
capitalize=lambda x: x['genres'].str.capitalize(),
title=lambda x: x['genres'].str.title(),
swapcase=lambda x: x['genres'].str.swapcase()
    

The full snippet below applies these case transformations to the 'genres' column, along with a .str.replace() call that shortens "Action" to "Act":

m_str_fun1 = (m[['genres']]
    .assign(#Replace all "Action" with "Act"
    replace_action=lambda x: x['genres'].str.replace('Action', 'Act'),
    #All Upper
    all_upper=lambda x: x['genres'].str.upper(),
    #All Lower
    all_lower=lambda x: x['genres'].str.lower(),
    #Capitalize
    capitalize=lambda x: x['genres'].str.capitalize(),
    #Title
    title=lambda x: x['genres'].str.title(),
    #Swapcase
    swapcase=lambda x: x['genres'].str.swapcase()                            
))
m_str_fun1.head(10)
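
The differences between capitalize(), title(), and swapcase() are easiest to see on mixed-case input. The sketch below uses two hypothetical strings rather than the real dataset; regex=False on replace() is an addition here to make the literal match explicit.

```python
import pandas as pd

# Hypothetical mixed-case values, chosen to make the differences visible.
genres = pd.Series(["Action|Adventure", "sci-fi|DRAMA"])

cased = pd.DataFrame({
    "upper": genres.str.upper(),
    "lower": genres.str.lower(),
    # capitalize: first character upper, everything else lower
    "capitalize": genres.str.capitalize(),
    # title: every run of letters starts with an upper-case letter
    "title": genres.str.title(),
    "swapcase": genres.str.swapcase(),
    # literal (non-regex) replacement of "Action" with "Act"
    "replaced": genres.str.replace("Action", "Act", regex=False),
})
print(cased)
```

Because '|' and '-' are not letters, title() treats each genre (and each half of "sci-fi") as its own word, whereas capitalize() only touches the very first character of the whole string.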
    
Random String Operations in Pandas

In this section, we explore a variety of more specialized string operations in Pandas. These demonstrate the flexibility of the .str accessor when manipulating the 'genres' column in our movie dataset.

Creating Custom Prefixes

A custom prefix can be prepended to the first and last genre of each string using ordinary string concatenation:

  • Prefix for First Genre: Adding a '~' before the first genre.
  • Prefix for Last Genre: Adding a '~' before the last genre.
cat=lambda x: '~' + x['genres'].str.split('|').str[0],
dog=lambda x: '~' + x['genres'].str.split('|').str[-1],
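
As a quick check of what the concatenation produces, here is a small sketch on hypothetical values (the variable names first_prefixed and last_prefixed are made up for illustration):

```python
import pandas as pd

# Hypothetical sample values, not rows from movie.csv.
genres = pd.Series(["Action|Adventure|Fantasy", "Comedy"])

parts = genres.str.split("|")
first_prefixed = "~" + parts.str[0]    # a plain str + Series works element-wise
last_prefixed = "~" + parts.str[-1]

print(pd.DataFrame({"cat": first_prefixed, "dog": last_prefixed}))
```

For a single-genre row such as "Comedy", the first and last genre are the same, so both columns hold "~Comedy".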
    

String Replacement and Partitioning

Using methods like slice_replace() and partition() for targeted string manipulation:

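  • Replacing the First Six Characters: slice_replace(0, 6, 'Nothing') replaces the characters at positions 0 through 5 with "Nothing". This matches the first word only when the first genre is exactly six characters long, as in "Action".
  • Partitioning: partition(sep='|') splits each string at the first '|' into three parts (the text before it, the separator itself, and the text after it); selecting [0] keeps the first part.

slice_replace=lambda x: x['genres'].str.slice_replace(0, 6, 'Nothing'),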
partition=lambda x: x['genres'].str.partition(sep='|')[0],
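
The sketch below, on hypothetical values, shows how slice_replace() behaves when the first genre is not six characters long, and what partition() returns before [0] is selected. The regex-based by_pattern column is an alternative added for comparison, not something the original code uses.

```python
import pandas as pd

# Hypothetical sample values with first genres of different lengths.
genres = pd.Series(["Action|Adventure", "Drama|Romance", "Comedy"])

# Position-based: always replaces characters 0..5, whatever they are.
by_position = genres.str.slice_replace(0, 6, "Nothing")

# Pattern-based alternative: replace everything before the first '|'.
by_pattern = genres.str.replace(r"^[^|]+", "Nothing", regex=True)

# partition returns a DataFrame with columns 0 (before), 1 (sep), 2 (after).
parts = genres.str.partition("|")

print(pd.DataFrame({"by_position": by_position, "by_pattern": by_pattern}))
print(parts)
```

For "Drama|Romance" the position-based version swallows the separator and yields "NothingRomance", which is why a pattern-based replacement may be the safer choice when genre lengths vary.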
    

Justification and Character Counting

Adjusting text alignment and counting characters in strings:

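  • Left Justification: Truncate each genre string to its first 20 characters, then left-justify it to a width of 20, padding on the right with '~'.
  • Keeping First 20 Characters: The first_20 column uses the same expression as ljust, so it also holds the first 20 characters padded with '~' to exactly 20.
  • Character Count: count_of_characters measures the first_20 column (always 20), while count_of_genres, despite its name, is the character length of the original genres string.

ljust=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'),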
first_20=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'),
count_of_characters=lambda x: x['first_20'].str.len(),
count_of_genres=lambda x: x['genres'].str.len(),
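
To separate the effect of slice() from that of ljust(), here is a minimal sketch on two hypothetical strings, one shorter and one longer than 20 characters (the names truncated and padded are made up for illustration):

```python
import pandas as pd

# Hypothetical values: one shorter and one longer than 20 characters.
genres = pd.Series(["Comedy", "Action|Adventure|Fantasy|Sci-Fi"])

truncated = genres.str.slice(0, 20)                      # at most 20 characters
padded = truncated.str.ljust(width=20, fillchar="~")     # exactly 20 characters

print(pd.DataFrame({
    "truncated": truncated,
    "padded": padded,
    "len_truncated": truncated.str.len(),
    "len_padded": padded.str.len(),
}))
```

slice() only shortens long strings and ljust() only lengthens short ones, so chaining them produces fixed-width output.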
    

Here is the implementation of these random string operations:

m_str_fun2 = (m[['genres']]
   .assign(
       # cat: the first genre with a "~" in front of it
       cat=lambda x: '~' + x['genres'].str.split('|').str[0],
       # dog: the last genre with a "~" in front of it
       dog=lambda x: '~' + x['genres'].str.split('|').str[-1],
       # slice_replace: replace the first six characters with "Nothing"
       slice_replace=lambda x: x['genres'].str.slice_replace(0, 6, 'Nothing'),
       # partition: split at the first '|' into 3 parts and keep the part before it
       partition=lambda x: x['genres'].str.partition(sep='|')[0],
       # ljust: first 20 characters, left-justified and padded with '~' to width 20
       ljust=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'),
       # first_20: same expression as ljust (first 20 characters, padded with '~')
       first_20=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'))
   .assign(
       # count_of_characters: number of characters in the first_20 column
       count_of_characters=lambda x: x['first_20'].str.len(),
       # count_of_genres: number of characters (not genres) in the genres column
       count_of_genres=lambda x: x['genres'].str.len()))
m_str_fun2.head(10)
    

Conclusion

The .str accessor in Pandas is a versatile tool for string manipulation. It replaces explicit loops with short, chainable method calls, which keeps code readable, as demonstrated with our movie genre dataset. Whether you're a beginner or an experienced data analyst, these techniques are worth having at hand for everyday data cleaning. Explore the code in Google Colab.
