Pandas str
Exploring String Operations in Pandas with Movie Genres
In data analysis, string manipulation is a frequent requirement, and Pandas offers powerful tools to make this task more efficient and intuitive. In this post, we delve into the .str accessor in Pandas, using a dataset of movie genres to showcase its capabilities.
Dataset Overview
The dataset used in this example contains information about movies, specifically their genres. We'll be using Pandas to manipulate and extract information from the 'genres' column. Here's a preview of the dataset:
import pandas as pd
import numpy as np

loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv'
m = pd.read_csv(loc)

pd.set_option('display.max_columns', None)
pd.options.display.min_rows = 10
m.head()
String Operations
Let's explore various string operations using the .str accessor:
- Extracting First and Last Genre: We can split the genre string and extract the first and last genre using .str.split('|').str[0] and .str.split('|').str[-1].
- Counting Genres: The number of genres per movie can be found with .str.split('|').str.len().
- Creating Genre Lists: Transforming the genre string into a list is as simple as .str.split('|').
- Counting Characters: The total number of characters in the genres string is obtained using .str.len().
- Extracting Specific Characters: We demonstrate extracting the first character, the last character, and the first three characters of each genre string.
- Counting Occurrences of 'a' and 'A': Counting specific characters in a string, case-sensitive and case-insensitive, is another handy operation.
Here's how these operations are implemented:
m_str_fun = (
    m[['genres']]
    .assign(
        # first_genre pulls the first genre from the list
        first_genre=lambda x: x['genres'].str.split('|').str[0],
        # last_genre pulls the last genre from the list
        last_genre=lambda x: x['genres'].str.split('|').str[-1],
        # genre_count_col holds the number of genres in the list
        genre_count_col=lambda x: x['genres'].str.split('|').str.len(),
        # Creates a list from the genres column
        genre_list_col=lambda x: x['genres'].str.split('|'),
        # Counts the number of characters in the genres column
        count_of_characters=lambda x: x['genres'].str.len(),
        # Pulls the first character of the genres string
        first_character=lambda x: x['genres'].str[0],
        # Pulls the last character of the genres string
        last_character=lambda x: x['genres'].str[-1],
        # Pulls the first 3 characters of the genres string
        first_three_characters=lambda x: x['genres'].str[0:3],
        # Counts how many times lowercase 'a' appears in the genres column
        Count_of_a=lambda x: x['genres'].str.count('a'),
        # Counts how many times 'A' or 'a' appears (the pattern is a regular expression)
        Count_of_A_a=lambda x: x['genres'].str.count('[Aa]'),
    )
)
m_str_fun.head(10)
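To check what these operations return without downloading the CSV, here is a minimal sketch on a two-row frame. The `toy` data and the column names are made up for illustration and are not part of the movie dataset.

```python
import pandas as pd

# Hypothetical stand-in for the 'genres' column of the movie dataset.
toy = pd.DataFrame({"genres": ["Action|Adventure|Drama", "Comedy"]})

# Split once and reuse the resulting Series of lists.
parts = toy["genres"].str.split("|")

summary = toy.assign(
    first_genre=parts.str[0],           # first element of each list
    last_genre=parts.str[-1],           # last element of each list
    genre_count=parts.str.len(),        # list length, i.e. number of genres
    first_three=toy["genres"].str[0:3], # first three characters of the raw string
    count_a=toy["genres"].str.count("a"),              # lowercase 'a' only
    count_a_any_case=toy["genres"].str.count("[Aa]"),  # regex class: 'A' or 'a'
)

print(summary.T)
```

Splitting once and reusing `parts` avoids repeating the split for every derived column; the lambda style in the post does the same work but re-splits each time.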
Case Transformations
The case of the text can be transformed using different methods:
- All Upper Case: Convert all text to upper case.
- All Lower Case: Convert all text to lower case.
- Capitalize: Capitalize the first letter of each string.
- Title Case: Capitalize the first letter of each word.
- Swap Case: Swap the case of each letter in the string.
all_upper=lambda x: x['genres'].str.upper(),
all_lower=lambda x: x['genres'].str.lower(),
capitalize=lambda x: x['genres'].str.capitalize(),
title=lambda x: x['genres'].str.title(),
swapcase=lambda x: x['genres'].str.swapcase()
The full implementation applies these operations to the 'genres' column, along with a .str.replace() call that swaps "Action" for "Act":
m_str_fun1 = (
    m[['genres']]
    .assign(
        # Replace all "Action" with "Act"
        replace_action=lambda x: x['genres'].str.replace('Action', 'Act'),
        # All upper case
        all_upper=lambda x: x['genres'].str.upper(),
        # All lower case
        all_lower=lambda x: x['genres'].str.lower(),
        # Capitalize
        capitalize=lambda x: x['genres'].str.capitalize(),
        # Title case
        title=lambda x: x['genres'].str.title(),
        # Swap case
        swapcase=lambda x: x['genres'].str.swapcase(),
    )
)
m_str_fun1.head(10)
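The difference between capitalize and title is easiest to see on mixed-case input. The sketch below uses a made-up two-value Series (not taken from the movie dataset) and passes regex=False to .str.replace() so the literal match is explicit.

```python
import pandas as pd

# Hypothetical sample values chosen to expose the differences between methods.
toy = pd.Series(["Action|sci-fi", "film noir"], name="genres")

cases = pd.DataFrame({
    "upper": toy.str.upper(),
    "lower": toy.str.lower(),
    # capitalize: first character upper, everything else lower
    "capitalize": toy.str.capitalize(),
    # title: upper-cases the letter after every non-letter ('|', '-', space)
    "title": toy.str.title(),
    "swapcase": toy.str.swapcase(),
    # literal (non-regex) replacement of "Action" with "Act"
    "replace_action": toy.str.replace("Action", "Act", regex=False),
})

print(cases.T)
```

Because title() treats '|' and '-' as word boundaries, "Action|sci-fi" becomes "Action|Sci-Fi", while capitalize() leaves it as "Action|sci-fi".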
Random String Operations in Pandas
In this section, we explore a variety of more specialized string operations in Pandas. These demonstrate the flexibility of the .str accessor when manipulating the 'genres' column in our movie dataset.
Creating Custom Prefixes
Custom prefixes can be added to the first and last word of the genre strings:
- Prefix for First Genre: Adding a '~' before the first genre.
- Prefix for Last Genre: Adding a '~' before the last genre.
cat=lambda x: '~' + x['genres'].str.split('|').str[0],
dog=lambda x: '~' + x['genres'].str.split('|').str[-1],
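As a quick check of the prefixing step, here is a small sketch on made-up data. The `toy` frame is hypothetical; the column names cat and dog follow the post.

```python
import pandas as pd

# Hypothetical stand-in for the 'genres' column.
toy = pd.DataFrame({"genres": ["Action|Adventure", "Comedy"]})
parts = toy["genres"].str.split("|")

prefixed = toy.assign(
    cat="~" + parts.str[0],   # '~' in front of the first genre
    dog="~" + parts.str[-1],  # '~' in front of the last genre
)

print(prefixed)
```

For a single-genre row such as "Comedy", the first and last genre coincide, so cat and dog hold the same value.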
String Replacement and Partitioning
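Using methods like slice_replace() and partition() for targeted string manipulation:
- Replacing the First Six Characters: slice_replace(0, 6, 'Nothing') replaces characters 0 through 5 with "Nothing". This matches the first word only when the first genre is exactly six characters long, as with "Action".
- Partitioning: partition() splits each string at the first '|' into three parts (the text before the separator, the separator itself, and the text after it); we keep the first part.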
slice_replace=lambda x: x['genres'].str.slice_replace(0, 6, 'Nothing'),
partition=lambda x: x['genres'].str.partition(sep='|')[0],
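The sketch below shows how positional replacement behaves when the first genre is not six characters long, and what partition() returns. The `toy` data is made up, and the regex-based `by_pattern` variant is one possible alternative for replacing the first genre regardless of its length, not something the original code uses.

```python
import pandas as pd

# Hypothetical sample: first genres of length 6, 5, and 7.
toy = pd.DataFrame({"genres": ["Action|Drama", "Drama|Romance", "Western"]})

# Positional: characters 0-5 are replaced no matter what they contain.
by_position = toy["genres"].str.slice_replace(0, 6, "Nothing")

# Assumed alternative: replace everything before the first '|' via a regex.
by_pattern = toy["genres"].str.replace(r"^[^|]+", "Nothing", regex=True)

# partition() returns a DataFrame with three columns: before, separator, after.
pieces = toy["genres"].str.partition(sep="|")

print(by_position.tolist())
print(by_pattern.tolist())
print(pieces)
```

With "Drama|Romance", the positional version also consumes the separator and yields "NothingRomance", whereas the pattern version yields "Nothing|Romance". When no separator is present, partition() leaves the second and third columns as empty strings.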
Justification and Character Counting
Adjusting text alignment and counting characters in strings:
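- Left Justification: Truncate each genre string to 20 characters, then left-justify it to a width of 20, filling with '~'.
- Keeping First 20 Characters: The first_20 column is built with the same expression as ljust, so it also holds the first 20 characters padded with '~' to exactly 20.
- Character Count: Count the characters in first_20 (always 20) and in the original genres column. Note that count_of_genres holds a character count, not the number of genres.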
ljust=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'),
first_20=lambda x: x['genres'].str.slice(0,20).str.ljust(width=20, fillchar='~'),
count_of_characters=lambda x: x['first_20'].str.len(),
count_of_genres=lambda x: x['genres'].str.len(),
Here is the implementation of these random string operations:
m_str_fun2 = (
    m[['genres']]
    .assign(
        # cat is the first genre with a "~" in front of it
        cat=lambda x: '~' + x['genres'].str.split('|').str[0],
        # dog is the last genre with a "~" in front of it
        dog=lambda x: '~' + x['genres'].str.split('|').str[-1],
        # slice_replace() swaps the first six characters (the length of "Action") for "Nothing"
        slice_replace=lambda x: x['genres'].str.slice_replace(0, 6, 'Nothing'),
        # partition() splits at the first '|' into 3 parts; keep the part before it
        partition=lambda x: x['genres'].str.partition(sep='|')[0],
        # Truncate to 20 characters, then left-justify to width 20 with '~'
        ljust=lambda x: x['genres'].str.slice(0, 20).str.ljust(width=20, fillchar='~'),
        # Same expression: first 20 characters, padded with '~'
        first_20=lambda x: x['genres'].str.slice(0, 20).str.ljust(width=20, fillchar='~'),
    )
    .assign(
        # Count the number of characters in the first_20 column
        count_of_characters=lambda x: x['first_20'].str.len(),
        # Count the number of characters in the genres column
        count_of_genres=lambda x: x['genres'].str.len(),
    )
)
m_str_fun2.head(10)
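To make the truncate-then-pad behaviour visible, here is a minimal sketch that separates the two steps. The `toy` frame and the column names truncated, padded, padded_len, and genres_len are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one string shorter than 20 characters, one longer.
toy = pd.DataFrame({"genres": ["Comedy", "Action|Adventure|Science Fiction"]})

fixed = toy.assign(
    # Step 1: keep at most the first 20 characters.
    truncated=lambda x: x["genres"].str.slice(0, 20),
    # Step 2: pad on the right with '~' up to width 20.
    padded=lambda x: x["truncated"].str.ljust(width=20, fillchar="~"),
    # Every padded value is exactly 20 characters long.
    padded_len=lambda x: x["padded"].str.len(),
    # Length of the original string, for comparison.
    genres_len=lambda x: x["genres"].str.len(),
)

print(fixed.T)
```

Later keyword arguments in a single assign() call can refer to columns created earlier in the same call, which is why padded can read truncated here; splitting the work across two assign() calls, as the post does, gives the same result.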
Conclusion
The .str accessor in Pandas is a versatile tool for string manipulation. It simplifies complex operations and makes code more readable, as demonstrated with our movie genre dataset. Whether you're a beginner or an experienced data analyst, these techniques are fundamental in the realm of data processing and analysis. Explore the code in Google Colab.