Course Data
Generating and Manipulating Random Data in Pandas
Creating a DataFrame with random data is a common task in data analysis for testing and simulation. This example demonstrates the generation and manipulation of a Pandas DataFrame that represents a dataset of college courses.
The dataset includes course prefixes, locations, delivery modes, and a mapping from each prefix to a course name and number. The first technique in the code uses Python's random module to generate the random values for the DataFrame. In our example, random.choice() picks one element at a time from a predefined list: each course prefix is drawn from a list of department abbreviations (pre_list), and each location is drawn from a list of campus locations. The selection is repeated once per row, and because every draw is independent, the same value can appear in many rows. This approach is useful when we need mock data that mimics real-life variability, for example in datasets built for testing or training purposes.
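As a minimal sketch of that row-by-row selection, the snippet below draws prefixes and locations with random.choice(). The call to random.seed() and the shortened series_length of 5 are assumptions added here so the result is repeatable and easy to read; neither is part of the original script.

```python
import random

# Assumption for illustration: a fixed seed makes the draws repeatable.
random.seed(42)

pre_list = ['ENG', 'PSY', 'MAT', 'HIT', 'PHY', 'CHM', 'BIO', 'CIS']
locations = ['Centennial', 'Main']
series_length = 5  # shortened from the script's 50 for readability

# One independent draw per row; values may repeat because every call
# samples from the full list again.
random_series = [random.choice(pre_list) for _ in range(series_length)]
random_location = [random.choice(locations) for _ in range(series_length)]

print(random_series)
print(random_location)
```

Removing the random.seed() line gives a different set of values on each run, which is how the full script behaves.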
On the other hand, the map method with dictionaries is used to create a direct mapping from one set of values to another. In this script, it's employed to translate course prefixes to full course names and numbers. The map function takes a dictionary as an argument, where the keys are the items in the original series (in this case, the course prefixes) and the values are the corresponding items we want to map to (course names and numbers). By using map, the code efficiently translates each course prefix into a more descriptive course name and assigns a corresponding course number. This method is highly efficient for data transformation tasks where each item in a series has a direct, predefined corresponding value.
Initial Setup and Data Generation
We begin by importing necessary libraries and defining key data elements:
- Pandas: For DataFrame operations.
- Random: Python's standard-library random module, used to make the random choices.
Creating the DataFrame
Using random choices, we construct a DataFrame that includes:
- Course Prefix: A random selection from predefined prefixes.
- Course Name: Mapped from the course prefix.
- Course Number: Also mapped from the prefix.
- Course Level: Derived from the course number (graduate if the number is above 500, otherwise undergraduate).
- Location: Randomly chosen location.
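- Mode of Delivery: Randomly selected for each row from modes such as lecture, lab, and tutorial. The code iterates over the prefix column to do this, but the prefix value does not influence the choice.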
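The mapped and derived columns in the list above can be tried on a small, hand-written set of prefixes. This is a sketch rather than the script itself: the four sample rows, the unmapped prefix 'XYZ', and the names name_lookup and number_lookup are assumptions made for illustration. It shows that map() leaves a missing value (NaN) for a key that is not in the dictionary, and that the level rule treats a course number of exactly 500 as undergraduate.

```python
import pandas as pd

pre_list = ['ENG', 'PSY', 'MAT', 'HIT', 'PHY', 'CHM', 'BIO', 'CIS']
course_map = ['English', 'Psychology', 'Mathematics', 'History', 'Physics',
              'Chemistry', 'Biology', 'Computer Science']
number_map = [100, 200, 300, 400, 500, 600, 700, 800]

# Hypothetical helper names: prefix -> course name, prefix -> course number
name_lookup = dict(zip(pre_list, course_map))
number_lookup = dict(zip(pre_list, number_map))

# Hand-picked rows; 'XYZ' is deliberately absent from pre_list
df = pd.DataFrame({'prefix': ['ENG', 'PHY', 'CIS', 'XYZ']})

# map() looks each prefix up in the dictionary; unknown keys become NaN
df['CourseName'] = df['prefix'].map(name_lookup)
df['Number'] = df['prefix'].map(number_lookup)

# Strictly greater than 500 means graduate, so 500 stays undergraduate
df['CourseLevel'] = df['Number'].apply(
    lambda n: 'Graduate' if n > 500 else 'Undergraduate')

print(df)
```

Note that the 'XYZ' row is labelled 'Undergraduate' because comparing NaN with 500 evaluates to False. In the full script every prefix is drawn from pre_list, so this case does not arise there.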
Here's a snippet of the code that achieves this:
import pandas as pd
import random

# Define the lists and the series length
series_length = 50
locations = ['Centennial', 'Main']
modes = ['Lecture', 'Lab', 'Tutorial']
pre_list = ['ENG', 'PSY', 'MAT', 'HIT', 'PHY', 'CHM', 'BIO', 'CIS']
course_map = ['English', 'Psychology', 'Mathematics', 'History', 'Physics',
              'Chemistry', 'Biology', 'Computer Science']
number_map = [100, 200, 300, 400, 500, 600, 700, 800]

# Generate the random series
random_series = [random.choice(pre_list) for _ in range(series_length)]
random_location = [random.choice(locations) for _ in range(series_length)]

# Create a DataFrame
df = pd.DataFrame(random_series, columns=['prefix'])
df = (df[['prefix']]
      .assign(
          CourseName=lambda x: x['prefix'].map(dict(zip(pre_list, course_map))),
          Number=lambda x: x['prefix'].map(dict(zip(pre_list, number_map))),
          # If the number is greater than 500, it is a graduate course
          CourseLevel=lambda x: x['Number'].apply(
              lambda y: 'Graduate' if y > 500 else 'Undergraduate'),
          Location=random_location,
          # One random mode per row; the prefix value itself is not used
          Mode=lambda x: x['prefix'].apply(lambda y: random.choice(modes)),
      ))

print(df)
Below is the example output from Google Colab.
This code effectively showcases the versatility of Pandas in creating and manipulating data, providing a solid foundation for simulating more complex data scenarios. Run the code on Google Colab.