Simulating 5k Runner's Data

Introduction

In this tutorial, we'll explore how to create a simulated dataset for a running event that spans multiple years using Python and the powerful pandas library. This step-by-step guide is designed for beginners and will help you understand the process of data simulation and manipulation using Python.

Scenario

Imagine you're tasked with simulating a 5k running event that takes place annually for five years, from 2023 through 2027. Each year, the event sees participants of various ages and running times, and your goal is to generate a dataset that represents this dynamic event. One nuance: if a person participated in a previous year, we don't want a random age for them; we want their previous age plus one. The minimum age is 16, and the assumed fastest time is 15 minutes flat. Anyone who has run a 5k knows that is fast!

Setting Up

Before we dive into the code, make sure you have Python and the libraries the script imports (NumPy, pandas, and Plotly) installed. You can use a Jupyter Notebook, Google Colab, or any Python environment of your choice.

The Code


import numpy as np
import pandas as pd
import random
import plotly.express as px

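start_year = 2023                   # First year of the event
num_runners = 1000                  # Baseline used to size each year's field
min_age = 16                        # Minimum age (years)
mean_age = 30                       # Mean age of new participants (years)
age_std_dev = 10
min_time = 15.0                     # Fastest allowed finish time (minutes)
mean_time = 21                      # Mean finish time (minutes)
time_std_dev = 3
num_years = 5                       # Simulates 2023 through 2027
annual_participants_increase = 100  # Runners added to the baseline each year
non_returner_rate_min = 0.30        # Bounds for the yearly non-returner rate
non_returner_rate_max = 0.35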

# Create an empty DataFrame
columns = ['ID', 'Year', 'Age', 'Time']
df = pd.DataFrame(columns=columns)

# Initialize a variable to keep track of runner IDs
runner_id = 1

for year in range(start_year, start_year + num_years):
    # Calculate the number of returners and new participants
    returners = int(num_runners * 0.60)
    new_participants = num_runners + annual_participants_increase - returners

    # Generate data for returners
    if year == start_year:
        returner_data = pd.DataFrame(columns=columns)
    else:
        returner_data = df[df['Year'] == year - 1].sample(n=returners, replace=True)
        # Increment the age of returners by one
        returner_data['Age'] = np.maximum(min_age, returner_data['Age'] + 1)
        # Generate race times for returners based on the number of returners
        returner_data['Time'] = np.maximum(min_time, np.random.normal(mean_time, time_std_dev, returners))

    # Generate data for new participants
    new_data = pd.DataFrame(columns=columns)
    new_data['ID'] = range(runner_id, runner_id + new_participants)
    new_data['Age'] = np.maximum(min_age, np.random.normal(mean_age, age_std_dev, new_participants))
    new_data['Time'] = np.maximum(min_time, np.random.normal(mean_time, time_std_dev, new_participants))

    # Combine returners and new participants data
    combined_data = pd.concat([returner_data, new_data])

    # Set the year for the current data
    combined_data['Year'] = year

    # Concatenate the current data with the main DataFrame
    df = pd.concat([df, combined_data], ignore_index=True)

    # Update the runner_id for the next year
    runner_id += new_participants

    # Randomize the non-returner rate for the next year
    non_returner_rate = random.uniform(non_returner_rate_min, non_returner_rate_max)
    num_runners = int(num_runners * (1 - non_returner_rate)) + annual_participants_increase

# Display the descriptive statistics
stats_by_year = df.groupby('Year')[['Age', 'Time']].describe()
print(stats_by_year)

print(df.sample(10))
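The script draws random numbers from two sources, NumPy's `np.random` and Python's `random` module, so its output changes on every run. If repeatable results are wanted, one option is to seed both before the loop. The sketch below is an assumption about how that could be done; the helper name `seed_everything` and the seed value are illustrative and not part of the original code.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed both random number sources used by the simulation (hypothetical helper)."""
    random.seed(seed)     # controls random.uniform (non-returner rate)
    np.random.seed(seed)  # controls np.random.normal (ages and times)


# Draw the same kinds of values the simulation uses, twice, with the same seed.
seed_everything(2023)
first = (random.uniform(0.30, 0.35), np.random.normal(21, 3, 3))

seed_everything(2023)
second = (random.uniform(0.30, 0.35), np.random.normal(21, 3, 3))

# Both runs produce identical draws.
print(first[0] == second[0] and np.array_equal(first[1], second[1]))  # True
```

Note that `DataFrame.sample` also draws from NumPy's global state unless a `random_state` is passed, so seeding `np.random` covers the returner sampling as well.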



In the main for loop section of the code, we are calculating the number of returners (participants returning from the previous year) and new participants for each year of the running event simulation. This calculation is crucial for managing the participant dynamics across multiple years.

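We begin by using a `for` loop to iterate through each year in `range(start_year, start_year + num_years)`. Because `range` excludes its end value, the loop covers `start_year` up to, but not including, `start_year + num_years`, which is 2023 through 2027 with the parameters above. This loop allows us to simulate the running event over a series of years, making the simulation more dynamic.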

Within each iteration of the loop, the following calculations are performed:

1. `returners` calculation: We calculate the number of returners as 60% of `num_runners`, the baseline field size for the current year, truncated to an integer with `int`. This is the number of rows that will be drawn from the previous year's participants. In the first year there is no previous year to draw from, so the value is computed but not used.

2. `new_participants` calculation: The number of new participants for the current year is `num_runners + annual_participants_increase - returners`. From the second year on, returners and new participants therefore add up to `num_runners + annual_participants_increase`.

By performing these calculations within the loop for each year, the code adjusts the number of returners and new participants to the current value of `num_runners`, which is updated at the end of every iteration. The size and composition of the field therefore change from year to year, creating a more realistic and evolving dataset for analysis.
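To make the arithmetic concrete, here is a minimal sketch of the split for a single year. The helper name `split_participants` and its `returner_share` parameter are illustrative and do not appear in the tutorial's code; the 0.60 share and the increase of 100 are taken from it.

```python
def split_participants(num_runners, annual_increase=100, returner_share=0.60):
    """Return (returners, new_participants) for one simulated year."""
    returners = int(num_runners * returner_share)
    new_participants = num_runners + annual_increase - returners
    return returners, new_participants


returners, new_participants = split_participants(1000)
print(returners, new_participants)  # 600 500
```

With the starting value of 1,000, the split is 600 returners and 500 new participants. In the first year the loop has nobody to draw returners from, so only the 500 new participants end up in the dataset for 2023.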


    if year == start_year:
        returner_data = pd.DataFrame(columns=columns)
    else:
        returner_data = df[df['Year'] == year - 1].sample(n=returners, replace=True)
        # Increment the age of returners by one
        returner_data['Age'] = np.maximum(min_age, returner_data['Age'] + 1)
        # Generate race times for returners based on the number of returners
        returner_data['Time'] = np.maximum(min_time, np.random.normal(mean_time, time_std_dev, returners))

In the `if`/`else` section of the code, we handle the generation of data for returners, the participants coming back from the previous year's event. This is a crucial step in simulating the multi-year running event dataset.

First, we check if the current year is the starting year of the simulation, which is represented by the variable `start_year`. If it is the starting year, we initialize an empty DataFrame called `returner_data` with the specified columns, which include 'ID,' 'Year,' 'Age,' and 'Time.'

However, if the current year is not the starting year, we perform the following steps:

We extract data from the DataFrame `df` for the previous year using the condition `df['Year'] == year - 1`. This gives us the data for participants from the year before. We then use the `sample` method to randomly select returners from the previous year's data. The number of returners is determined by the variable `returners` calculated earlier. We use the `replace=True` argument so that sampling still works when `returners` exceeds the number of rows available. The trade-off is that the same runner can be drawn more than once and would then appear more than once in the same year.

Next, we increment the age of these returners by one to simulate the passage of time from one year to the next. This is achieved by updating the 'Age' column in the `returner_data` DataFrame using `returner_data['Age'] = np.maximum(min_age, returner_data['Age'] + 1)`.

Lastly, we generate race times for these returners based on the number of returners and a specified distribution. The `np.random.normal` function generates random numbers following a normal distribution with a mean of `mean_time` and a standard deviation of `time_std_dev`. We use `np.maximum` to ensure that the generated times are not below a minimum threshold defined by `min_time`. The resulting race times are added to the 'Time' column of the `returner_data` DataFrame.

This code snippet ensures that returners from previous years are appropriately aged up and assigned new race times as the simulation progresses.
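Because `replace=True` samples with replacement, the same runner ID can show up twice in one year. If that is not desired, one alternative is to sample without replacement and cap the request at the number of runners available. The sketch below is only an illustration under that assumption: the helper `pick_returners` and the five-row toy frame are hypothetical and not part of the tutorial's code.

```python
import pandas as pd


def pick_returners(previous_year, n_returners, seed=None):
    """Sample returners without replacement and age them by one year."""
    # A year cannot have more returners than the previous year had runners.
    n = min(n_returners, len(previous_year))
    picked = previous_year.sample(n=n, replace=False, random_state=seed).copy()
    picked["Age"] = picked["Age"] + 1
    return picked


# Toy data standing in for one year of the simulated event.
previous_year = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Year": 2023,
    "Age": [16, 25, 31, 40, 52],
    "Time": [19.5, 22.0, 24.3, 21.1, 27.8],
})

returning = pick_returners(previous_year, n_returners=3, seed=0)
print(returning["ID"].is_unique)  # True
```

The `.copy()` call keeps the age update from touching the original frame. The sketch leaves 'Time' unchanged; in the tutorial's code, new race times are drawn for returners after sampling.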

In the section that begins with the comment `# Generate data for new participants`, we generate data for the current year's new participants, combine it with the returner data, and update several variables to prepare for the next year's simulation. Let's break down each part:

1. `new_data` DataFrame creation: We start by creating an empty DataFrame called `new_data` with the columns defined by the variable `columns`: 'ID', 'Year', 'Age', and 'Time'. These are the attributes recorded for each participant.

2. Assigning unique IDs: Within `new_data`, we assign unique IDs to the new participants. The `range` function creates a sequence of IDs starting at `runner_id` and ending at `runner_id + new_participants - 1`, since `range` excludes its end value. This ensures that each new participant receives a distinct identifier.

3. Generating ages: For the new participants, we generate random ages from a normal distribution. The `np.random.normal` function takes `mean_age` and `age_std_dev` to control the age distribution. The `np.maximum` function raises any generated age below `min_age` to exactly `min_age`. Note that the resulting ages are floating-point values rather than whole numbers.

4. Generating race times: Similar to ages, we generate random race times for the new participants. The `np.random.normal` function is used, with parameters like `mean_time` and `time_std_dev` to control the time distribution. Again, the `np.maximum` function is employed to ensure that the generated times are not below a specified minimum time, set as `min_time`.

5. Combining data: We concatenate the data for returners and new participants to create a combined dataset called `combined_data`. This step merges the data for participants from the previous year who are returning with the data for new participants joining the event in the current year.

6. Setting the year: Within `combined_data`, we set the 'Year' attribute to the current year using `combined_data['Year'] = year`. This assigns the correct year to each participant's data in the combined dataset.

7. Concatenating with the main DataFrame: We concatenate the `combined_data` with the main DataFrame `df`. This operation appends the data for the current year to the cumulative dataset, ensuring that we have a growing dataset reflecting multiple years of the running event.

8. Updating `runner_id`: To prepare for the next year's simulation, we update the `runner_id` by incrementing it with the number of new participants (`runner_id += new_participants`). This ensures that new participants in the next year receive unique IDs without overlap with previous years.

9. Randomizing non-returner rate: Finally, we draw a random non-returner rate between `non_returner_rate_min` and `non_returner_rate_max` using `random.uniform`. The rate is used to shrink `num_runners` before the annual increase is added, which sets the baseline for the next year's simulation. The number of returners itself is always 60% of `num_runners`, so the rate affects it only indirectly.
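The update in step 9 can be isolated to see how `num_runners` evolves over the five years. The following is a minimal sketch that repeats only that one line of the loop; the function name `project_num_runners` and the `seed` parameter are illustrative additions, while the default values are the tutorial's parameters.

```python
import random


def project_num_runners(start=1000, years=5, increase=100,
                        rate_min=0.30, rate_max=0.35, seed=None):
    """Return the value of num_runners used in each simulated year."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    values = [start]
    for _ in range(years - 1):
        rate = rng.uniform(rate_min, rate_max)
        # Same update as in the tutorial's loop.
        values.append(int(values[-1] * (1 - rate)) + increase)
    return values


print(project_num_runners(seed=42))
```

With these parameters the baseline falls every year, because 30 to 35 percent of `num_runners` is more than the fixed increase of 100. Raising `annual_participants_increase` or lowering the rate bounds would change that.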

This code section manages the creation of data for new participants, combines it with returner data, and prepares the simulation for the subsequent year by updating relevant parameters.
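Steps 1 to 4 above can also be written as a single function that builds the frame in one call. The sketch below is one possible variant, not the tutorial's code: the name `make_new_participants` is hypothetical, it uses NumPy's `default_rng` generator instead of `np.random.normal`, and it rounds ages to whole years, which the original does not do.

```python
import numpy as np
import pandas as pd


def make_new_participants(first_id, n, year, min_age=16, mean_age=30,
                          age_std_dev=10, min_time=15.0, mean_time=21,
                          time_std_dev=3, seed=None):
    """Build a DataFrame of n new runners with IDs starting at first_id."""
    rng = np.random.default_rng(seed)
    # Assumption: ages are rounded to whole years before clipping.
    ages = np.rint(rng.normal(mean_age, age_std_dev, n)).astype(int)
    times = rng.normal(mean_time, time_std_dev, n)
    return pd.DataFrame({
        "ID": np.arange(first_id, first_id + n),
        "Year": year,
        "Age": np.maximum(min_age, ages),            # nobody younger than min_age
        "Time": np.maximum(min_time, times).round(2),  # nobody faster than min_time
    })


new_runners = make_new_participants(first_id=1, n=50, year=2023, seed=0)
print(new_runners.head())
```

Building the frame from a dictionary gives every column a numeric dtype from the start, which keeps `describe()` reporting means and standard deviations.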

Run the code in Google Colab - Happy coding!
