Posts

Image
Automating Data Import from CSV to DuckDB Learn how to efficiently import multiple CSV files into a DuckDB database using Python. Overview This guide will walk you through the process of writing a Python script to automatically loop through a list of CSV files in a directory and import them into a DuckDB database. This method is particularly useful for consolidating data storage and simplifying data management. Step 1: List All CSV Files Use Python's os module to list all files in the directory and filter to include only CSV files. csv_files = [f for f in os.listdir(directory_path) if f.endswith('.csv')] Step 2: Iterate and Load Each CSV For each CSV file, use DuckDB's COPY command to load the file into a table. for file in csv_files: table_name = os.path.splitext(file)[0] file_path = os.path.join(directory_path, file) conn.execute(f"COPY {table_name} FROM '{file_path}' (HEADER)") Step 3: Complete Python Script Here's...

Using DuckDB as a Centralized Data Storage Solution

Image
Using DuckDB as a Centralized Data Storage Solution Using DuckDB as a Centralized Data Storage Solution Transitioning from multiple CSV files to DuckDB for data storage and management. Step 1: Import CSV Data into DuckDB Start by importing your existing CSV data into DuckDB. DuckDB can load data directly from CSV files and infer the schema automatically. import duckdb conn = duckdb.connect('my_data.duckdb') conn.execute("COPY my_table FROM 'path/to/your/csvfile.csv' (HEADER)") print(conn.execute("SELECT * FROM my_table LIMIT 10").fetchall()) Step 2: Insert New Data For new data, insert it directly into the DuckDB table: conn.execute("INSERT INTO my_table SELECT * FROM ?", (new_data,)) Step 3: Query and Analyze Use SQL queries for data analysis, which is more efficient compared to handling CSV files: ...

Simulating 5k Runner's Data

Image
Introduction In this tutorial, we'll explore how to create a simulated dataset for a running event that spans multiple years using Python and the powerful pandas library. This step-by-step guide is designed for beginners and will help you understand the process of data simulation and manipulation using Python. Scenario Imagine you're tasked with simulating a 5k running event that takes place annually for five years, from 2023 to 2028. Each year, the event sees participants of various ages and running times, and your goal is to generate a dataset that represents this dynamic event. Nuances, if a person participated in a previous year, we don't want an random age, we want their previous age. The age minimum is 16 and the assumed fastes time is 15 minutes - flat! For anyone who ran a 5k, you know this is fast! Setting Up Before we dive into the code, make sure you have Python and the necessary libraries installed. You can use a Jupyte...

5 Year Journal

Image
Creating a 5 Year Journal with Python and ReportLab Creating a 5 Year Journal with Python and ReportLab If you're new to Python or ReportLab and interested in creating PDF documents, this tutorial is for you. We'll walk through the process of creating a "5 Year Journal" PDF using Python and the ReportLab library. This journal will have a page for each day of the year, with sections to write entries for each of the next five years. Setting Up Your Environment First, ensure you have Python installed on your computer. Then, install ReportLab, which is a powerful and versatile PDF generation library. You can install it using pip: pip install reportlab Starting the Project Begin by importing the necessary modules from ReportLab, along with Python's datetime module for handling dates: from reportlab.lib.pagesizes import letter from reportlab.pdfgen import canvas from datetime import datetime, timedelta Creating the PDF File Set up t...

Grants.gov data analysis

Image
 Grants.Gov Data Download and Analysis Project Description A friend approached me with a data task that required 1) downloading a dataset from Grants.gov, 2) cleaning the data for preparation, and 3) providing starter code so another group could query the data and come to general conclusions. To make the project more manageable, the project was first broken up into two general activities.  The first group listed everything that could be done without thought- accessing and cleaning the data.  The second group of tasks was more interactive - filtering and searching the data for insight.  As a result,  I like to give the user more power to detail where they store the raw data and where they would like to store data for archiving.  Some folks skip the project location space, but I don't like to leave it up to change.  I like to be explicit whenever possible.   Below is the "main" program.  I don't use Jupyter Notebooks, but "cells" in VS Cod...

Course Data

Image
Generating and Manipulating Random Data in Pandas Generating and Manipulating Random Data in Pandas Creating a DataFrame with random data is a common task in data analysis for testing and simulation. This example demonstrates the generation and manipulation of a Pandas DataFrame that represents a dataset of college courses. The dataset includes course prefixes, locations, modes, and a mapping for course names and numbers. The first method in the provided code employs Python's random module to generate random data for the DataFrame. This module is crucial for simulating real-world variability in datasets. In our example, random.choice() is used to randomly select elements from predefined lists. Specifically, the course prefixes are chosen from a list of department abbreviations (pre_list), and the location for each course is randomly selected from a list of campus locations. This random selection process is repeated for each row of the DataFrame, crea...

Pandas str

Image
Exploring String Operations in Pandas with Movie Genres Exploring String Operations in Pandas with Movie Genres In data analysis, string manipulation is a frequent requirement, and Pandas offers powerful tools to make this task more efficient and intuitive. In this post, we delve into the .str accessor in Pandas, utilizing a dataset of movie genres to showcase its capabilities. Dataset Overview The dataset used in this example contains information about movies, specifically their genres. We'll be using Pandas to manipulate and extract information from the 'genres' column. Here's a preview of the dataset: import pandas as pd import numpy as np loc = 'https://raw.githubusercontent.com/aew5044/Python---Public/main/movie.csv' m = pd.read_csv(loc) pd.set_option('display.max_columns', None) pd.options.display.min_rows = 10 m.head() String Operations Let's explore various string operations usi...