Using DuckDB as a Centralized Data Storage Solution

December 02, 2023

This post walks through moving from multiple CSV files to DuckDB for data storage and management.

Step 1: Import CSV Data into DuckDB

Start by importing your existing CSV data into DuckDB. DuckDB can load data directly from CSV files and infer the schema automatically.


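        import duckdb

        # Open (or create) the single-file database
        conn = duckdb.connect('my_data.duckdb')

        # Create the table from the CSV; read_csv_auto infers the header and column types
        conn.execute("CREATE TABLE my_table AS SELECT * FROM read_csv_auto('path/to/your/csvfile.csv')")

        print(conn.execute("SELECT * FROM my_table LIMIT 10").fetchall())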

Step 2: Insert New Data

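For new data, insert it directly into the DuckDB table. Table names cannot be bound as query parameters, so register the new data (assumed here to be a pandas DataFrame named new_data with the same columns as my_table) as a view and select from it:


        conn.register('new_data_view', new_data)
        conn.execute("INSERT INTO my_table SELECT * FROM new_data_view")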

Step 3: Query and Analyze

Use SQL queries for data analysis. Aggregations run inside DuckDB's columnar engine, which is typically more efficient than re-parsing CSV files for every analysis:


        # fetchdf() returns the result as a pandas DataFrame (requires pandas)
        result = conn.execute("SELECT column1, AVG(column2) FROM my_table GROUP BY column1").fetchdf()
        print(result)

Step 4: Regular Data Operations

Perform regular data operations like updates, deletions, or schema alterations using SQL commands.

Step 5: Backup and Maintenance

Regularly back up your DuckDB file to ensure data safety. The single-file format of DuckDB simplifies this process.

By adopting DuckDB, you gain advantages in data integrity, query capability, and efficiency, making it ideal for complex and varied data in research environments.
