
Grants.Gov Data Download and Analysis

Project Description

A friend approached me with a data task that required 1) downloading a dataset from Grants.gov, 2) cleaning the data for analysis, and 3) providing starter code so another group could query the data and draw general conclusions. To make the project more manageable, it was first broken into two general groups of activities. The first group contained everything that could be done without human judgment: accessing and cleaning the data. The second group of tasks was more interactive: filtering and searching the data for insight.

As a result, I like to give the user control over where the raw data is stored and where data should go for archiving. Some folks skip defining the project location, but I don't like to leave it up to chance. I like to be explicit whenever possible.

Below is the "main" program. I don't use Jupyter Notebooks; instead, "cells" in VS Code are defined by # %% markers, which let an individual run chunks of code and see the results in an Interactive Window. It follows the same principles as a Jupyter Notebook but without the look of a Notebook. This format makes it easy for others to walk through the code and explore the project.
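The original snippet is not reproduced here, so here is a minimal sketch of that main program using the # %% cell markers. The folder names, the temp-dir base path, and the imported function names are assumptions drawn from the description that follows:

```python
# %% Define the project folders.  The base path here is a placeholder --
# in the real project it would point at a permanent location of your choosing.
import os
import tempfile

py_space = os.path.join(tempfile.gettempdir(), "grants_gov_project")
download_space = os.path.join(py_space, "downloads")
archive_space = os.path.join(py_space, "archive")

# Create the folders up front so later steps can assume they exist.
for folder in (py_space, download_space, archive_space):
    os.makedirs(folder, exist_ok=True)

# %% Import the user-defined helpers (names are assumptions based on the
# prose; the real module lives alongside this script):
# from download_and_clean_raw_data import download_data, xml_to_df, clean_df
```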

A quick note on the level of detail in these descriptions. Python is open source with extensive official and unofficial documentation. This space is used to look at a project from a high level and see how each piece interacts as a system. If you have a question about a specific function, method, or similar, please search the official documentation, and you will find your answer. That being said, let us dig into the main program.

After we define the folders, user-defined functions are imported from "download_and_clean_raw_data." This is where all the code is stored to complete the automated tasks. Each task was chunked into smaller subtasks. I started with the first task: downloading the data.

The File Download

This process was interesting because the data is not stored behind a static link. Nope, it is stored under a dynamic file name, where each file name contains the report date. As a result, the Python code has to look at today's date, format it to match the file-name convention, and build the file name before requesting the data. I knew I would need a few other variables, so I created a user-defined module called global_variables, as seen below.
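The global_variables module is not shown, but the idea can be sketched like this. The exact file-name pattern and URL are assumptions; the point is that the name is built dynamically from the report date:

```python
# global_variables.py -- a sketch.  The base URL and the file-name
# template are illustrative placeholders, not the confirmed Grants.gov
# values; only the date-driven naming idea comes from the write-up.
from datetime import date

# Format today's date the way the remote file name expects (YYYYMMDD).
today_date = date.today().strftime("%Y%m%d")

base_url = "https://www.example.gov/extract/"  # placeholder
zip_file_name = f"GrantsDBExtract{today_date}v2.zip"
url = base_url + zip_file_name
```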

The next user-defined function is the download function with all the logic. The function accepts the URL, the project location (py_space), the downloads folder (download_space), and today's date in the correct format (today_date). I first define five different variables: xml_file_name, xml_file_path, zip_file_name, download_file_path, and zip_path. I use the "os" module to build file paths that work across operating systems - the same code runs on Mac or Windows.

The next thing I do is start with the logic. If the download file does not exist, make a request to the website, download the file, and print "Downloading zip file." If it is already downloaded, "Zip file was already downloaded for today" is printed. The print statements could be skipped in a fully automated system. However, this project is made for a human to run and review, so feedback is desirable to keep the user updated on the system's status.
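A sketch of that download function under the assumptions above. The standard library's urllib stands in for whatever HTTP client the project actually uses, and the file-name pattern (plus the exact role of zip_path) is illustrative:

```python
import os
import urllib.request


def download_data(url, py_space, download_space, today_date):
    """Fetch today's zip extract unless it is already on disk.

    Variable names follow the write-up; the file-name pattern and the
    choice of urllib are assumptions.
    """
    xml_file_name = f"GrantsDBExtract{today_date}v2.xml"
    xml_file_path = os.path.join(py_space, xml_file_name)
    zip_file_name = f"GrantsDBExtract{today_date}v2.zip"
    download_file_path = os.path.join(download_space, zip_file_name)
    zip_path = os.path.join(py_space, zip_file_name)  # archived copy (role assumed)

    if not os.path.exists(download_file_path):
        print("Downloading zip file")
        urllib.request.urlretrieve(url, download_file_path)
    else:
        print("Zip file was already downloaded for today")
    return download_file_path, xml_file_path
```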

The next step is to check whether the extracted data already exists. In this project, the extracted file is saved to the py_space (I'm a data analyst; I keep everything). Once the XML is extracted, call the user-defined module "xml_to_df" and give it the xml_file_path. That is where we extract the data. The code below reads all the data based on the pre-defined file structure.
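The extraction step can be sketched as follows. The XML tag handling here is generic; the real extract has its own pre-defined structure that the project code mirrors:

```python
import os
import xml.etree.ElementTree as ET
import zipfile

import pandas as pd


def extract_xml(download_file_path, py_space, xml_file_path):
    """Unzip the daily extract into py_space if the XML is not already there."""
    if not os.path.exists(xml_file_path):
        with zipfile.ZipFile(download_file_path) as zf:
            zf.extractall(py_space)


def xml_to_df(xml_file_path):
    """Parse repeated record elements into a DataFrame.

    Treats each direct child of the root as one record and each of its
    children as a column -- a simplification of the real structure.
    """
    root = ET.parse(xml_file_path).getroot()
    records = [{child.tag: child.text for child in node} for node in root]
    return pd.DataFrame(records)
```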

The product of xml_to_df is not ready for data analysis yet.  Dates need formatting, objects must be transformed into categories, and other variables must be created.  As a result, we feed "df" to another user-defined function: 

As we can see, we convert dates to date types, save memory by assigning more efficient numeric types, and convert strings/objects to categories.
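A minimal sketch of that cleaning function; the column names are illustrative assumptions, not the extract's actual fields:

```python
import pandas as pd


def clean_df(df):
    """Tidy the raw frame: dates to datetimes, numerics downcast,
    remaining object columns converted to categories."""
    df = df.copy()
    # Convert date-like columns to real datetime types.
    for col in ("PostDate", "CloseDate"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    # Downcast numeric columns to save memory.
    for col in ("AwardCeiling", "AwardFloor"):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce", downcast="float")
    # Whatever is still an object becomes a category.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category")
    return df
```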

Up to this point, all of the automated requirements are completed. We have our data ready for analysis. However, I wanted to go further than expected, creating a simple report demonstrating the power of Python. As a result, I looked to the Python module ReportLab to create a one-page PDF that showed a dynamic paragraph, a line chart, and a table.

Let us start with the line graph, a.k.a. the series chart. Graphing is a mix of "starting with the end in mind" and creating as you go, resulting in a process that is both rigid and fluid. Below is the finished product, but there was a mix of knowing where I wanted to go and changing things as I went.

There are several things I would continue to change, but this is where it landed - it did the job.  Below is the code that created the chart:

There are several things I like about this code. But the thing I like the most is the dynamic month selector.  

Now that the line graph is set, we need to start with the next item, the first paragraph.  I work in education, so I created a generic paragraph as a filler:

I wanted to demonstrate how we can dynamically insert values into a paragraph.  This was accomplished with an "f" string.  

The DataFrame "DE" represents data for The Department of Education. We use it to group and sum by month. Before we get too far into the project, I want to show you the finished project

The last interesting piece to delve into is the bottom table created with the following code:

We can feed a DataFrame, converted to a list of lists, to the Table constructor (e.g., table = Table(data)) and then define and apply a style with the TableStyle constructor. This is very intuitive, and there is ample documentation to help you define different styles to your heart's content. Not all of the code for this project is shown as snippets in this blog - that would be too much. However, the data extraction and cleaning code is available on Google Colab, and the entire project can be downloaded from my GitHub space.

