Grants.Gov Data Download and Analysis
Project Description
A friend approached me with a data task that required 1) downloading a dataset from Grants.gov, 2) cleaning the data, and 3) providing starter code so another group could query the data and draw general conclusions. To make the project more manageable, it was first broken into two groups of activities. The first group included everything that could be done without much thought - accessing and cleaning the data. The second group of tasks was more interactive - filtering and searching the data for insight.
As a result, I like to give the user control over where they store the raw data and where they would like to archive data. Some folks skip specifying the project location, but I don't like to leave it up to chance. I like to be explicit whenever possible.
Below is the "main" program. I don't use Jupyter Notebooks, but VS Code supports "cells" delimited by # %% markers, where an individual can run chunks of code and see the results in an Interactive Window. It follows the same principles as a Jupyter Notebook but without the look of a notebook. This format makes it easy for others to walk through the code and explore the project.
A quick note on the level of detail in these descriptions. Python is open source with extensive official and unofficial documentation. This space is used to look at the project from a high level and see how each piece interacts as a system. If you have a question about a specific function or method, search the official documentation and you will find your answer. That being said, let us dig into the main program.
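The original listing appears as an image on the blog, so here is a minimal sketch of its shape. The folder paths and the function names download_zip_file and clean_df are placeholders for illustration; xml_to_df and the module name download_and_clean_raw_data come from the write-up below.

```python
# %% Folder configuration - explicit so the user controls every location
from pathlib import Path

PROJECT_FOLDER = Path.home() / "grants_gov_project"   # hypothetical paths
RAW_DATA_FOLDER = PROJECT_FOLDER / "raw_data"
ARCHIVE_FOLDER = PROJECT_FOLDER / "archive"

# Create the folders up front so later steps never fail on a missing path.
for folder in (RAW_DATA_FOLDER, ARCHIVE_FOLDER):
    folder.mkdir(parents=True, exist_ok=True)

# %% Import the user-defined functions that do the automated work
from download_and_clean_raw_data import download_zip_file, xml_to_df, clean_df

# %% Download today's extract (skipped if it already exists), parse, and clean
download_zip_file(RAW_DATA_FOLDER)
df = xml_to_df(RAW_DATA_FOLDER)
df = clean_df(df)

# %% Explore the cleaned data interactively from here
df.head()
```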
After we define the folders, user-defined functions are imported from "download_and_clean_raw_data." This module stores all the code for the automated tasks. Each task was chunked into smaller subtasks. I started with the first one: downloading the data.
The File Download
This process was interesting because the data is not stored at a static link. Nope, it is stored with a dynamic file name, where each file name contains the report date. As a result, the Python project needed to look at today's date, format it to match the file-name convention, and construct the file name before requesting the data. I knew I would need a few other variables, so I created a user-defined module called global_variables, as seen below.
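The module itself is shown as an image on the blog; here is a minimal sketch of what it might contain. The exact file-name pattern and the download URL are illustrative assumptions based on the description above.

```python
# global_variables.py - values shared across the project
from datetime import date

TODAY = date.today()

# The extract file name embeds the report date, so it changes every day.
# The pattern and base URL below are assumptions for illustration.
FILE_NAME = f"GrantsDBExtract{TODAY:%Y%m%d}v2.zip"
DOWNLOAD_URL = f"https://www.grants.gov/extract/{FILE_NAME}"
```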
The next step is the download logic. If the file does not already exist, the code sends a request to the website, downloads the file, and prints "Downloading zip file." If it has already been downloaded, "Zip file was already downloaded for today" is printed instead. The print statements could be skipped in a fully automated system. However, this project is made for a human to run and review, so feedback is desired to keep the user updated on the system's status.
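In sketch form, that logic reads something like this, using the hypothetical names from global_variables above:

```python
# download_and_clean_raw_data.py (excerpt)
from pathlib import Path

import requests

from global_variables import DOWNLOAD_URL, FILE_NAME


def download_zip_file(raw_data_folder: Path) -> None:
    """Download today's extract unless it is already on disk."""
    zip_path = raw_data_folder / FILE_NAME
    if zip_path.exists():
        print("Zip file was already downloaded for today")
        return
    print("Downloading zip file")
    response = requests.get(DOWNLOAD_URL, timeout=120)
    response.raise_for_status()
    zip_path.write_bytes(response.content)
```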
The product of xml_to_df is not ready for data analysis yet. Dates need formatting, objects must be transformed into categories, and other variables must be created. As a result, we feed "df" to another user-defined function:
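A sketch of that second function, with column names invented purely for illustration (the real extract uses its own schema):

```python
import pandas as pd


def clean_df(df: pd.DataFrame) -> pd.DataFrame:
    """Format dates, convert objects to categories, and derive new columns."""
    # Hypothetical date columns - parse text into proper datetimes.
    for col in ("PostDate", "CloseDate"):
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # Low-cardinality text columns are cheaper and faster as categories.
    for col in ("AgencyCode", "OpportunityCategory"):
        df[col] = df[col].astype("category")

    # Example derived variable: how long each opportunity stays open.
    df["DaysOpen"] = (df["CloseDate"] - df["PostDate"]).dt.days
    return df
```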
There are several things I would continue to change, but this is where it landed - it did the job. Below is the code that created the chart:
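That snippet is also an image on the blog, so the reconstruction below only shows the general idea: a matplotlib bar chart of opportunities per month, assuming the cleaned DataFrame df from above. The column name PostDate and the six-month window are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumes the cleaned DataFrame `df` produced by clean_df above.
# The month window is computed from today's date rather than hard-coded,
# so the chart stays current every time the script runs.
months_back = 6
post_month = df["PostDate"].dt.to_period("M")
cutoff = pd.Timestamp.today().to_period("M") - months_back
monthly_counts = post_month[post_month > cutoff].value_counts().sort_index()

ax = monthly_counts.plot(kind="bar")
ax.set_xlabel("Month posted")
ax.set_ylabel("Number of opportunities")
plt.tight_layout()
plt.savefig("opportunities_by_month.png")
```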
There are several things I like about this code, but the thing I like most is the dynamic month selector.
We can feed a DataFrame to the Table constructor as a list of lists (e.g., table = Table(data)) and then define and apply a style with the TableStyle constructor. This is very intuitive, and there is ample documentation to help you define different styles to your heart's content.
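To make that concrete, here is a small sketch using ReportLab's Table and TableStyle; the styling commands are generic examples, not the project's exact styles, and df is again assumed to be the cleaned DataFrame.

```python
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

# Take a small slice of the cleaned DataFrame so the table fits on a page.
summary = df.head(20)
data = [summary.columns.tolist()] + summary.values.tolist()

table = Table(data)
table.setStyle(TableStyle([
    ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),  # shade the header row
    ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),    # bold header text
    ("GRID", (0, 0), (-1, -1), 0.5, colors.black),      # thin cell borders
]))

doc = SimpleDocTemplate("grants_summary.pdf", pagesize=letter)
doc.build([table])
```

Not all the code for this entire project is shown as snippets in this post - it would be too much. However, the data extraction and cleaning code is available on Google Colab, and the entire project can be downloaded from my GitHub space.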