Exploratory Data Analysis

Exploratory Data Analysis (EDA)

I’m at a new analyst job, and I’m given access to a new database. What next? Do I wait for a data request?  Do I wait for a random task from my boss? Do I review established reports? 

What do I do?

I just started a new job in February of 2022. Every day I continually see new datasets, inherit established code, and update old code.  I like to ask open questions:

  1. What does each row represent?
  2. What are the main “filter” columns?
  3. What are the data types?
  4. How did the last analyst handle the data?

Each list of questions and associated processes are unique to each person and the tasks at hand.  By now, this process is second nature. The purpose of this blog is to learn about Pandas. How would I approach this in Pandas? Let's dig in!

Assumptions, I have access to the table. In this example, m associated with movies as a DataFrame 

First, I want to look at everything. The describe method is a method to quickly look at everything in the DataFrame.  The include='all' option tells Pandas to look at all continuous and categorical columns.  By default, the method only looks at continuous data (e.g., int/numerical data). The .T transforms the axis for readability.  Finally, the sort_values('unique') sorts the values by the unique column. 


Next, I want to review two columns. In this example, duration and title_year are called, and we review the mean() and sort the values by the outcome. The result produces a series.  Can we get it back to a DataFrame - yes we can with to_frame()!



Fortunately, “there is a package for that!” the pandas_profiling package that creates a report detailing descriptive statistics, metadata, and just about everything you would like to explore a DataFrame.  It is not standard, so you may need to install it.  This is an amazing report! It calculates all the basic stats, and packages the report into an HTML file. 

#install pandas-profiling as pp
report = pp.ProfileReport(m)

I'm an analyst, not a software developer. So, I need efficient ways to produce reports that are easily shared with others. This is about as easy as it gets for sharable EDA!

*File name http://localhost:8888/notebooks/0000_Set_up.ipynb;

However, it does not meet the needs of this space - Learn Pandas.  So, let's look at a specific example combining group by and then aggregate.  

For example, with movies data, we want to group by rating (e.g., PG, PG-13, R, etc.) and count the non-missing rows, sum the budget for all their movies, then get the min, max, median, and mean budget for all their movies.  That is a specific report. Theeeen, get the min, max, and sum for gross.  We can do that simply with Pandas! 

Below is the way we approach the request by chaining


We can see where we detail the budget column and indicate each measure.  Next, we then tell Pandas to look at the gross column and detail the min, max, and total sum for the budget.  I did not like the One thing I did for the one is used the format function.  This is very interesting.  We tell Python to format all float columns/series/variables with 

pd.options.display.float_format = '${:,.0f}'.format

The format looks very powerful! But for now, we are going to leave it alone and leave this topic for another day! 


Popular posts from this blog

Recode with np.select

Python Topic Roadmap