Recreate PROC FREQ in Python

September 10, 2023

Mastering Descriptive Tables in Python: A Nod to SAS's PROC FREQ

The art of data analysis often begins with understanding the landscape of your dataset. In SAS, PROC FREQ has long been the de facto tool for generating descriptive tables. Python, via its Pandas library, offers comparable functionalities albeit with a different approach. This post aims to bridge the gap between SAS's PROC FREQ and Python's Pandas, focusing on generating descriptive tables that are replete with counts, row percentages, column percentages, and overall percentages.

A Quick Dive into SAS's PROC FREQ

Before we dive into Python, let's understand what PROC FREQ in SAS is capable of. This procedure is immensely powerful for categorical data analysis. It provides easy ways to calculate counts, percentages, and additional statistics with simple syntax. For instance, one might execute:

  PROC FREQ DATA=dataset;
    TABLES variable1 * variable2;
  RUN;

The output is an organized set of tables showing the counts and percentages, sometimes with additional statistics like Chi-square values.

Replicating PROC FREQ in Python

Python's Pandas library may not have a built-in equivalent to PROC FREQ, but its arsenal of data manipulation methods allows us to achieve similar results. Let's discuss the step-by-step process.

Step 1: Count Frequencies with Pandas Crosstab

The first step in our Python approach is to use the crosstab function to generate a count table:

  count_table = pd.crosstab(df['Group'], df['Value'], margins=True, margins_name="Total")

This provides us with raw counts similar to the frequency output in SAS.

Step 2: Percentages by Group

The second step involves calculating the percentages by group, which are often referred to as row percentages:

  percent_row = pd.crosstab(df['Group'], df['Value']).apply(lambda r: r/r.sum(), axis=1)

Step 3: Percentages by Value

Another useful metric is the column percentage, which can be calculated as follows:

  percent_col = pd.crosstab(df['Group'], df['Value']).apply(lambda c: c/c.sum(), axis=0)

Step 4: Overall Percentages

The final step involves calculating the overall percentages across the entire dataset:

  overall_percent = pd.crosstab(df['Group'], df['Value']).apply(lambda x: x / x.sum().sum())

Step 5: Creating the Final Table

With all these individual pieces in place, we can concatenate them into a comprehensive summary table:

  final_table = pd.concat([count_table, percent_row, percent_col, overall_percent], keys=['Count', 'Row%', 'Column%', 'Overall%'])

Automating the Process: A Custom Function

If you find yourself generating these tables frequently, a custom function can streamline this process. Below is an example function named proc_freq that encapsulates the logic discussed above:

  def proc_freq(df, group_col, value_col):
    ...
  print(proc_freq(df, 'Group', 'Value'))

To use this function, simply import it into your code and pass in your DataFrame along with the relevant columns for grouping and value. The function will return a Pandas DataFrame resembling a PROC FREQ output, equipped with counts and percentages.

Comparison with SAS's PROC FREQ

While SAS provides a compact, one-stop-shop for frequency tables, Python's Pandas allows for a more tailored approach, albeit at the cost of slightly verbose code. In Pandas, you have full control over the calculations and the output format, making it easier to integrate into a broader data analysis workflow.

Conclusion

Both SAS and Python offer compelling solutions for generating descriptive tables. SAS's PROC FREQ excels in convenience and compactness but may lack the customization and extensibility that Python provides. With Python's Pandas, you can achieve similar results, especially when you encapsulate the logic in a custom function like proc_freq.

The transition from SAS to Python need not be a leap into the unknown. Understanding how to replicate essential tasks like generating descriptive tables can smooth the transition and enrich your data analysis toolkit.

Google Colab Example

Search This Blog

Data Analytics With Python