Comparing DataFrames

July 06, 2023

Comparing Relationships in Dataframes with Python

If you work with data, chances are you will need to compare different datasets to each other. One specific task could be to compare the relationships between categorical variables in different datasets. In this blog post, we will use Python and its powerful libraries such as Pandas and NumPy to accomplish this task.

What are we trying to achieve?

Imagine you have two datasets, a 'base' and a 'comparison' dataset. Each dataset has the same categorical variables, for example 'degree_type' and 'acad_degree'. You want to check if the relationship between these variables is the same in both datasets. For instance, if in the base dataset 'AAS' always corresponds to 'Associates', you want to ensure the same applies to the comparison dataset.

How do we do it?

We start by creating a function, let's call it 'compare_relationships'. This function will take as input two dataframes and the names of two variables we want to compare. The function will then create contingency tables for each dataset using the Pandas 'crosstab' function.

    
    def compare_relationships(base_df, compare_df, var1, var2):
        # Create crosstabs for each dataframe
        contingency_base = pd.crosstab(base_df[var1], base_df[var2])
        contingency_compare = pd.crosstab(compare_df[var1], compare_df[var2])

Next, it will retrieve the unique categories of the first variable from both datasets, and for each unique category, it will get the corresponding categories from the second variable. Finally, it will check if these sets of categories are the same in both datasets.

    
    for cat in all_var1_cats:
        base_var2_cats = set(contingency_base.loc[cat][contingency_base.loc[cat]>0].index) if cat in base_var1_cats else set()
        compare_var2_cats = set(contingency_compare.loc[cat][contingency_compare.loc[cat]>0].index) if cat in compare_var1_cats else set()

        # If the relationships are not the same, add them to the list
        if base_var2_cats != compare_var2_cats:
            different_relationships.append((cat, base_cats, compare_cats))

Now, we have a function that can take two dataframes and two variables, and return a list of categories for which the relationships in the two dataframes are different.

Printing the results

Finally, we can use a simple loop to print out these differences. We add checks to see if a category is not present in one of the dataframes.

    
    for rel in diff_relationships:
        degree, base_cats, compare_cats = rel
        if not base_cats:
            print(f"For '{degree}', this category was not present in the base data.")
        elif not compare_cats:
            print(f"For '{degree}', this category was not present in the comparison data.")
        else:
            print(f"For '{degree}', base categories are {base_cats'}, but comparison categories are {compare_cats}.")

Conclusion

By using Python's powerful data manipulation libraries like Pandas, we can write concise code to compare relationships in dataframes. This approach is flexible and doesn't require hardcoding any relationships. By systematically going through each category in a variable, we can determine how it relates to another variable in different datasets and highlight any discrepancies.

Search This Blog

Data Analytics With Python