I am using a chi-square test to compare the distributions of two categorical variables. Both have the same number of classes. After counting each class per variable, I obtain very similar counts, but the p-value of the chi-square test is 0, so the null hypothesis is rejected. I am not sure what I am missing.

Here is the code:

    import numpy as np
    from scipy.stats import chi2_contingency

    var1_arr = np.array([361837, 94360, 1533308])  # counts per class for var 1
    var2_arr = np.array([355572, 93285, 1544745])  # counts per class for var 2

    observed_counts = np.vstack((var1_arr, var2_arr))

    # Given class counts (this overrides the array built above)
    observed_counts = np.array([[361837, 94360.67, 1533308.67],
                                [355572, 93285, 1544745]])

    # Calculate expected frequencies
    # (this just reproduces observed_counts and is not used below;
    # chi2_contingency computes the expected frequencies itself)
    N = observed_counts.sum()
    expected_counts = (observed_counts / N) * N

    # Perform chi-square test
    chi2, p_value, dof, expected = chi2_contingency(observed_counts)

    print(f"Chi-Square Statistic: {chi2:.4f}")
    print(f"P-value: {p_value:.4f}")

The result is:

    Chi-Square Statistic: 99.1516
    P-value: 0.0000

crbl

1 Answer

You have a huge number of observations (almost four million in total). Consequently, the test has considerable statistical power and is highly sensitive to even small differences between the two distributions.
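To illustrate the role of sample size, here is a rough sketch that runs the same test on the counts from the question and on a scaled-down copy of them. The 1/1000 scaling and the name `small_counts` are assumptions made purely for illustration; they stand in for a hypothetical smaller study with nearly the same class proportions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the question
observed_counts = np.array([[361837, 94360.67, 1533308.67],
                            [355572, 93285, 1544745]])

# Hypothetical smaller study (assumption): same proportions,
# roughly 1/1000 of the observations, rounded to whole counts
small_counts = np.round(observed_counts / 1000)

chi2_full, p_full, _, _ = chi2_contingency(observed_counts)
chi2_small, p_small, _, _ = chi2_contingency(small_counts)

print(f"full sample:  chi2 = {chi2_full:.2f}, p = {p_full:.3g}")
print(f"1/1000 scale: chi2 = {chi2_small:.2f}, p = {p_small:.3g}")
```

With the proportions held essentially fixed, the statistic shrinks by roughly the same factor as the sample, and the scaled-down table no longer gives a significant result at conventional levels.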

You are within your rights to say that these differences lack practical significance, but a tiny p-value is not surprising.
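One way to put a number on practical significance is to report an effect size next to the p-value. The sketch below uses Cramér's V, which is my own choice of measure here and only one of several options; it also prints the p-value in scientific notation, since `{p_value:.4f}` rounds any very small value to `0.0000`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the question
observed_counts = np.array([[361837, 94360.67, 1533308.67],
                            [355572, 93285, 1544745]])

chi2, p_value, dof, expected = chi2_contingency(observed_counts)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = observed_counts.sum()
k = min(observed_counts.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

# Class proportions within each variable, for a direct comparison
row_props = observed_counts / observed_counts.sum(axis=1, keepdims=True)

print(f"p-value: {p_value:.3e}")        # tiny, but not exactly zero
print(f"Cramer's V: {cramers_v:.4f}")   # values near 0 mean a weak association
print(row_props.round(4))
```

A V this close to zero, together with class proportions that differ by well under one percentage point, supports the reading that the difference is statistically detectable but small.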

Related question: Significance test for large sample sizes

Dave