Performing Chi Square Test in Python
Checking the correlation between variables is an important step in data science. It helps determine which features are correlated with the target feature, and whether the features are correlated with each other.
The Chi Square test is one of the methods commonly used to test the relationship between variables. It checks whether two categorical variables are related. For example, is there any relationship between age range and marriage status?
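For numerical variables, Pearson correlation can be calculated directly with the numpy module, using np.corrcoef. For the Chi Square test, this article does not rely on a single ready-made call and instead works through the calculation by hand. (SciPy does provide scipy.stats.chi2_contingency, which performs the whole test in one call.)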
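For comparison, the Pearson case mentioned above takes a single call. This is a minimal sketch, assuming two hypothetical numeric arrays x and y that are not part of the article's data:

```python
import numpy as np

# Hypothetical numeric data: y is an exact linear function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry [0, 1] is the Pearson correlation of x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))
```

Because y depends linearly on x with a positive slope, the coefficient comes out at 1 (up to floating point rounding).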
The Chi Square test first calculates a chi square value from a formula. That value is then used to calculate a p-value, which determines whether the two variables are related or not.
So, let's walk through the formula for the chi square value and implement it in Python to check the relationship between categorical variables.
The full Python script and the dummy data used are available at:
https://github.com/WahyuNoorIchwan/DataScience/tree/main/Chi%20Square%20Test
Steps in Chi Square Test
There are several steps to calculate chi square value.
- Create a cross tab matrix, which can be done using the pandas module. A cross tab matrix shows how many times each combination of categories occurs in the data.
- The picture below is an example of a cross tab matrix created using pandas for two features, age range and marriage status.
- The age range variable has 4 categories and the status variable has 3 categories. The count of each category pair is shown in the table.
- "Total" cells show the total of each category along the rows or columns. For example, the total for "15-19" is in the first row of the "Total" column on the right, and its value is 40. The total for "20-25" is 60.
- Cells inside the cross tab show how often each category pair occurs in the data. For example, the "31-40" age range with "Divorced" status occurs 10 times, and "Married" status in the "20-25" age range occurs 40 times.
- The values inside the cross tab matrix are put into the formula below to calculate the chi square value.
$ \chi^2 = \sum_{i,j} \frac{(O-E)^2}{E} $ where $ E = \frac{R_i \cdot C_j}{Total} $
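- O is the observed count for the pair of the i-th category of the first variable and the j-th category of the second variable. For example, the pair of the 2nd and 3rd categories is "20-25" and "Single", with a count of 20.
- R_i is the total count of the i-th category of the first variable. For example, the 1st category is "15-19" with a total of 40.
- C_j is the total count of the j-th category of the second variable. For example, the 1st category is "Divorced" with a total of 10.
- E is the expected count of a pair if the two variables were unrelated, and Total is the total number of data.
- The chi square value is then used to calculate the p-value. It can be calculated using the stats module from scipy. The full syntax is:
- p_value = 1 - stats.chi2.cdf(chi2_value, (n_row-1)*(n_col-1)), where n_row and n_col are the numbers of categories of the two variables, not counting the "Total" row and column.
- If the p-value is smaller than the threshold value, usually 0.05, the two variables are considered related.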
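The steps above can be sketched on a small made-up table. The DataFrame below is hypothetical toy data, not the article's dummy_data.csv, and the names observed, expected and dof are chosen only for illustration. Instead of a loop, this sketch applies the formula to all cells at once with NumPy broadcasting.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: 2 age ranges x 2 statuses, 60 rows in total
df = pd.DataFrame({
    "Age": ["15-19"] * 30 + ["20-25"] * 30,
    "Status": ["Single"] * 25 + ["Married"] * 5
              + ["Single"] * 10 + ["Married"] * 20,
})

# Cross tab with a "Total" row and column
ctab = pd.crosstab(df["Age"], df["Status"], margins=True, margins_name="Total")

# Observed counts O, and the totals R_i, C_j and Total from the margins
observed = ctab.iloc[:-1, :-1].to_numpy(dtype=float)
row_totals = ctab.iloc[:-1, -1].to_numpy(dtype=float)
col_totals = ctab.iloc[-1, :-1].to_numpy(dtype=float)
total = float(ctab.iloc[-1, -1])

# E = R_i * C_j / Total for every cell
expected = row_totals[:, None] * col_totals[None, :] / total

# Chi square value: sum of (O - E)^2 / E over all cells
chi2_value = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom and p-value
n_row, n_col = observed.shape
dof = (n_row - 1) * (n_col - 1)
p_value = 1 - stats.chi2.cdf(chi2_value, dof)

print(ctab)
print(chi2_value, dof, p_value)
```

For this toy table the chi square value is about 15.43 with 1 degree of freedom, so the p-value falls well below 0.05 and the two made-up variables would be reported as related.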
Perform Chi Square Test in Python
Here is the Python script used to perform the Chi Square test, followed by an explanation of each step.
1 import pandas as pd
2 from scipy import stats
3
4 # Read Sample Data
5 df = pd.read_csv("dummy_data.csv")
6
7 # Separate Categories
8 cat_1 = df["Age"]
9 cat_2 = df["Status"]
10
11 # Cross Tabulation of Categories Pair
12 ctab = pd.crosstab(cat_1, cat_2, margins=True, margins_name="Total")
13
14 # Calculate Chi Square (row and col include the "Total" margins)
15 row, col = ctab.shape
16 chi = 0
17
18 for i in range(row-1):
19     for j in range(col-1):
20         E = ctab.iloc[i, -1]*ctab.iloc[-1, j]/ctab.iloc[-1, -1]
21         chi += (ctab.iloc[i, j] - E)**2 / E
22
23 # Calculate p-value; degrees of freedom exclude the "Total" margins
24 p = 1 - stats.chi2.cdf(chi, (row-2)*(col-2))
25
26 threshold = 0.05
27 if p < threshold:
28     print("{} and {} are related".format("Age", "Status"))
29 else:
30     print("{} and {} are not related".format("Age", "Status"))
- First, import the modules used in the script. They are pandas and scipy.stats.
- Read the sample data. The data can be found in the GitHub repository linked above. The data contains two features, "Age" and "Status".
- Separate the features into cat_1 and cat_2.
- Create the cross tab using pandas, pd.crosstab(). Setting margins=True adds the total of each category and the total of all data.
- It is important to calculate the totals since the chi square formula needs them.
- Get the number of rows and columns of the cross tab. Because of margins=True, both counts include the "Total" row and column. The values are used to loop through the cross tab cells.
- Initialize the variable chi=0 to accumulate the summation in the chi square formula.
- Create a for loop over the rows and columns. The loops use the ranges row-1 and col-1 so that only cells containing category pairs are processed and the "Total" cells are skipped.
- Inside the loops, first calculate the E value of each cell.
- Calculate the chi square term of each cell and add it to the chi variable.
- The chi square value is complete when the loops end. Use it to calculate the p-value with the syntax described earlier, where the degrees of freedom are (number of age categories - 1) * (number of status categories - 1).
- The p-value calculation is on line 24 of the script.
- Set the threshold for the p-value, which is usually 0.05.
- Add a condition: if the p-value is lower than the threshold, then the variables are considered related.
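As a sanity check, a hand-computed result can be compared with SciPy's built-in scipy.stats.chi2_contingency. This is only a sketch: the 4 x 3 table of counts below is invented for illustration and is not the article's dummy data. It assumes correction=False so that no continuity correction is applied, and it uses stats.chi2.sf, which is equivalent to 1 - cdf.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: 4 age ranges x 3 statuses, without "Total" margins
observed = np.array([
    [5, 10, 25],
    [10, 30, 20],
    [10, 35, 15],
    [15, 30, 10],
])

# Manual calculation, following the formula E = R_i * C_j / Total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (categories of variable 1 - 1) * (categories of variable 2 - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# Library calculation on the same table
chi2_lib, p_lib, dof_lib, expected_lib = stats.chi2_contingency(
    observed, correction=False
)

print("manual :", chi2_manual, p_manual, dof_manual)
print("library:", chi2_lib, p_lib, dof_lib)
```

If the manual and library values agree, the degrees of freedom were counted correctly. A mismatch in the p-value alone usually points to the "Total" row and column having been included in the count.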