Chi-Square Test

The Chi-Square test is one of the most widely used techniques for hypothesis testing. Whether you’re analyzing categorical data, testing for independence between variables, or assessing goodness-of-fit, the Chi-Square test provides a robust method for validating assumptions and drawing meaningful conclusions.

This tutorial walks through the fundamentals of the Chi-Square test: its formula, its main types, its properties and limitations, and worked examples that show how to perform it.

What is the Chi-Square Test?

The Chi-Square test is a statistical procedure for determining whether observed data differ significantly from the data expected under a given hypothesis. It can also be used to decide whether two categorical variables are associated. In other words, it helps determine whether an apparent relationship between two categorical variables reflects a real association or is simply due to chance.

To test a hypothesis about the distribution of a categorical variable, you need a Chi-Square test or a comparable nonparametric test. Categorical variables, which indicate categories such as animals or countries, can be nominal or ordinal. Because they take only a few distinct values, they cannot follow a normal distribution.

Chi-Square Test Formula

$$\chi^2_c = \sum \frac{(O - E)^2}{E}$$

where

c = Degrees of freedom

O = Observed Value

E = Expected Value

The sum runs over all categories (or cells) in the data. The degrees of freedom in a statistical calculation represent the number of values that are free to vary. They must be calculated correctly for a Chi-Square test to be valid, because they determine which chi-square distribution the test statistic is compared against. These tests compare observed data with the data you would expect to obtain if a particular hypothesis were true.

  • The Observed values are those you gather yourself.
  • The Expected values are the anticipated frequencies, based on the null hypothesis.

What are the Fundamentals of Hypothesis Testing?

Hypothesis testing is a technique for drawing inferences about a population from sample data. It helps determine which of two mutually exclusive claims about the population the sample data support best.

  • Null Hypothesis (H0) - The null hypothesis is the default assumption that there is no effect or no relationship, for example that two variables are independent. It is retained unless the data provide enough evidence to reject it. Its symbol is H0, pronounced H-naught.
  • Alternate Hypothesis (H1 or Ha) - The alternate hypothesis is the logical opposite of the null hypothesis. Rejecting the null hypothesis is taken as evidence in favor of the alternate hypothesis. Its symbol is H1 or Ha.

What are the Types of Chi-Square Tests?

There are two main types of Chi-Square tests:

  1. Independence
  2. Goodness-of-Fit

1. Independence 

The Chi-Square test of independence is an inferential statistical test that examines whether two variables are likely to be related. It is used when we have counts for two nominal or categorical variables, and it is considered a non-parametric test. It requires a relatively large sample size and independent observations.

Example:

In a movie theatre, suppose we made a list of movie genres. Let us consider this as the first variable. The second variable is whether or not the people who came to watch those genres of movies bought snacks at the theatre. Here, the null hypothesis is that the genre of the film and whether people bought snacks are unrelated. If this is true, the movie genre doesn't affect snack sales.

2. Goodness-Of-Fit

In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines whether a variable is likely to come from a given distribution. We must have a set of data values and a hypothesized distribution for the data. We can use this test when we have value counts for a categorical variable. The test gives us a way of deciding whether the data values are a "good enough" fit for the hypothesized distribution, that is, whether the sample is representative of the population it is assumed to come from.

Example:

Suppose we have bags of balls with five different colours in each bag. The given condition is that each bag should contain an equal number of balls of each colour. The idea we would like to test here is that the proportions of the five colours of balls in each bag are equal.

What are the Examples of the Chi-Square Test?

1. Chi-Square Test for Independence

Example: A researcher wants to determine if there is an association between gender (male/female) and preference for a new product (like/dislike). The test can assess whether preferences are independent of gender.

2. Chi-Square Test for Goodness of Fit

Example: A dice manufacturer wants to test if a six-sided die is fair. They roll the die 60 times and expect each face to appear 10 times. The test checks if the observed frequencies match the expected frequencies.

3. Chi-Square Test for Homogeneity

Example: A fast-food chain wants to see if the preference for a particular menu item is consistent across different cities. The test can compare the distribution of preferences in multiple cities to see if they are homogeneous.

4. Chi-Square Test for a Contingency Table

Example: A study investigates whether smoking status (smoker/non-smoker) is related to the presence of lung disease (yes/no). The test can evaluate the relationship between smoking and lung disease in the sample.

5. Chi-Square Test for Population Proportions

Example: A political analyst wants to see if voter preference (candidate A vs. candidate B) is the same across different age groups. The test can determine if the proportions of preferences differ significantly between age groups.
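The die example above (60 rolls, 10 expected per face) can be sketched with SciPy's `scipy.stats.chisquare`. The observed counts below are hypothetical, chosen only for illustration, and the 0.05 significance level is an assumption; this is a minimal sketch rather than a prescribed procedure.

```python
from scipy.stats import chisquare

# Hypothetical counts for faces 1-6 over 60 rolls (illustrative data only).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6  # a fair die: 60 rolls / 6 faces

result = chisquare(f_obs=observed, f_exp=expected)
statistic, p_value = result.statistic, result.pvalue

print(f"chi-square = {statistic:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:  # assumed significance level
    print("Reject H0: the die does not look fair.")
else:
    print("Fail to reject H0: no evidence that the die is unfair.")
```

With these made-up counts the statistic is small, so the test finds no evidence against fairness.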

How to Perform a Chi-Square Test?

Let's say you want to know if gender has anything to do with political party preference. You poll 440 voters in a simple random sample to determine their preferred political party. The results of the survey are shown in the table below:

[Table: observed number of voters by gender and political party preference]

To see if gender is linked to political party preference, perform a Chi-Square test of independence using the steps below.

Step 1: Define the Hypothesis

H0: There is no link between gender and political party preference.

H1: There is a link between gender and political party preference.

Step 2: Calculate the Expected Values

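Now, you will calculate the expected frequency for each cell:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

For example, the expected value for Male Republicans is the Male row total multiplied by the Republican column total, divided by the grand total of 440.

Similarly, you can calculate the expected value for each of the cells.

[Table: expected values for each cell]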

Step 3: Calculate (O − E)² / E for Each Cell in the Table

Now, you will calculate (O − E)² / E for each cell in the table,

where

O = Observed Value

E = Expected Value

[Table: (O − E)² / E for each cell]

Step 4: Calculate the Test Statistic χ²

χ² is the sum of all the values in the last table:

χ² = 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1

   = 9.837

Before you can draw a conclusion, you must determine the critical value, which requires the degrees of freedom. For a contingency table with r rows and c columns, the degrees of freedom are (r − 1)(c − 1). Here the table has 2 rows (gender) and 3 columns (political party), so we have (2 − 1)(3 − 1) = 2.

Finally, you compare the obtained statistic to the critical value from the chi-square table. For an alpha level of 0.05 and two degrees of freedom, the critical value is 5.991, which is less than our obtained statistic of 9.837. You can reject the null hypothesis because the obtained statistic is higher than the critical value.

This means you have sufficient evidence to say that there is an association between gender and political party preference.
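The final step of the worked example can be reproduced in a few lines. The sketch below takes the six per-cell values quoted above as given and uses `scipy.stats.chi2.ppf` to look up the critical value instead of a printed table; the variable names are illustrative.

```python
from scipy.stats import chi2

# Per-cell (O - E)^2 / E values quoted in the worked example.
components = [0.743, 2.05, 2.33, 3.33, 0.384, 1]
chi_square = sum(components)

rows, cols = 2, 3            # gender x political party
df = (rows - 1) * (cols - 1)
alpha = 0.05

# Critical value: the point that leaves probability alpha in the upper tail.
critical_value = chi2.ppf(1 - alpha, df)

print(f"chi-square = {chi_square:.3f}, df = {df}")
print(f"critical value = {critical_value:.3f}")
print("Reject H0" if chi_square > critical_value else "Fail to reject H0")
```

The statistic (9.837) exceeds the critical value (about 5.991), which matches the conclusion above.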

What Are Categorical Variables?

Categorical variables are variables whose values fall into discrete categories, most commonly names or labels. They are also known as qualitative variables because they describe a quality or characteristic.

Categorical variables can be divided into two types:

1. Nominal Variable: A nominal variable's categories have no natural ordering. Examples: gender, blood group.

2. Ordinal Variable: An ordinal variable's categories can be placed in a meaningful order. An example is customer satisfaction (Excellent, Very Good, Good, Average, Bad, and so on).

Chi-Square Practice Problems

1. Voting Patterns

Problem

A researcher wants to know if voting preferences (party A, party B, or party C) and gender (male, female) are related. Apply a chi-square test to the following set of data:

  • Male: Party A - 30, Party B - 20, Party C - 50
  • Female: Party A - 40, Party B - 30, Party C - 30

Solution

To determine if gender influences voting preferences, run a chi-square test of independence.
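One way to carry this out is with `scipy.stats.chi2_contingency`, which takes the table of observed counts and returns the statistic, p-value, degrees of freedom, and expected counts. The row and column order below is an arbitrary choice for this sketch.

```python
from scipy.stats import chi2_contingency

# Rows: Male, Female. Columns: Party A, Party B, Party C.
observed = [[30, 20, 50],
            [40, 30, 30]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2_stat:.3f}, df = {dof}, p-value = {p_value:.4f}")
print("Expected counts:")
print(expected)
```

For these counts the statistic is about 8.43 on 2 degrees of freedom, with a p-value of roughly 0.015, so at the 0.05 level the data suggest that voting preference and gender are associated.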

2. State of Health

Problem

In a sample population, a medical study examines the association between smoking status (smoker, non-smoker) and the occurrence of lung disease (yes, no). The information is as follows:

  • Smoker: Yes - 90, No - 60
  • Non-smoker: Yes - 30, No - 120 

Solution

To find out if smoking status is related to the incidence of lung disease, do a chi-square test.
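For a 2x2 table like this one, SciPy's `chi2_contingency` applies Yates' continuity correction by default, so its result differs slightly from a hand calculation with the plain formula. The sketch below computes both versions so the difference is visible; which one to report is a judgment call the problem statement does not settle.

```python
from scipy.stats import chi2_contingency

# Rows: Smoker, Non-smoker. Columns: lung disease Yes, No.
observed = [[90, 60],
            [30, 120]]

# Plain Pearson chi-square (no continuity correction).
plain_stat, plain_p, dof, expected = chi2_contingency(observed, correction=False)

# SciPy's default for 2x2 tables: Yates' continuity correction.
yates_stat, yates_p, _, _ = chi2_contingency(observed, correction=True)

print(f"Pearson: chi-square = {plain_stat:.2f}, p-value = {plain_p:.2e}")
print(f"Yates:   chi-square = {yates_stat:.2f}, p-value = {yates_p:.2e}")
```

Both statistics (50.0 and about 48.35, on 1 degree of freedom) are far above the 0.05 critical value, so the conclusion is the same either way: smoking status and lung disease are associated in this sample.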

3. Consumer Preferences

Problem

A company surveys customers to determine their age group (under 20, 20-40, over 40) and their preferred product category (electronics, clothing, or food). The information gathered is:

  • Under 20: Electronics - 50, Clothing - 30, Food - 20
  • 20-40: Electronics - 60, Clothing - 70, Food - 50
  • Over 40: Electronics - 30, Clothing - 40, Food - 80

Solution

Use a chi-square test to investigate the connection between product preference and age group.
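With a 3x3 table it helps to inspect the expected counts and see which cells drive the result. The following sketch uses `chi2_contingency` together with NumPy; the check that every expected count is at least 5 reflects the usual rule of thumb, and the label lists are just for readable output.

```python
import numpy as np
from scipy.stats import chi2_contingency

age_groups = ["Under 20", "20-40", "Over 40"]
categories = ["Electronics", "Clothing", "Food"]
observed = np.array([[50, 30, 20],
                     [60, 70, 50],
                     [30, 40, 80]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Rule of thumb: every expected count should be at least 5.
assumption_ok = bool((expected >= 5).all())

# Per-cell contributions (O - E)^2 / E show where the association comes from.
contributions = (observed - expected) ** 2 / expected
i, j = np.unravel_index(contributions.argmax(), contributions.shape)

print(f"chi-square = {chi2_stat:.2f}, df = {dof}, p-value = {p_value:.2e}")
print(f"All expected counts >= 5: {assumption_ok}")
print(f"Largest contribution: {age_groups[i]} / {categories[j]}")
```

The statistic comes out at about 44.25 on 4 degrees of freedom, and the largest single contribution is from customers over 40 who prefer food.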

4. Academic Performance

Problem

An educational researcher looks at the relationship between students' success on standardized tests (pass, fail) and whether or not they participate in after-school programs. The information is as follows:

  • Yes: Pass - 80, Fail - 20
  • No: Pass - 50, Fail - 50

Solution

Use a chi-square test to determine if involvement in after-school programs and test scores are connected.
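For this 2x2 table the arithmetic is short enough to do by hand. The working below assumes the plain Pearson statistic (no continuity correction) and a 0.05 significance level, neither of which the problem specifies. The row totals are 100 and 100, the column totals are 130 (Pass) and 70 (Fail), and the grand total is 200.

$$
\begin{aligned}
E_{\text{Pass}} &= \frac{100 \times 130}{200} = 65, \qquad E_{\text{Fail}} = \frac{100 \times 70}{200} = 35 \\
\chi^2 &= \frac{(80-65)^2}{65} + \frac{(20-35)^2}{35} + \frac{(50-65)^2}{65} + \frac{(50-35)^2}{35} \\
&= 2\left(\frac{225}{65} + \frac{225}{35}\right) \approx 19.78 \\
df &= (2-1)(2-1) = 1
\end{aligned}
$$

Because both rows have the same total, the expected counts are the same in each row. Since 19.78 exceeds the critical value of 3.841 for one degree of freedom at the 0.05 level, the null hypothesis of no association would be rejected.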

5. Genetic Inheritance

Problem

A geneticist investigates how a particular trait is inherited in plants and seeks to ascertain whether the expression of a trait (trait present, trait absent) and the existence of a genetic marker (marker present, marker absent) are significantly correlated. The information gathered is:

  • Marker Present: Trait Present - 70, Trait Absent - 30
  • Marker Absent: Trait Present - 40, Trait Absent - 60

Solution

Do a chi-square test to determine if there is a correlation between the trait's expression and the genetic marker.
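This problem can also be decided with the critical-value approach instead of the p-value. The sketch below assumes a 0.05 significance level and turns off SciPy's default continuity correction so that the result matches the plain formula.

```python
from scipy.stats import chi2, chi2_contingency

# Rows: Marker Present, Marker Absent. Columns: Trait Present, Trait Absent.
observed = [[70, 30],
            [40, 60]]
alpha = 0.05  # assumed significance level

# correction=False gives the plain Pearson statistic for this 2x2 table.
statistic, p_value, dof, expected = chi2_contingency(observed, correction=False)

critical_value = chi2.ppf(1 - alpha, dof)
reject_null = statistic > critical_value

print(f"chi-square = {statistic:.2f}, critical value = {critical_value:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The statistic (about 18.18) is well above the critical value (about 3.841), so the marker and the trait appear to be associated.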

How to Solve Chi-Square Problems?

1. State the Hypotheses

  • Null hypothesis (H0): There is no association between the variables.
  • Alternative hypothesis (H1): There is an association between the variables.

2. Calculate the Expected Frequencies

  • Use the formula: $E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$

3. Compute the Chi-Square Statistic

  • Use the formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed frequency and E is the expected frequency.

4. Determine the Degrees of Freedom (df)

  • Use the formula: $df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$

5. Find the Critical Value and Compare

  • Use the chi-square distribution table to find the critical value for the given df and significance level (usually 0.05).
  • Compare the chi-square statistic to the critical value to decide whether to reject the null hypothesis.
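The five steps above translate almost line for line into code. The function below is a hypothetical helper written for this illustration (its name and return values are not from any library); it uses NumPy for the arithmetic and `scipy.stats.chi2.ppf` in place of a printed table, and it applies no continuity correction.

```python
import numpy as np
from scipy.stats import chi2


def chi_square_independence(table, alpha=0.05):
    """Apply the five steps to a table of observed counts (illustrative helper)."""
    observed = np.asarray(table, dtype=float)

    # Step 2: expected frequency = row total * column total / grand total.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()

    # Step 3: chi-square statistic.
    statistic = ((observed - expected) ** 2 / expected).sum()

    # Step 4: degrees of freedom.
    n_rows, n_cols = observed.shape
    df = (n_rows - 1) * (n_cols - 1)

    # Step 5: critical value and decision (reject H0 if statistic is larger).
    critical_value = chi2.ppf(1 - alpha, df)
    return statistic, df, critical_value, statistic > critical_value


# Voting-pattern counts from practice problem 1 (rows: Male, Female).
statistic, df, critical_value, reject_null = chi_square_independence(
    [[30, 20, 50], [40, 30, 30]]
)
print(f"chi-square = {statistic:.3f}, df = {df}, critical value = {critical_value:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Step 1, stating the hypotheses, has no code equivalent: it determines how the final decision is worded.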

These practice problems help you understand how Chi-Square analysis tests hypotheses and explores relationships between categorical variables in various fields.

When to Use a Chi-Square Test?

A Chi-Square test examines whether observed results correspond to expected values. It is the most appropriate test when the data come from a random sample and the variable in question is categorical.

A categorical variable consists of selections such as breeds of dogs, types of cars, genres of movies, educational attainment, or male versus female. Survey responses and questionnaires are the primary sources of these types of data.

The Chi-Square test is most commonly used for analysing this kind of data, which makes it helpful for researchers studying survey responses. The research can range from customer and marketing research to political science and economics.

Chi-Square Distributions

Chi-Square distributions (χ²) are a family of continuous probability distributions. They're commonly used in hypothesis testing, such as the chi-square goodness-of-fit and independence tests. The parameter k, which represents the degrees of freedom, determines the shape of a chi-square distribution.

Very few real-world observations follow a chi-square distribution. The purpose of chi-square distributions is to test hypotheses, not to describe real-world distributions. In contrast, other commonly used distributions, such as the normal and Poisson distributions, can describe real-world quantities like birth weights or illness cases per year.

Chi-Square distributions are useful for hypothesis testing because of their close relationship to the standard normal distribution, on which many essential statistical tests rely.

Specifically, the chi-square distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Pearson's Chi-Square test statistic, which approximately follows this distribution under the null hypothesis, is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where χ² is the Chi-Square test statistic

Σ is the summation over all categories

O is the observed results

E is the expected results

The shape of the distribution graph changes as k, the degrees of freedom, increases.

When k is 1 or 2, the Chi-square distribution curve is shaped like a backwards 'J'. This means there is a high chance that χ² is close to zero.

[Figure: chi-square distribution curves for k = 1 and k = 2. Source: Scribbr]

When k is greater than 2, the distribution curve is hump-shaped, with a low probability that χ² is very near to 0 or very far from 0. The curve is right-skewed: its tail is much longer on the right-hand side than on the left. The most probable value of χ² (the mode) is k − 2.

[Figure: chi-square distribution curves for k greater than 2. Source: Scribbr]

When k is greater than about 90, the Chi-square distribution is closely approximated by a normal distribution.
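The definition of the chi-square distribution as a sum of squared standard normal variables can be checked by simulation. The sketch below is only an illustration: the choice of k = 5, the sample size, and the random seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(seed=0)
k = 5                  # degrees of freedom (arbitrary choice)
n_samples = 100_000

# Sum of squares of k independent standard normal draws, repeated n_samples times.
z = rng.standard_normal(size=(n_samples, k))
samples = (z ** 2).sum(axis=1)

# Compare an empirical tail probability with the theoretical 5% point.
threshold = chi2.ppf(0.95, k)
empirical_tail = (samples > threshold).mean()

print(f"sample mean = {samples.mean():.3f} (theory: {k})")
print(f"share of samples above the 95% point = {empirical_tail:.4f} (theory: 0.05)")
```

The simulated mean should land close to k, and roughly 5% of the simulated values should exceed the theoretical 95% point.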

What is the P-Value in a Chi-Square Test?

The p-value in a Chi-Square test is a statistical measure that helps you assess the statistical significance of your test results.

Here, P denotes probability: the p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. The p-value is compared with the chosen significance level (commonly 0.05) to interpret the result:

  1. P <= 0.05 (The null hypothesis is rejected)
  2. P > 0.05 (The null hypothesis is not rejected)

The Chi-Square test draws on both probability and statistics. Probability is a measure of how likely an event or outcome is to occur.

Statistics involves collecting, organising, analysing, interpreting, and presenting data, and probability makes it possible to summarize bulky or complicated data in an understandable way.

Finding P-Value

Whenever you run a Chi-square test, you'll get a test statistic called χ². You have two options for determining whether this test statistic is statistically significant at some alpha level:

  1. Compare the test statistic χ² to a critical value from the Chi-square distribution table.
  2. Compare the p-value of the test statistic χ² to a chosen alpha level.

P-values are calculated from the sampling distribution of the test statistic under the null hypothesis, the sample data, and the type of test being performed (lower-tailed, upper-tailed, or two-sided).

The p-value is defined as follows in each case.

  • A lower-tailed test is specified by: p-value = P(TS ≤ ts | H0 is true) = cdf(ts)
  • An upper-tailed test is specified by: p-value = P(TS ≥ ts | H0 is true) = 1 - cdf(ts)
  • A two-sided test, assuming the distribution of the test statistic under H0 is symmetric about 0, is specified by: p-value = 2 * P(TS ≥ |ts| | H0 is true) = 2 * (1 - cdf(|ts|))

Where

P: Probability of an event

TS: Test statistic

ts: Observed value of the test statistic computed from your sample

cdf(): Cumulative distribution function of the test statistic's distribution under H0

Chi-Square tests are upper-tailed: large values of χ² count as evidence against the null hypothesis.
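As a concrete instance of the upper-tailed definition, the sketch below computes the p-value for the statistic from the earlier worked example (χ² = 9.837 with 2 degrees of freedom) using `scipy.stats.chi2`. It shows both the literal `1 - cdf` form and SciPy's survival function `sf`, which computes the same quantity directly.

```python
from scipy.stats import chi2

test_statistic = 9.837   # from the gender / political party example
df = 2
alpha = 0.05

# Upper-tailed p-value, written exactly as in the definition: 1 - cdf(ts).
p_from_cdf = 1 - chi2.cdf(test_statistic, df)

# sf() is the survival function (1 - cdf), more accurate for very small tails.
p_value = chi2.sf(test_statistic, df)

print(f"p-value = {p_value:.5f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

The p-value is about 0.0073, below 0.05, so this route leads to the same decision as comparing 9.837 with the critical value 5.991.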

Tools and Software for Chi-Square Analysis

Here are some commonly used tools and software for performing Chi-Square analysis:

1. SPSS (Statistical Package for the Social Sciences) is a widely used software for statistical analysis, including Chi-Square tests. It provides an easy-to-use interface for performing Chi-Square tests for independence, goodness of fit, and other statistical analyses.

2. R is a powerful open-source programming language and software environment for statistical computing. The chisq.test() function in R allows for the easy conducting of Chi-Square tests.

3. The SAS suite is used for advanced analytics, including Chi-Square tests. It is often used in research and business environments for complex data analysis.

4. Microsoft Excel offers a Chi-Square test function (CHISQ.TEST) for users who prefer working within spreadsheets. It’s a good option for basic Chi-Square analysis with smaller datasets.

5. Python (with libraries like SciPy or Pandas) offers robust tools for statistical analysis. The scipy.stats.chisquare() function performs goodness-of-fit tests, and scipy.stats.chi2_contingency() performs tests of independence on a contingency table.

Properties of the Chi-Square Test 

  1. The variance of the Chi-Square distribution is twice the number of degrees of freedom.
  2. The mean of the distribution is equal to the number of degrees of freedom.
  3. As the degrees of freedom increase, the Chi-Square distribution curve approaches a normal distribution.
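These three properties can be checked numerically with `scipy.stats.chi2.stats`, which returns the theoretical moments of the distribution. The sketch below uses skewness as a rough stand-in for "how far from normal" the curve is, since a normal distribution has skewness 0; the values of k are arbitrary.

```python
from scipy.stats import chi2

summary = {}
for k in (1, 2, 10, 100):
    # "mvs" requests the mean, variance, and skewness.
    mean, variance, skewness = (float(v) for v in chi2.stats(k, moments="mvs"))
    summary[k] = (mean, variance, skewness)
    print(f"k = {k:>3}: mean = {mean:.1f}, variance = {variance:.1f}, "
          f"skewness = {skewness:.3f}")
```

The mean equals k and the variance equals 2k in every row, while the skewness shrinks toward 0 as k grows.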

What are the Limitations of the Chi-Square Test?

You should be aware of two limitations of the chi-square test.

  • First, the chi-square test is extremely sensitive to sample size. With a large enough sample, even very weak relationships can appear statistically significant. Remember that "statistically significant" does not always imply "meaningful" when using the chi-square test.
  • Second, the chi-square test can only determine whether two variables are related. It does not establish that one variable causes the other; demonstrating causality requires a more detailed analysis.

Advanced Chi-Square Test Techniques

1. Chi-Square Test with Yates' Correction (Continuity Correction)

This technique is used in 2x2 contingency tables to reduce the Chi-Square value and correct for the overestimation of statistical significance when sample sizes are small. The correction is achieved by subtracting 0.5 from the absolute difference between each observed and expected frequency.
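Written as a formula, with O and E denoting the observed and expected frequencies as elsewhere in this tutorial, the corrected statistic is:

$$\chi^2_{\text{Yates}} = \sum \frac{(|O - E| - 0.5)^2}{E}$$

Because 0.5 is subtracted from each absolute difference before squaring, the corrected statistic is never larger than the uncorrected one for the same table.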

2. Mantel-Haenszel Chi-Square Test

This technique assesses the association between two variables while controlling for one or more confounding variables. It’s particularly useful in stratified analyses that examine the relationship between variables across different strata (e.g., age groups, geographic locations).

3. Chi-Square Test for Trend (Cochran-Armitage Test)

This test is used when the categorical variable is ordinal, and you want to assess whether there is a linear trend in the proportions across the ordered groups. It’s commonly used in epidemiology to analyze trends in disease rates over time or across different exposure levels.

4. Monte Carlo Simulation for Chi-Square Test

When the sample size is very small or when expected frequencies are too low, the Chi-Square distribution may not provide accurate p-values. The Monte Carlo simulation can generate an empirical test statistic distribution in such cases, providing a more accurate significance level.
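A minimal sketch of this idea for a goodness-of-fit test is shown below, using only NumPy. The observed counts are hypothetical, the null hypothesis of equal category probabilities is an assumption made for the example, and the number of simulations and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical small sample: 12 observations over 4 categories (illustrative only).
observed = np.array([6, 3, 2, 1])
null_probs = np.array([0.25, 0.25, 0.25, 0.25])   # assumed H0: equal probabilities
n = int(observed.sum())
expected = n * null_probs                          # 3 per category, below 5


def chi_square_stat(counts):
    """Chi-square statistic for one sample or for each row of many samples."""
    return ((counts - expected) ** 2 / expected).sum(axis=-1)


observed_stat = chi_square_stat(observed)

# Simulate many samples of size n under H0 and compute the statistic for each.
n_sims = 20_000
simulated = rng.multinomial(n, null_probs, size=n_sims)
simulated_stats = chi_square_stat(simulated)

# Empirical p-value: share of simulated statistics at least as large as the
# observed one. The small tolerance guards against floating-point ties, and the
# +1 terms count the observed sample itself so the estimate is never exactly 0.
extreme = np.sum(simulated_stats >= observed_stat - 1e-9)
p_value = (extreme + 1) / (n_sims + 1)

print(f"observed chi-square = {observed_stat:.3f}")
print(f"Monte Carlo p-value = {p_value:.4f}")
```

The simulated statistics play the role of the chi-square table: the p-value comes from where the observed statistic falls among them, not from the theoretical distribution.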

5. Bayesian Chi-Square Test

In Bayesian statistics, the Chi-Square test can be adapted to incorporate prior knowledge or beliefs about the data. This approach is useful when existing information should influence the analysis, leading to potentially more accurate conclusions.

Conclusion

In this tutorial, titled 'The Complete Guide to the Chi-Square Test', you explored the concept of the Chi-square distribution and how to find the related values. You also looked at how the critical value and the Chi-Square value are related.

FAQs

1. What is the Chi-Square test used for?

The Chi-Square test determines whether there is a significant association between two categorical variables. It helps in hypothesis testing to check whether observed frequencies differ from expected ones.

2. What are the types of Chi-Square tests?

There are two main types of Chi-Square tests:

  • Chi-Square Test for Independence: Determines whether two categorical variables are related.
  • Chi-Square Goodness-of-Fit Test: Checks if a sample distribution matches an expected distribution.

3. What are the assumptions of the Chi-Square test?

The Chi-Square test assumes:

  • The data consists of categorical variables.
  • Observations are independent.
  • The expected frequency in each category is at least 5 for valid results.

4. When should you use a Chi-Square test instead of a t-test?

Use a Chi-Square test when analyzing categorical data to check for relationships, whereas a t-test compares means of continuous variables between groups.

5. How do you interpret the results of a Chi-Square test?

If the p-value is less than the significance level (typically 0.05), reject the null hypothesis, indicating a significant relationship between the variables.
If the p-value is greater than 0.05, fail to reject the null hypothesis, meaning no significant relationship was found.

6. Can the Chi-Square test be used for small sample sizes?

The Chi-Square test may not be reliable for small sample sizes, especially if expected frequencies are less than 5. In such cases, Fisher’s Exact Test is a better alternative.
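As an illustration, the sketch below takes a hypothetical 2x2 table with only 8 observations, checks the expected counts with `scipy.stats.chi2_contingency`, and then runs `scipy.stats.fisher_exact` instead. The table values are made up for this example.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table with only 8 observations (illustrative only).
table = [[3, 1],
         [1, 3]]

# Expected counts under independence are all 2.0, below the usual minimum of 5.
_, _, _, expected = chi2_contingency(table)
print("Smallest expected count:", expected.min())

# Fisher's Exact Test computes the p-value exactly, without the chi-square approximation.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact p-value = {p_value:.4f}")
```

Here the exact two-sided p-value is 34/70, about 0.486, so there is no evidence of an association in this small table.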

7. What is an example of a Chi-Square test in real life?

A common example is market research, where a company uses survey data to analyze if customer preference for a product is independent of their geographic location.

About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
