It states that the distribution of sample means approaches a normal (Gaussian) distribution as the sample size grows, regardless of the shape of the population's original distribution, provided that distribution has a finite variance. This is crucial for making inferences about populations based on sample data.
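Stated a little more formally, for the classical case of independent, identically distributed observations with finite variance (the symbols $X_i$, $\mu$, $\sigma$, and $\bar{X}_n$ are notation introduced here for illustration):

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
$$

Here $\mu$ and $\sigma$ are the population mean and standard deviation, and $\xrightarrow{d}$ denotes convergence in distribution. Informally, for large $n$ the sample mean is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.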
Understanding the CLT can greatly enhance your data analysis skills, providing a solid foundation for hypothesis testing and confidence interval estimation. However, it does have some limitations:
- Sample Size: The normal approximation is only reliable when the sample size is sufficiently large. A common rule of thumb is n ≥ 30, though strongly skewed or heavy-tailed populations may need more. Small samples may not produce accurate results.
- Independence: The samples must be independent. Dependencies among data points can skew results.
- Identical Distribution: Samples must come from the same distribution. Note: This applies to the classical CLT (Lindeberg-Lévy); more general versions such as the Lyapunov and Lindeberg-Feller CLTs relax this condition.
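The sample-size point can be checked with a small simulation. The sketch below is illustrative only: it assumes NumPy is available, picks an Exponential(1) population as an example of a strongly skewed distribution, and uses a hypothetical helper `sample_mean_skewness`. It measures how lopsided the distribution of sample means is for a few values of n; a skewness near 0 is what a normal distribution would give.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def sample_mean_skewness(n, n_samples=20_000):
    """Skewness of the means of `n_samples` independent samples of size `n`."""
    # Assumed population: Exponential(1), which is strongly right-skewed.
    draws = rng.exponential(scale=1.0, size=(n_samples, n))
    means = draws.mean(axis=1)  # one mean per sample (row)
    centered = means - means.mean()
    # Standardized third moment: 0 for a perfectly symmetric distribution.
    return (centered**3).mean() / (centered**2).mean() ** 1.5


skew_by_n = {n: sample_mean_skewness(n) for n in (2, 30, 200)}
for n, skew in skew_by_n.items():
    print(f"n = {n:>3}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward 0 as n grows, which is the CLT at work. For this population it is still visibly nonzero at n = 30, a reminder that the threshold of 30 is a rule of thumb rather than a guarantee.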
The left graph: the x-axis represents the actual values (e.g., incomes) of observations in the population.
The y-axis represents probability density or relative frequency.
The right graph is a “distribution of averages,” not a distribution of raw data. Its x-axis represents the sample mean computed from a sample of size n (drawn from a population of size N, where n << N). For example:
Step 1: Start with a population
Suppose the population values are: $2, 4, 6, 8, 10$.
This population may have any shape.
Step 2: Take many samples
Take samples of size $n = 2$.
Example samples:
- Sample A: $(2, 4)$
- Sample B: $(4, 10)$
- Sample C: $(6, 8)$
- Sample D: $(2, 10)$
Step 3: Compute a mean for each sample
Each sample produces ONE sample mean:
| Sample | Mean |
|---|---|
| (2,4) | 3 |
| (4,10) | 7 |
| (6,8) | 7 |
| (2,10) | 6 |
So now we have many values of the sample mean $\bar{x}$, namely: $3, 7, 7, 6$.
Step 4: Plot all those means
The right graph plots the frequencies/probabilities of these sample means.
So the x-axis contains many possible values of the sample mean $\bar{x}$, because different samples produce different averages.
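The four steps can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that the text does not state: the population is taken to be 2, 4, 6, 8, 10 (the values that appear in the example samples), and samples are drawn without replacement, so every possible pair is listed exactly once.

```python
from collections import Counter
from itertools import combinations

# Step 1: a small population (assumed values, taken from the example samples).
population = [2, 4, 6, 8, 10]

# Step 2: every possible sample of size n = 2, drawn without replacement.
n = 2
samples = list(combinations(population, n))

# Step 3: one mean per sample.
sample_means = [sum(s) / n for s in samples]

# Step 4: tally how often each mean occurs; the right graph plots this tally.
frequencies = Counter(sample_means)
for value in sorted(frequencies):
    print(f"mean = {value:>4}: {'#' * frequencies[value]}")
```

Even with a population this small, the tally is symmetric and peaks in the middle: the means 5, 6, and 7 each occur twice, while 3, 4, 8, and 9 occur once. With n = 2 this is far from a normal curve, but larger samples from a larger population push the shape closer to one.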