I am new to regression. Can someone explain to me how the regression sum of squares shows the explained variation? Essentially, why is it $(\hat{y} - \bar{y})$? I hope I'm explaining my question accurately. I tried drawing a graph with the regression line, the actual values, and the mean, and for some reason I cannot wrap my head around how it shows the explained part. How is it "explained"? Is there a formula I'm not aware of that would explain why the predicted value minus the mean shows how X affects Y?
1 Answer
Both ANOVA (Analysis of Variance) and (multivariable) linear regression are instances of the general linear model (GLM) framework. While ANOVA typically compares group means for categorical predictors, regression generally applies to continuous predictors. However, ANOVA can be seen as a specific type of regression when categorical predictors are coded appropriately.
Using Categorical Predictors in Regression (ANOVA)
ANOVA can be viewed as a regression model with categorical predictors, identifying if significant differences exist between group means. To do this, we encode a categorical predictor with $ k $ levels using $ k-1 $ dummy variables, where each dummy variable represents a group comparison against a reference group.
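As a minimal sketch of this coding scheme, the following (using hypothetical group labels, with "A" chosen as the reference) builds the $k-1 = 2$ dummy columns for a factor with $k = 3$ levels:

```python
import numpy as np

# Hypothetical factor with k = 3 levels: "A" (reference), "B", "C"
groups = np.array(["A", "A", "B", "B", "C", "C"])

# k - 1 = 2 dummy variables; the reference group "A" is coded as all zeros
d_B = (groups == "B").astype(float)
d_C = (groups == "C").astype(float)

# Design matrix with an intercept column: one row per observation
X = np.column_stack([np.ones(len(groups)), d_B, d_C])
print(X)
```

Each dummy column then compares its group against the reference group, so the regression coefficients on `d_B` and `d_C` estimate mean differences from group "A".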
Interpreting the F-Test
In both ANOVA and regression, the F-test evaluates the model’s capacity to explain variability in the outcome. For ANOVA, it tests whether all group means are equal; in regression, it tests whether the predictors (categorical or continuous) significantly explain the variance in the outcome.
Understanding the Regression Sum of Squares (SSR)
The regression sum of squares, denoted $ \text{SSR} $, represents the variation in the outcome that is explained by the model. It is calculated as:
$$ \text{SSR} = \sum (\hat{y}_i - \bar{y})^2 $$
where:
- $ \hat{y}_i $ are the predicted (fitted) values from the regression model,
- $ \bar{y} $ is the mean of the observed $ y $-values.
Why $ (\hat{y}_i - \bar{y}) $ Represents Explained Variation
In regression, $ \hat{y}_i $ represents each fitted value, which is the model’s best estimate for $ y $ given the predictors. The difference $ (\hat{y}_i - \bar{y}) $ measures how much each fitted value deviates from the overall mean $ \bar{y} $. This deviation indicates how much the model’s prediction moves away from a simple average (which would assume no predictor effects) to a more refined prediction that incorporates the relationships captured by the regression.
Thus, by summing $ (\hat{y}_i - \bar{y})^2 $ across all observations, the SSR quantifies the portion of the total variability that can be attributed to the effects of the predictors, rather than random variation.
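To make this concrete, here is a small numerical sketch with a made-up data set (the values are illustrative, not from the question): fit a simple regression, then compute the SSR directly from the definition above.

```python
import numpy as np

# Small illustrative data set (assumed for this example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Fit a simple linear regression by least squares
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x  # fitted values

# SSR: squared deviations of the fitted values from the mean of y
y_bar = y.mean()
ssr = np.sum((y_hat - y_bar) ** 2)
print(round(ssr, 4))  # prints 6.4
```

Note that if the predictors had no effect, the fitted line would be flat at $\bar{y}$, every $\hat{y}_i - \bar{y}$ would be zero, and the SSR would be zero: nothing explained.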
Connecting SSR to Total and Residual Variation
In regression analysis, the total sum of squares (SST) is decomposed as:
$$ \text{SST} = \text{SSR} + \text{SSE} $$
where:
- $ \text{SST} = \sum (y_i - \bar{y})^2 $, the total variation in $ y $,
- $ \text{SSE} = \sum (y_i - \hat{y}_i)^2 $, the residual (or unexplained) variation.
This decomposition shows that the total variation in $ y $ is the sum of explained variation (SSR) and unexplained variation (SSE). Therefore, $ \text{SSR} $ captures how much of the total variation in $ y $ is explained by the model’s fitted values, making it a key measure of the model's explanatory power.
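The decomposition can be verified numerically. Continuing with a toy data set of the same kind (assumed for illustration), the sketch below computes all three sums of squares and checks that SST = SSR + SSE, then reports the explained proportion $R^2 = \text{SSR}/\text{SST}$:

```python
import numpy as np

# Toy data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variation
ssr = np.sum((y_hat - y_bar) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation

# The decomposition holds exactly (up to floating-point error)
print(round(sst, 4), round(ssr + sse, 4))  # prints 8.8 8.8

# R^2: the proportion of total variation explained by the model
print(round(ssr / sst, 4))  # prints 0.7273
```

Dividing through by SST gives $R^2$, which is why SSR is described as the explained part of the variation.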
Extending ANOVA to ANCOVA
Including continuous covariates in an ANOVA model leads to Analysis of Covariance (ANCOVA), combining both categorical and continuous predictors. This approach highlights the regression framework’s flexibility, integrating both predictor types within a single model.
Practical Example
For example, consider assessing the effect of three diets on weight loss. Using ANOVA, we would model diet as a factor with levels for each diet type. In regression, we would encode diet with two dummy variables and fit a linear model. The coefficients for the dummy variables in regression correspond to group differences estimated in ANOVA, illustrating how ANOVA functions as a regression model for categorical predictors.
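A sketch of the diet example, with hypothetical weight-loss numbers, shows the equivalence directly: the regression intercept equals the reference group's mean, and each dummy coefficient equals that group's mean difference from the reference group.

```python
import numpy as np

# Hypothetical weight-loss data for three diets (two observations each)
diet = np.array(["D1", "D1", "D2", "D2", "D3", "D3"])
loss = np.array([2.0, 3.0, 5.0, 6.0, 1.0, 2.0])

# Dummy-code diet with D1 as the reference group
X = np.column_stack([
    np.ones(len(diet)),
    (diet == "D2").astype(float),
    (diet == "D3").astype(float),
])

# Least-squares fit; the coefficients are
# [mean(D1), mean(D2) - mean(D1), mean(D3) - mean(D1)]
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)
print(np.round(beta, 4))
```

With these made-up numbers the group means are 2.5, 5.5, and 1.5, so the fitted coefficients are 2.5, 3.0, and -1.0, exactly the group differences an ANOVA would estimate.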
Summary
In summary, ANOVA is a specific form of regression tailored to categorical predictors. By appropriately coding these predictors, we can frame ANOVA within the regression context, demonstrating the theoretical equivalence of the two approaches. For a comprehensive exploration, see Kutner et al. (2004), which thoroughly discusses this equivalence in applied statistics.
Reference
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2004). Applied linear statistical models (5th ed.). McGraw-Hill Irwin.