Applied Statistics – A Practical Course
2024-12-16
\[ y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \varepsilon_i \]
Fundamental for many statistical methods
\[ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \varepsilon_i^2 \qquad \text{(residual sum of squares)} \]
\[\begin{align} \text{total variance} &= \text{explained variance} &+& \text{residual variance}\\ s^2_y &= s^2_{y|x} &+& s^2_{\varepsilon} \end{align}\]
\[\begin{align} r^2 & = \frac{\text{explained variance}}{\text{total variance}}\\ & = \frac{s^2_{y|x}}{s^2_y}\\ \end{align}\]
It can also be expressed using the ratio of the residual (RSS) and total (TSS) sums of squares:
\[ r^2 = 1-\frac{s^2_{\varepsilon}}{s^2_{y}} = 1-\frac{RSS}{TSS} = 1- \frac{\sum(y_i -\hat{y}_i)^2}{\sum(y_i - \bar{y})^2} \]
For the example: \(r^2= 1-\) 15.3 \(/\) 40.8 \(=\) 0.625
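The two definitions of \(r^2\) can be checked numerically. The following is a minimal sketch with simulated data (an assumption: these are not the data behind the 15.3 / 40.8 example); the variable names `RSS`, `TSS` and `r2` are chosen here for illustration.

```r
## simulated data, for illustration only
set.seed(123)
x <- 1:10
y <- 2 + 0.5 * x + 0.5 * rnorm(10)
reg <- lm(y ~ x)

RSS <- sum(residuals(reg)^2)   # residual sum of squares
TSS <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - RSS / TSS

## the manual value should agree with the one reported by summary()
c(manual = r2, lm = summary(reg)$r.squared)
```

For a model with a single predictor, `cor(x, y)^2` should give the same value.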
For \(y = a + b \cdot x\) with 2 parameters (intercept \(a\), slope \(b\)): \(\frac{\partial\sum \varepsilon_i^2}{\partial{a}}=0\), \(\frac{\partial\sum \varepsilon_i^2}{\partial{b}}=0\):
\[\begin{align} \frac{\partial \sum(\hat{y}_i - y_i)^2}{\partial a} &= \frac{\partial \sum(a + b \cdot x_i - y_i)^2}{\partial a} = 0\\ \frac{\partial \sum(\hat{y}_i - y_i)^2}{\partial b} &= \frac{\partial \sum(a + b \cdot x_i - y_i)^2}{\partial b} = 0 \end{align}\]
Solution of the linear system of equations:
\[\begin{align} b &=\frac {\sum x_iy_i - \frac{1}{n}(\sum x_i \sum y_i)} {\sum x_i^2 - \frac{1}{n}(\sum x_i)^2}\\ a &=\frac {\sum y_i - b \sum x_i}{n} \end{align}\]
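As a plausibility check, the closed-form solution can be evaluated directly and compared with `lm()`. This is a sketch with simulated data (an assumption, not data from the course); `a` is the intercept and `b` the slope, as in the formulas above.

```r
## simulated data, for illustration only
set.seed(42)
x <- 1:10
y <- 2 + 0.5 * x + 0.5 * rnorm(10)
n <- length(x)

## closed-form least-squares solution
b <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
a <- (sum(y) - b * sum(x)) / n

## compare with the coefficients estimated by lm()
rbind(manual = c(a = a, b = b), lm = coef(lm(y ~ x)))
```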
\[
\hat{F}_{1;n-2;\alpha}= \frac{s^2_{\text{explained}}}{s^2_{\text{residual}}}
= \frac{r^2(n-2)}{1-r^2}
\]
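The relation between \(r^2\) and the F statistic can be verified in R. The sketch below uses simulated data (an assumption); `F_manual` and `p_manual` are names introduced here for illustration.

```r
## simulated data, for illustration only
set.seed(1)
x <- 1:20
y <- 2.5 + 2 * x + 2 * rnorm(20)
reg <- lm(y ~ x)

n  <- length(y)
r2 <- summary(reg)$r.squared

## F statistic computed from r^2, with 1 and n - 2 degrees of freedom
F_manual <- r2 * (n - 2) / (1 - r2)
p_manual <- pf(F_manual, df1 = 1, df2 = n - 2, lower.tail = FALSE)

## value reported by summary()
F_lm <- unname(summary(reg)$fstatistic["value"])

c(F_manual = F_manual, F_lm = F_lm, p = p_manual)
```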
Assumptions
See Gelman & Hill (2007): Data analysis using regression …
No regression analysis without graphical diagnostics!
Use graphical methods to check normality of the residuals; don’t rely on the Shapiro-Wilk test in this case.
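One possible way to produce such diagnostic plots is the `plot()` method for `lm` objects. This is a sketch with simulated data (an assumption); the choice of the two panels shown is one option among several.

```r
## simulated data, for illustration only
set.seed(7)
x <- 1:20
y <- 2.5 + 2 * x + 2 * rnorm(20)
reg <- lm(y ~ x)

op <- par(mfrow = c(1, 2))
plot(reg, which = 1)  # residuals vs. fitted: look for curvature or a funnel shape
plot(reg, which = 2)  # normal Q-Q plot of the standardized residuals
par(op)
```

Points that scatter without pattern around zero in the first panel and lie close to the line in the second panel give no indication against the assumptions.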
\[\begin{align} a & \pm t_{1-\alpha/2, n-2} \cdot s_a\\ b & \pm t_{1-\alpha/2, n-2} \cdot s_b \end{align}\]
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.4451 -1.0894 -0.4784 1.5065 3.1933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50740 0.87338 2.871 0.0102 *
x 2.04890 0.07427 27.589 3.51e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.885 on 18 degrees of freedom
Multiple R-squared: 0.9769, Adjusted R-squared: 0.9756
F-statistic: 761.1 on 1 and 18 DF, p-value: 3.514e-16
Example: CI of a: \(a \pm t_{1-\alpha/2, n-2} \cdot s_a = 2.5074 \pm 2.10 \cdot 0.87338\)
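The interval can be computed in R with the t quantile from `qt()`. The sketch below copies the estimate and standard error of the intercept from the summary output above and assumes a 95 % confidence level; `tq` and `ci` are names introduced here.

```r
## values copied from the summary output above
a   <- 2.50740   # estimate of the intercept
s_a <- 0.87338   # its standard error
df  <- 18        # residual degrees of freedom (n - 2)

## t quantile for a two-sided 95 % interval (assumed level)
tq <- qt(1 - 0.05 / 2, df)

## lower and upper confidence limit of a
ci <- a + c(-1, 1) * tq * s_a
ci
```

If the fitted model object is available, `confint(reg)` returns the intervals of all coefficients directly.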
## generate example data
x <- 1:10
y <- 2 + 0.5 * x + 0.5 * rnorm(x)
## fit model
reg <- lm(y ~ x)
summary(reg)
## plot data and regression line
plot(x,y, xlim = c(0, 10), ylim = c(0, 10), pch = 16)
abline(reg, lwd = 2)
## calculate and plot intervals
newdata <- data.frame(x=seq(-1, 11, length=100))
conflim <- predict(reg, newdata=newdata, interval = "confidence")
predlim <- predict(reg, newdata=newdata, interval = "prediction")
lines(newdata$x, conflim[,2], col = "blue")
lines(newdata$x, conflim[,3], col = "blue")
lines(newdata$x, predlim[,2], col = "red")
lines(newdata$x, predlim[,3], col = "red")
newdata: x values in small steps to get a smooth curve
The column name x must be the same as in the model formula
Rainbow test (linearity)
Rainbow test
data: y ~ x
Rain = 0.79952, df1 = 5, df2 = 3, p-value = 0.6153
Breusch-Pagan test (variance homogeneity)
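Both tests are available in the add-on package lmtest. The sketch below assumes that this package is installed and uses simulated data similar to the example above, so the numbers will differ from the output shown.

```r
## simulated data, for illustration only
set.seed(123)
x <- 1:10
y <- 2 + 0.5 * x + 0.5 * rnorm(10)
reg <- lm(y ~ x)

## run the tests only if package lmtest is installed
if (requireNamespace("lmtest", quietly = TRUE)) {
  rain <- lmtest::raintest(y ~ x)  # H0: the relationship is linear
  bp   <- lmtest::bptest(reg)      # H0: the residual variance is constant
  print(rain)
  print(bp)
}
```

With only 10 observations both tests have little power, so a large p-value is no proof that the assumptions hold; the graphical diagnostics remain essential.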
outlierTest from package car
library("MASS")
## test data with 2 "outliers"
x <- c(1, 2, 3, 3, 4, 5, 7, 7, 7, 8, 8, 9, 10, 14, 15, 15, 16, 17, 18, 18)
y <- c(8.1, 20, 10.9, 8.4, 9.6, 16.1, 17.3, 15.3, 16, 15.9, 19.3,
21.3, 24.8, 31.3, 4, 31.9, 33.7, 36.5, 42.4, 38.5)
## fit the models
ssq <- lm(y ~ x)
iwls <- rlm(y ~ x)
iwlsmm <- rlm(y ~ x, method = "MM")
## plot the models
plot(x, y, pch = 16, las = 1)
abline(ssq, col = "blue", lty = "dashed")
abline(iwls, col = "red")
abline(iwlsmm, col = "green")
legend("topleft", c("OLS", "IWLS-M", "IWLS-MM"),
       col = c("blue", "red", "green"),
       lty = c("dashed", "solid", "solid"))
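In addition to the plot, the coefficients and the final weights of the robust fits can be compared numerically. This is a sketch that repeats the data and model fits from above to be self-contained; the `set.seed()` call is an addition, because the MM estimator uses random resampling for its starting values.

```r
library("MASS")

## same test data as above
x <- c(1, 2, 3, 3, 4, 5, 7, 7, 7, 8, 8, 9, 10, 14, 15, 15, 16, 17, 18, 18)
y <- c(8.1, 20, 10.9, 8.4, 9.6, 16.1, 17.3, 15.3, 16, 15.9, 19.3,
       21.3, 24.8, 31.3, 4, 31.9, 33.7, 36.5, 42.4, 38.5)

set.seed(1)  # added for reproducibility of the MM fit
ssq    <- lm(y ~ x)
iwls   <- rlm(y ~ x)
iwlsmm <- rlm(y ~ x, method = "MM")

## coefficients side by side
round(rbind(OLS = coef(ssq), "IWLS-M" = coef(iwls), "IWLS-MM" = coef(iwlsmm)), 2)

## final IWLS weights: values clearly below 1 mark downweighted observations
data.frame(x, y, w_M = round(iwls$w, 2), w_MM = round(iwlsmm$w, 2))
```

Observations with a small weight have little influence on the robust fit, which is why the robust lines follow the bulk of the data more closely than the OLS line.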