iris
datasetThere is a built-in dataset called iris
that has measurements of different parts of flowers. (See ?iris
for information about the dataset.)
Use ggplot2 to make a scatterplot that visualizes the relationship between sepal length (horizontal axis) and petal width (vertical axis). Watch out for overplotting.
ggplot(iris, aes(Sepal.Length, Petal.Width)) +
geom_point(alpha = .25)
Run a regression model that predicts the petal width from the sepal length, and store the model object in the variable iris_mod
. End the block by printing out the summary of the model.
iris_mod <- lm(Petal.Width ~ Sepal.Length, iris)
summary(iris_mod) #print out the model summary
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96671 -0.35936 -0.01787 0.28388 1.23329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.20022 0.25689 -12.46 <2e-16 ***
## Sepal.Length 0.75292 0.04353 17.30 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16
Make a histogram of the residuals of the model using ggplot2.
residuals <- residuals(iris_mod)
ggplot() +
geom_histogram(aes(residuals), color = "black")
Write code to predict the petal width for two plants, the first with a sepal length of 5.25cm, and the second with a sepal length of 6.3cm. Store the vector of predictions in the variable iris_pred
.
iris_pred <- predict(iris_mod, tibble(Sepal.Length = c(5.25, 6.3)))
iris_pred # print the predicted values
## 1 2
## 0.7526022 1.5431657
NOTE: You can knit this file to html to see formatted versions of the equations below (which are enclosed in $
characters); alternatively, if you find it easier, you can hover your mouse pointer over the $
in the code equations to see the formatted versions.
Write code to randomly generate 10 Y values from a simple linear regression model with an intercept of 3 and a slope of -7. Recall the form of the linear model:
\(Y_i = \beta_0 + \beta_1 X_i + e_i\)
The residuals (\(e_i\)s) are drawn from a normal distribution with mean 0 and variance \(\sigma^2 = 4\), and \(X\) is the vector of integer values from 1 to 10. Store the 10 observations in the variable Yi
below. (NOTE: the standard deviation is the square root of the variance, i.e. \(\sigma\); rnorm()
takes the standard deviation, not the variance, as its third argument).
X <- 1:10
err <- rnorm(10, sd = 2)
Yi <- 3 - 7 * X + err
Yi # print the values of Yi
## [1] -0.7426784 -7.7594656 -20.9941041 -22.8379955
## [5] -31.3152752 -40.3503036 -45.3260546 -51.4067219
## [9] -60.7134652 -68.6375608
Write a function to simulate data with the form.
\(Y_i = \beta_0 + \beta_1 X_i + e_i\)
The function should take arguments for the number of observations to return (n
), the intercept (b0
), the effect (b1
), the mean and SD of the predictor variable X (X_mu
and X_sd
), and the SD of the residual error (err_sd
). The function should return a tibble with n
rows and the columns id
, X
and Y
.
sim_lm_data <- function(n = 100, b0 = 0, b1 = 0,
X_mu = 0, X_sd = 1, err_sd = 1) {
tibble(
id = 1:n,
X = rnorm(n, X_mu, X_sd),
err = rnorm(n, 0, err_sd),
Y = b0 + b1*X + err
) %>%
select(id, X, Y)
}
dat6 <- sim_lm_data(n = 10) # do not edit
knitr::kable(dat6) # print table
id | X | Y |
---|---|---|
1 | -0.2677796 | -1.1495673 |
2 | -0.3635290 | 1.2939085 |
3 | 1.0291560 | -0.4317920 |
4 | -0.2638047 | 0.3443449 |
5 | 0.4261811 | -0.4751273 |
6 | 1.2697468 | 0.7615962 |
7 | -2.4711206 | 0.0701522 |
8 | -0.8314822 | 0.4079785 |
9 | 0.5470372 | -1.0239402 |
10 | 0.0803618 | 1.2476843 |
Use the function from Question 6 to generate a data table with 10000 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, and residual error SD of 2.
Analyse the data with lm()
. Find where the analysis summary estimates the values of b0
and b1
. What happens if you change the simulation values?
dat7 <- sim_lm_data(n = 10000, b0 = 80, b1 = 0.5,
X_mu = 0, X_sd = 1, err_sd = 2)
mod7 <- lm(Y ~ X, data = dat7)
summary(mod7) # print summary
##
## Call:
## lm(formula = Y ~ X, data = dat7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5128 -1.3554 0.0174 1.3832 7.2877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.00069 0.02015 3970.43 <2e-16 ***
## X 0.51867 0.02012 25.77 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.015 on 9998 degrees of freedom
## Multiple R-squared: 0.0623, Adjusted R-squared: 0.06221
## F-statistic: 664.3 on 1 and 9998 DF, p-value: < 2.2e-16
Use the function from Question 6 to calculate power by simulation for the effect of X on Y in a design with 50 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, residual error SD of 2, and alpha of 0.05.
Hint: use broom::tidy()
to get the p-value for the effect of X.
# ... lets you include any arguments to send to sim_lm_data()
sim_lm_power <- function(...) {
dat <- sim_lm_data(...)
lm(Y~X, dat) %>%
broom::tidy() %>%
filter(term == "X") %>%
pull(p.value)
}
p_values <- replicate(1000, sim_lm_power(n = 50, b0 = 80, b1 = 0.5, X_mu = 0, X_sd = 1, err_sd = 2))
power <- mean(p_values < .05)
power # print the value
## [1] 0.403
Calculate power (i.e., the false positive rate) for the effect of X on Y in a design with 50 subjects where there is no effect and alpha is 0.05.
p_values <- replicate(1000, sim_lm_power(n = 50))
false_pos <- mean(p_values < .05)
false_pos # print the value
## [1] 0.046