Simulating data for use in a linear regression model is fairly straightforward in R because of how the language’s vector operations work.

For instance, to simulate 1000 observations for the following population model:

Y_{i} = 5 + *X*_{1}_{i}(0.03) + *X*_{2}_{i}(0.05) + *Ε*_{i}

*Ε*_{i} ~ *N*(0, 0.25)

We could run the following R code:

n <- 1000 x1 <- rnorm(n) x2 <- rnorm(n) y <- 5 + x1*.03 + x2*.05 + rnorm(n, m=0, sd=.25)

Notice that to calculate y, we only had to enter our regression model. If we calculated a regression using our simulated data, we should see something similar to the following:

Call: lm(formula = y ~ x1 + x2) Residuals: Min 1Q Median 3Q Max -0.71103 -0.17190 -0.00386 0.17086 0.84411 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.010316 0.007763 645.370 < 2e-16 *** x1 0.038880 0.007751 5.016 6.24e-07 *** x2 0.049713 0.007548 6.586 7.28e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2455 on 997 degrees of freedom Multiple R-squared: 0.06377, Adjusted R-squared: 0.06189 F-statistic: 33.95 on 2 and 997 DF, p-value: 5.436e-15

Notice that the coefficients and the residual standard error are close to those in our population model.