We are often interested in population parameters, for example the mean salary of all adults in a country. But collecting data on the entire population is almost always infeasible. Therefore, we use samples from the population to get a point estimate of the parameter of interest.

But what is the 95% confidence interval of your estimate? What is the standard error? We use resampling techniques to answer such questions, and bootstrapping is one of these techniques.

Wikipedia defines bootstrapping as –

> Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.

**Intuition**

**Why use Bootstrapping?**

- We cannot obtain data for the entire population. Hence, we draw multiple samples from the population and use methods based on the central limit theorem to form the sampling distribution. **But at times we cannot even gather multiple samples from the population. Therefore, we form the bootstrap distribution using the only sample we have.**
- We cannot use methods based on the central limit theorem to construct confidence intervals or hypothesis tests for parameters like the median.

**Why sampling with replacement?**

The idea behind this is that our sample is a good representation of the original population. Hence, for any particular observation in the sample, there will be more than one similar observation in the population.

For example, suppose we have the heights of 20 adult males. We can assume that if one individual in this sample has a height of 60 inches, there will be more than one adult in the original population with a height of 60 inches.
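Although the article uses R, sampling with replacement is easy to illustrate in a short, language-agnostic Python sketch. The height values below are made up for illustration; any observation can appear more than once in the resample, which is exactly the point.

```python
import random

random.seed(0)

# A hypothetical sample of 20 adult male heights in inches (illustrative values only).
heights = [60, 62, 63, 64, 65, 65, 66, 66, 67, 67,
           68, 68, 69, 69, 70, 70, 71, 72, 73, 74]

# Sampling WITH replacement: each draw is taken from the full sample,
# so the same observation can be picked multiple times.
resample = random.choices(heights, k=len(heights))

print(len(resample))  # same size as the original sample
print(sorted(resample))
```

Because duplicates are allowed, the resample typically contains repeated values and omits some originals, which is what lets each bootstrap replicate differ from the original sample.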

**Working**

**Step 1**

Draw a random sample **with replacement** from the original sample, with the same sample size as the original sample.

**Step 2**

Calculate the sample statistic. For example, the median of the sample.

**Step 3**

Repeat steps 1 and 2 N_b times to obtain the bootstrap distribution.

**Step 4**

Use this bootstrap distribution to calculate confidence intervals, standard errors, etc.
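The four steps above can be sketched in a few lines of Python. The sample values and the number of replicates (N_b) below are assumptions chosen for illustration; the statistic is the median, since that is a case where CLT-based methods do not apply.

```python
import random
import statistics

random.seed(42)

# An illustrative sample (values are made up, not from the article).
sample = [12.1, 9.8, 14.3, 10.5, 11.2, 13.7, 9.1, 12.9, 10.8, 11.6,
          13.0, 10.1, 12.4, 9.5, 11.9, 14.0, 10.9, 12.2, 11.4, 13.3]

N_b = 5000  # number of bootstrap replicates

# Steps 1-3: resample with replacement (same size), compute the statistic, repeat.
boot_medians = []
for _ in range(N_b):
    resample = random.choices(sample, k=len(sample))   # Step 1
    boot_medians.append(statistics.median(resample))   # Step 2

# Step 4: summarize the bootstrap distribution.
std_error = statistics.stdev(boot_medians)
boot_medians.sort()
ci_lower = boot_medians[int(0.025 * N_b)]   # 2.5th percentile
ci_upper = boot_medians[int(0.975 * N_b)]   # 97.5th percentile

print(f"bootstrap standard error: {std_error:.3f}")
print(f"95% percentile CI: [{ci_lower}, {ci_upper}]")
```

The confidence interval here is the simple percentile interval; other flavors (basic, BCa) adjust the percentiles to correct for bias and skew.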

**Conditions**

- The sample size is sufficiently large; only then will the estimates be accurate.
- The obtained bootstrap distribution is not extremely skewed or sparse.
- The sample is a good representation of the population.

**Bootstrapping in R**

Performing a bootstrap analysis in R involves only two steps.

- Create a function that computes the statistic of interest. It takes two arguments: the data and a vector of indices used to subset the data.
- Perform the bootstrap, for which I use the boot() function from the boot library.

Here, we will estimate the accuracy of the slope coefficient of a linear regression model.

```
> library(boot)
> set.seed(80)
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> bootFunction <- function(data, index){
+   coef(lm(mpg ~ hp, data=data, subset=index))[[2]]
+ }
> bootFunction(mtcars, 1:10)
[1] -0.05049867
> boot(mtcars, bootFunction, R=1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bootFunction, R = 1000)

Bootstrap Statistics :
     original        bias    std. error
t1* -0.06822828 -0.00169249  0.01365336
```
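For readers who do not use R, the same index-resampling idea that boot() applies can be sketched in Python. The synthetic data below stand in for mtcars (the true slope of about -0.07 and the noise level are assumptions made for illustration, not values from the dataset).

```python
import random
import statistics

random.seed(80)

# Synthetic (x, y) pairs standing in for hp and mpg: a linear trend plus noise.
# The slope -0.07 and noise scale are assumptions for this illustration.
x = [random.uniform(50, 300) for _ in range(32)]
y = [30 - 0.07 * xi + random.gauss(0, 2) for xi in x]

def slope(xs, ys):
    """Ordinary least squares slope: cov(x, y) / var(x)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# Bootstrap the slope by resampling ROW INDICES with replacement,
# mirroring the index vector that boot() passes to the statistic function.
R = 1000
n = len(x)
boot_slopes = []
for _ in range(R):
    idx = [random.randrange(n) for _ in range(n)]
    boot_slopes.append(slope([x[i] for i in idx], [y[i] for i in idx]))

print(f"original slope:       {slope(x, y):.4f}")
print(f"bootstrap std. error: {statistics.stdev(boot_slopes):.4f}")
```

Resampling whole rows (pairs of x and y together) preserves the relationship between the variables, which is why the statistic function receives an index vector rather than separate resamples of each column.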

Thank You.
