- simple random sampling: every item has an equal chance of being selected
- can be done by numbering every item and using random numbers to pick the sample, or approximated by systematic sampling (choosing every nth item)
- sampling error = the difference between the sample stat (e.g. mean) and the corresponding population parameter (e.g. pop mean) i.e. how (un)representative is the sample stat
- sampling distribution of the sample stat is the probability distribution of all possible values of that stat across equal-sized samples randomly drawn from the same population
- simple random is just random or systematic sampling
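- stratified random sampling is proportionate: split the population into categories (strata), then draw a random sample from each stratum in proportion to its size, so the overall sample contains a representative number of observations from each category e.g. different stocks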
- time-series is looking at one category across multiple time periods
- cross-sectional is looking at multiple categories during one single time period
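As a rough illustration of the three sampling methods above, here is a minimal Python sketch. The function names, the population of 100 numbered items, and the large-cap/small-cap strata are all made up for the example; they are not from the curriculum.

```python
import random

def simple_random_sample(items, k, rng):
    """Pick k items so that every item has the same chance of selection."""
    return rng.sample(items, k)

def systematic_sample(items, step, rng):
    """Pick every `step`-th item, starting from a random offset."""
    start = rng.randrange(step)
    return items[start::step]

def stratified_sample(strata, fraction, rng):
    """Sample the same fraction from each stratum (dict: name -> list of items)."""
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, k)
    return sample

rng = random.Random(42)            # fixed seed so the run is repeatable
population = list(range(1, 101))   # hypothetical items numbered 1..100

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)

# Hypothetical strata: 60 large-cap and 40 small-cap stocks
strata = {"large_cap": list(range(60)), "small_cap": list(range(100, 140))}
strat = stratified_sample(strata, 0.10, rng)

print(srs)
print(sys_sample)
print({name: len(picked) for name, picked in strat.items()})
```

With a 10% fraction the stratified sample holds 6 large-cap and 4 small-cap items, mirroring the 60/40 split of the population.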
- central limit theorem states that for a large enough sample size n (usually n ≥ 30) from a pop with a mean μ and a variance σ², the prob distribution for the sample mean will be approx. normal with a mean μ and a variance of σ²/n
- Theory allows us to use normal distribution to test hypotheses about pop mean, regardless of distrib. of the pop
- As the sample size grows, the sample stats become closer to the pop parameters
- The sampling distribution of the sample mean will be approximately normal.
- The mean of that sampling distribution (the expected value of the sample mean) will equal the population mean (μ).
- The variance of that sampling distribution will equal the population variance (σ²) divided by the size of the sample (n)
- Thus the central limit theorem can help make probability estimates about the sample mean from a non-normal population (e.g. skewed, lognormal), based on the fact that for large sample sizes the sample mean will be approximately normally distributed.
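The central limit theorem points above can be checked with a small simulation. This is only a sketch: the exponential population (μ = 1, σ² = 1), the sample size of 50 and the 5,000 repetitions are assumptions chosen for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with rate 1, so mu = 1 and sigma^2 = 1
mu, sigma2 = 1.0, 1.0
n = 50              # observations per sample
num_samples = 5000  # number of equal-sized samples drawn

# One sample mean per sample: together they approximate the sampling distribution
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {sigma2 / n:.4f})")
```

Even though the population is heavily skewed, the mean of the sample means should land close to μ and their variance close to σ²/n.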
LOS 10e. calculate and interpret the standard error of the sample mean;
- standard error of the sample mean is the standard deviation (of the pop, σ, or if not available, of the sample, s) divided by the square root of the sample size i.e. σ/√n or s/√n
- the sample mean and standard error can be used to calculate approximate confidence intervals for the mean i.e. the actual pop mean will lie between a and b with 95% confidence
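A minimal sketch of the standard error calculation, assuming a made-up sample of nine monthly returns and, for the known-variance case, an assumed population standard deviation of 1.5. The function name is hypothetical.

```python
import math
import statistics

def standard_error(sample, population_sd=None):
    """Standard error of the sample mean.

    Uses the population standard deviation when it is known;
    otherwise falls back to the sample standard deviation.
    """
    sd = population_sd if population_sd is not None else statistics.stdev(sample)
    return sd / math.sqrt(len(sample))

# Hypothetical sample of 9 monthly returns (in %)
returns = [2.0, -1.0, 3.0, 0.5, 1.5, -0.5, 2.5, 1.0, 0.0]

se_sample = standard_error(returns)                    # sigma unknown: uses s
se_known = standard_error(returns, population_sd=1.5)  # sigma assumed known

print(f"sample mean:    {statistics.fmean(returns):.2f}")
print(f"SE using s:     {se_sample:.4f}")
print(f"SE using sigma: {se_known:.4f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.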
LOS 10f. distinguish between a point estimate and a confidence interval estimate of a population parameter;
- point estimate is a single sample value used to estimate a pop parameter e.g. the sample mean is a point estimate of the pop mean
- confidence interval gives a range of values within which the actual value of a parameter is expected to lie with probability 1 - α (α is the level of significance, 1 - α is the degree of confidence)
- unbiased = the expected value of the estimator is equal to the parameter you are trying to estimate
- efficient = variance of its sampling distribution is smaller than that of any other unbiased estimator
- consistent = as sample size grows, estimator accuracy increases i.e. standard error decreases
- confidence intervals are the point estimate ± (reliability factor * standard error)
- Student's t-distribution is used when sample size is small (n < 30) and the pop variance is unknown
- It results in more conservative (wider) confidence intervals (the curve has a lower peak and fatter tails than the normal distribution)
- t-distribution is symmetrical
- defined by degrees of freedom (df), calculated as n-1 (sample size minus one)
- t distribution converges to z distribution as sample size (degrees of freedom) becomes sufficiently large
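To see the convergence described above, the sketch below compares two-tailed 95% reliability factors. The t values are copied from a standard t-table (hard-coded here because Python's standard library has no t-distribution), and the choice of df values is arbitrary.

```python
from statistics import NormalDist

# Two-tailed 95% reliability factors (alpha/2 = 0.025) copied from a standard t-table
T_CRITICAL_95 = {5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042, 60: 2.000, 120: 1.980}

z_95 = NormalDist().inv_cdf(0.975)  # about 1.96

for df, t_value in T_CRITICAL_95.items():
    gap = t_value - z_95
    print(f"df={df:>3}  t={t_value:.3f}  z={z_95:.3f}  t - z = {gap:.3f}")
```

The t factor is always larger than the z factor, which is why t-based intervals are wider, and the gap shrinks as the degrees of freedom grow.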
LOS 10j. calculate and interpret a confidence interval for a population mean, given a normal distribution with 1) a known population variance, 2) an unknown population variance, or 3) an unknown variance and a large sample size;
- here we are constructing a range of values around the sample mean that is expected to contain the pop mean with a given probability
- when available, use population parameters to calculate the confidence interval
- the calculation for when distribution is normal with known variance is:
x̄ ± zα/2 × (σ / √n)
where x̄ is the sample mean,
zα/2 is the reliability factor i.e. the z-score that leaves α/2 in the upper tail,
e.g. zα/2 = 1.645 (often rounded to 1.65) for 90% confidence (sig. level is 10% i.e. 5% in each tail), and 1.96 for 95% confidence; it may be easier to remember this as the factor for a 10% significance level than to think about the tails
and the last part, σ / √n, is the standard error
So for example, if you have a sample mean test score of 80% with a standard error of 5 and want 95% confidence (z = 1.96), the interval is 80 ± 1.96 × 5 = 80 ± 9.8, so the true pop mean would be between 70.2% and 89.8% with 95% confidence
- when variance is unknown, use t distribution:
x̄ ± tα/2 × (s / √n)
- here x̄ is the sample mean, s is the sample standard deviation, and tα/2 is the t-statistic corresponding to a t-distributed random variable with n-1 degrees of freedom
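Both interval formulas have the same shape, point estimate ± reliability factor × standard error, so one helper can cover them. This is a sketch with hypothetical function names. The t case assumes a sample of n = 10 (df = 9), which is not stated in the notes, and takes its reliability factor of 2.262 from a standard t-table.

```python
from statistics import NormalDist

def confidence_interval(point_estimate, standard_error, reliability_factor):
    """Return (lower, upper) = point estimate -/+ reliability factor * standard error."""
    margin = reliability_factor * standard_error
    return point_estimate - margin, point_estimate + margin

def z_factor(confidence):
    """z-score that leaves alpha/2 in the upper tail, e.g. about 1.96 for 95%."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

# Known variance: sample mean 80, standard error 5, 95% confidence
z_low, z_high = confidence_interval(80, 5, z_factor(0.95))

# Unknown variance, assumed n = 10 (df = 9): t-table value for 95% two-tailed is 2.262
t_low, t_high = confidence_interval(80, 5, 2.262)

print(f"z interval: {z_low:.1f} to {z_high:.1f}")  # z interval: 70.2 to 89.8
print(f"t interval: {t_low:.1f} to {t_high:.1f}")  # t interval: 68.7 to 91.3
```

For the same standard error, the t interval is wider than the z interval, reflecting the extra uncertainty from estimating the variance.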
Rules of thumb for when to use t or z
- if distribution is non-normal and the sample is small (n < 30), neither statistic is valid
- if normal w/ known pop variance then use z statistic
- if normal w/ unknown variance use t statistic
- non-normals only work with large samples (n ≥ 30): use z if you know the pop variance, t if you don't
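The rules of thumb can be written as a small decision function. This is one possible encoding, with a hypothetical function name, and it assumes "large sample" means n ≥ 30.

```python
def choose_reliability_factor(is_normal, variance_known, n, large_n=30):
    """Return 'z', 't', or None (no valid statistic) per the rules of thumb.

    large_n is the assumed cut-off for a "large" sample (30 by convention).
    """
    large_sample = n >= large_n
    if not is_normal and not large_sample:
        return None  # non-normal population with a small sample: not available
    return "z" if variance_known else "t"

print(choose_reliability_factor(is_normal=True, variance_known=True, n=10))    # z
print(choose_reliability_factor(is_normal=True, variance_known=False, n=10))   # t
print(choose_reliability_factor(is_normal=False, variance_known=False, n=10))  # None
print(choose_reliability_factor(is_normal=False, variance_known=False, n=50))  # t
```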
- data mining bias = repeatedly searching a data set until a pattern appears, which overstates the significance of that pattern; test the pattern on out-of-sample data to confirm or reject it
- sample selection bias = systematic exclusion of data from analysis, usually because unavailable (creates non-random samples)
- survivorship bias = excluding entities that no longer exist e.g. using only surviving mutual funds in the sample, which overstates average performance
- look-ahead bias = testing a relationship at a point in time using data that was not available at that time
- time-period bias = the relation holds only for the time period tested and does not hold over other time periods
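Survivorship bias is easy to see in a toy simulation. Everything here is invented for illustration: 1,000 funds with normally distributed returns (mean 5%, standard deviation 10%), and the assumption that any fund returning below -5% closes and drops out of the database.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical universe: 1,000 funds with annual returns ~ Normal(5%, 10%)
all_returns = [rng.gauss(5.0, 10.0) for _ in range(1000)]

# Assume funds returning below -5% close and drop out of the database
survivors = [r for r in all_returns if r >= -5.0]

mean_all = statistics.fmean(all_returns)
mean_survivors = statistics.fmean(survivors)

print(f"all funds:      {mean_all:.2f}%  (n={len(all_returns)})")
print(f"survivors only: {mean_survivors:.2f}%  (n={len(survivors)})")
```

The survivor-only average comes out higher than the full-universe average, because the worst performers were removed before the mean was taken.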