Tuesday, January 12, 2010

Study Session 3 - Reading 10 Sampling and Estimation

LOS 10a. define simple random sampling, sampling error, and a sampling distribution, and interpret sampling error;

  • simple random sampling: every item has an equal chance of being selected
  • can be done by assigning a number to each item and using random numbers to select, or approximated by systematically choosing every nth item
  • sampling error = the difference between the sample stat (e.g. sample mean) and the corresponding population parameter (e.g. pop mean), i.e. how (un)representative the sample stat is
  • sampling distribution of a sample stat = the probability distribution of all possible values of that stat across equal-sized samples randomly drawn from the same population
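
A minimal Python sketch of the three ideas above, using a made-up population of 1,000 test scores. The data, the sample size of 50, and the helper name sampling_error are assumptions for illustration only.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up test scores.
random.seed(42)
population = [random.gauss(70, 10) for _ in range(1000)]
population_mean = statistics.mean(population)

def sampling_error(sample, population_mean):
    """Sample mean minus the population mean."""
    return statistics.mean(sample) - population_mean

# Simple random sample: every item has an equal chance of being selected.
simple_sample = random.sample(population, 50)

# Systematic sample: every 20th item, starting from a random offset.
start = random.randrange(20)
systematic_sample = population[start::20]

print(sampling_error(simple_sample, population_mean))
print(sampling_error(systematic_sample, population_mean))

# Sampling distribution: the sample means of many equal-sized random samples.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(2000)
]
print(statistics.mean(sample_means), population_mean)  # these should be close
```

Each individual sample has its own sampling error, but the sample means taken together cluster around the population mean.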
LOS 10b. distinguish between simple random and stratified random sampling;
  • simple random is just random or systematic sampling
  • stratified random sampling is proportionate - ensuring that the random sample contains a representative number of observations from each category e.g. different stocks
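
A rough sketch of proportionate stratified sampling, assuming a hypothetical universe of 100 stocks grouped by sector. The sector names, tickers, and the function stratified_sample are made up for illustration.

```python
import random

def stratified_sample(items_by_stratum, total_sample_size, rng):
    """Draw a proportionate random sample from each stratum."""
    population_size = sum(len(items) for items in items_by_stratum.values())
    sample = {}
    for name, items in items_by_stratum.items():
        # Allocate in proportion to the stratum's share of the population.
        k = round(total_sample_size * len(items) / population_size)
        sample[name] = rng.sample(items, k)
    return sample

# Hypothetical universe of 100 stocks split by sector (tickers are made up).
universe = {
    "financials": [f"FIN{i}" for i in range(50)],
    "technology": [f"TEC{i}" for i in range(30)],
    "utilities": [f"UTL{i}" for i in range(20)],
}
picked = stratified_sample(universe, 10, random.Random(0))
print({name: len(stocks) for name, stocks in picked.items()})
# {'financials': 5, 'technology': 3, 'utilities': 2}
```

With less convenient proportions the rounded allocations may not add up exactly to the requested total, so a real implementation would need a rule for the remainder.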
LOS 10c. distinguish between time-series and cross-sectional data;

  • time-series is looking at one category across multiple time periods
  • cross-sectional is looking at multiple categories during one single time period
LOS 10d. interpret the central limit theorem and describe its importance;
  • central limit theorem states that for a large enough sample size n (usually ≥ 30) from a pop with a mean μ and a variance σ², the prob distribution for the sample mean will be approx. normal with a mean μ and a variance of σ²/n
  • Theory allows us to use normal distribution to test hypotheses about pop mean, regardless of distrib. of the pop
  • As the sample size grows, the sample stats become closer to the pop parameters
  • The sample mean will be approximately normally distributed.
  • The mean of the sample means (i.e. the mean of the sampling distribution) will be equal to the population mean (μ).
  • The variance of the sample means will be equal to the population variance (σ²) divided by the size of the sample (n)
  • Thus the central limit theorem can help make probability estimates for a sample of a non-normal population (e.g. skewed, lognormal), based on the fact that the sample mean for large sample sizes will be approximately normally distributed.
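
The theorem can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, variance 1), the sample size of 40, and the number of trials are arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Skewed, non-normal population: exponential with mean 1 and variance 1.
mu, sigma2 = 1.0, 1.0
n = 40          # sample size, above the usual threshold of 30
trials = 10000  # number of samples drawn

# One sample mean per trial: together they approximate the sampling distribution.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

print(statistics.fmean(sample_means))     # close to mu = 1.0
print(statistics.variance(sample_means))  # close to sigma2 / n = 0.025
```

A histogram of sample_means would look roughly bell-shaped even though the population itself is strongly skewed.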

LOS 10e. calculate and interpret the standard error of the sample mean;

  • standard error is the standard deviation (of the pop or, if not available, the sample) divided by the square root of the sample size
  • the sample mean and standard error can be used to calculate approximate confidence intervals for the mean i.e. the actual pop mean will lie between a and b with 95% confidence
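
As a quick sketch of the calculation, the helper below (a hypothetical name) uses the population standard deviation when it is supplied and falls back to the sample standard deviation otherwise. The return figures are made up.

```python
import math
import statistics

def standard_error(sample, population_sd=None):
    """Standard error of the sample mean: sd / sqrt(n)."""
    if population_sd is not None:
        sd = population_sd
    else:
        sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return sd / math.sqrt(len(sample))

# Made-up monthly returns, n = 9.
returns = [0.02, -0.01, 0.03, 0.00, 0.01, 0.04, -0.02, 0.01, 0.02]

print(standard_error(returns, population_sd=0.03))  # 0.03 / sqrt(9), about 0.01
print(standard_error(returns))                      # based on the sample sd
```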

LOS 10f. distinguish between a point estimate and a confidence interval estimate of a population parameter;

  • point estimate is a single sample value used to estimate a pop parameter e.g. the sample mean is a point estimate of the pop mean
  • confidence interval gives a range of values within which the actual value of a parameter will lie, given a probability of 1 - α (α is the level of significance)
LOS 10g. identify and describe the desirable properties of an estimator;
  • unbiased = the expected value of the estimator is equal to parameter you are trying to estimate
  • efficient = variance of its sampling distribution is smaller than that of any other unbiased estimator
  • consistent = as sample size grows, estimator accuracy increases i.e. standard error decreases
LOS 10h. explain the construction of confidence intervals;
  • confidence intervals are the point estimate ± (reliability factor * standard error)
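
The construction translates directly into a few lines of Python. The function name and the numbers in the usage line are illustrative assumptions.

```python
def confidence_interval(point_estimate, reliability_factor, standard_error):
    """Return (lower, upper) = point estimate -/+ reliability factor * standard error."""
    margin = reliability_factor * standard_error
    return point_estimate - margin, point_estimate + margin

# Illustrative numbers: point estimate 25, reliability factor 1.96, standard error 2.
low, high = confidence_interval(25.0, 1.96, 2.0)
print(low, high)  # roughly 21.08 and 28.92
```

The reliability factor depends on the distribution used (z or t) and on the confidence level, so it is passed in rather than computed here.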
LOS 10i. describe the properties of Student’s t-distribution and calculate and interpret its degrees of freedom;

  • Student's t-distribution is used when sample size is small (< 30) and the pop variance is unknown
  • It results in more conservative (wider) confidence intervals (the curve has fatter tails than the normal distribution)
  • t-distribution is symmetrical
  • defined by degrees of freedom (df), calculated as n-1 (sample size minus one)
  • t distribution converges to z distribution as sample size (degrees of freedom) becomes sufficiently large

LOS 10j. calculate and interpret a confidence interval for a population mean, given a normal distribution with 1) a known population variance, 2) an unknown population variance, or 3) an unknown variance and a large sample size;

  • here we are trying to calculate the probability of the pop mean being within a certain range of values based on the sample mean distribution
  • when available, use population parameters to calculate the confidence interval
  • the calculation for when distribution is normal with known variance is:

    x̄ ± zα/2 × σ/√n

  • where x̄ is the sample mean,
    zα/2 is the reliability factor i.e. the z-score that leaves α/2 in the upper tail,
    e.g. zα/2 = 1.645 for 90% confidence (sig. level is 10% i.e. 5% in each tail) - might want to just think of this as 10% instead of thinking about the tails bit
    and the last part, σ/√n, is the standard error

So for example, if you have a sample mean test score of 80% with a standard error of 5 percentage points, the 95% confidence interval is 80 ± 1.96 × 5, so the true pop mean would be between 70.2% and 89.8% with 95% confidence
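
The z reliability factors can be looked up in a table or computed. The sketch below uses NormalDist from Python's standard library. The helper name z_reliability_factor is made up, and the interval uses the test-score numbers (mean 80, standard error 5).

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """z-score that leaves alpha/2 in the upper tail of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    print(confidence, round(z_reliability_factor(confidence), 3))
# 0.9 1.645
# 0.95 1.96
# 0.99 2.576

# 95% interval for a sample mean of 80 with a standard error of 5.
z = z_reliability_factor(0.95)
lower, upper = 80 - z * 5, 80 + z * 5
print(round(lower, 1), round(upper, 1))  # 70.2 89.8
```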

  • when variance is unknown, use t distribution: x̄ ± tα/2 × s/√n, where s is the sample standard deviation
  • here the tα/2 part is the t-statistic corresponding to a t-distributed random variable with n-1 degrees of freedom
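
A small sketch of the t-based interval. Python's standard library has no t-distribution, so the critical value is passed in from a t-table: 2.262 is the two-tailed 95% value for 9 degrees of freedom. The sample data and the function name are made up.

```python
import math
import statistics

def t_confidence_interval(sample, t_critical):
    """Interval: sample mean -/+ t_critical * s / sqrt(n), with s the sample sd."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_critical * se, mean + t_critical * se

# Made-up sample with n = 10, so df = n - 1 = 9.
sample = [12, 15, 11, 14, 13, 16, 12, 14, 15, 13]

# t-table value for 95% confidence (2.5% in each tail) and df = 9.
low, high = t_confidence_interval(sample, t_critical=2.262)
print(round(low, 3), round(high, 3))  # 12.369 14.631
```

Here the sample mean is 13.5 and the standard error is 0.5, so the interval is 13.5 ± 2.262 × 0.5. The equivalent z-based interval would use 1.96 and be narrower.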

Rules of thumb for when to use t or z

  • if distribution is non-normal then small sample sizes do not work
  • if normal w/ known pop variance then use z statistic
  • if normal w/ unknown variance use t statistic
  • non-normals only work with large samples, use z or t depending on whether you know variance
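
These rules of thumb can be written as a small decision function. This is only a sketch of the notes above: the function name is hypothetical and the cutoff of 30 for a "large" sample is the usual convention rather than a hard rule.

```python
def reliability_statistic(population_normal, variance_known, n, large_n=30):
    """Return 'z', 't', or None (no reliable statistic) per the rules of thumb."""
    if not population_normal and n < large_n:
        # Non-normal population with a small sample: neither statistic works.
        return None
    # Otherwise the choice depends only on whether the pop variance is known.
    return "z" if variance_known else "t"

print(reliability_statistic(True, True, 15))    # z
print(reliability_statistic(True, False, 15))   # t
print(reliability_statistic(False, False, 15))  # None
print(reliability_statistic(False, True, 50))   # z
print(reliability_statistic(False, False, 50))  # t
```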
LOS 10k. discuss the issues regarding selection of the appropriate sample size, data-mining bias, sample selection bias, survivorship bias, look-ahead bias, and time-period bias.
  • data mining = overestimating significance of a pattern in a data set; test pattern on out of sample data to confirm or deny overestimation of significance
  • sample selection bias = systematic exclusion of data from analysis, usually because unavailable (creates non-random samples)
  • survivorship bias = excluding items that failed or dropped out, e.g. using only surviving mutual funds in the sample
  • look-ahead bias = basing the test at a point in time on data not available at that time
  • time-period bias = the result is specific to the time period tested, i.e. the relation does not hold over other time periods
