Module 9: OLS Assumptions and Extensions

Overview

Module 9 introduces the idea that the OLS estimator is BLUE: the best linear unbiased estimator available. This property holds only if some important assumptions are met, so the module lays out these assumptions as well as methods for checking for potential violations. It then turns to some common OLS extensions, including dummy and categorical independent variables and interaction terms.

Objectives

  • Explain what we mean when we say that the OLS estimator is BLUE.
  • Grasp the intuition behind the core OLS assumptions.
  • Examine and begin checking assumptions in R; explain what violations of different assumptions mean for statistical inference.
  • Incorporate and interpret dummy and categorical independent variables as well as interaction terms in linear regression.

Assignment

Assignment 4 is due by 11:59pm CST on Wednesday, 3/17. Be sure to download the Assignment 4 RMD and review it early in the week so that you can plan your time accordingly.

The Module

As you work your way through the video lectures, note that I again draw many examples from the “CASchools” dataset that is available upon installing and loading the “AER” package in R. You should feel free to follow along with these examples in R, pausing the video as necessary. To do so, start by running this code, which loads the required package and creates our main IV and DV:

# Load the data
require(AER)
data(CASchools)

# Dependent variable
CASchools$score <- (CASchools$read + CASchools$math)/2

# Independent variable
CASchools$STR <- CASchools$students/CASchools$teachers

# Our simple model
model <- lm(score ~ STR, data = CASchools)

OLS Assumptions: BLUE

First read OIS Section 8.3 on “Checking Model Assumptions Using Graphs.” Then watch the video lecture series on “OLS Assumptions.”

OLS Assumptions: The Core Assumptions

OLS Assumptions: Checking Assumptions
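
If you would like to begin checking assumptions in R alongside the video, here is a minimal sketch, assuming you have already run the code above that creates model. It relies on R's built-in diagnostic plots for lm objects.

# A minimal sketch of graphical checks, assuming `model` from the code above

# Residuals vs. fitted values: look for non-linearity and non-constant variance
plot(model, which = 1)

# Normal Q-Q plot of the residuals: look for departures from normality
plot(model, which = 2)

# A simple histogram of the residuals is another quick check
hist(resid(model), breaks = 30, main = "Residuals", xlab = "Residual")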

OLS Extensions

We have so far mainly included numeric variables in our regression models. These are by far the easiest to interpret. But we might also wish to include other variable types in our models, including dummy and categorical variables. The main challenges here concern interpretation and procedure. In terms of interpretation, we can’t straightforwardly interpret the coefficient for a dummy independent variable as “a one-unit change in X is linked to a change in Y.” In terms of procedure, we need to be somewhat careful in our syntax whenever we run a regression that includes a categorical independent variable.

Start by reading Kellstedt and Whitten, the Fundamentals of Political Science Research, pp. 202-212. Then watch the brief video series on “OLS Extensions.”

Extensions: Dummy independent variables

As you watch the next video, note that there is an error on my last slide, which I neglected to correct during recording. Specifically, when you come to the last slide, note the final bullet point: STR should equal 0. Thus, the correct interpretation should read “On average, when STR = 0, we expect schools with Hi_ELL to have test scores around 692.361 – 19.533 = 672.8.”
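
If you would like to reproduce a model along these lines, here is a hedged sketch. Note that my definition of Hi_ELL (schools where more than 10 percent of students are English language learners) is an assumption for illustration, so your exact coefficients may differ from those on the slide.

# Create a 0/1 dummy variable; the 10 percent cutoff is an illustrative assumption
CASchools$Hi_ELL <- as.numeric(CASchools$english > 10)

# Regress test scores on STR and the dummy
dummy_model <- lm(score ~ STR + Hi_ELL, data = CASchools)
summary(dummy_model)

# The Hi_ELL coefficient is the average difference in scores between
# Hi_ELL = 1 and Hi_ELL = 0 schools, holding STR constant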

Extensions: Categorical independent variables
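
In terms of syntax, the key point in R is to store the variable as a factor so that lm() creates the dummy variables for you and omits one level as the reference category. The sketch below is illustrative only: I create a hypothetical three-level category by binning the lunch variable (percent of students qualifying for reduced-price lunch), and the cutoffs are arbitrary.

# Bin a numeric variable into a three-level factor (the cutoffs are arbitrary)
CASchools$lunch_cat <- cut(CASchools$lunch,
                           breaks = c(0, 25, 75, 100),
                           labels = c("Low", "Medium", "High"),
                           include.lowest = TRUE)

# lm() automatically creates dummies for a factor, with "Low" as the reference level
cat_model <- lm(score ~ STR + lunch_cat, data = CASchools)
summary(cat_model)

# Each lunch_cat coefficient is the average difference in scores relative to
# the omitted "Low" category, holding STR constant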

Extensions: Interaction terms
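
As a preview of the syntax, the sketch below interacts STR with the illustrative Hi_ELL dummy created in the sketch above; the asterisk tells lm() to include both main effects and their product.

# score ~ STR * Hi_ELL expands to STR + Hi_ELL + STR:Hi_ELL
int_model <- lm(score ~ STR * Hi_ELL, data = CASchools)
summary(int_model)

# The STR:Hi_ELL coefficient tells us how the slope on STR differs for
# Hi_ELL = 1 schools relative to Hi_ELL = 0 schools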

Standard error and confidence intervals

Now read OIS, Sections 4.1 through 4.2 (pp. 169-179). Note that the Syllabus incorrectly states that you should also read Section 4.3; we will instead read Section 4.3 next week, when we begin performing hypothesis tests.

Now watch the videos on “Standard error” and “Confidence intervals”.
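
If it helps to see these quantities computed directly, here is a minimal sketch that computes the standard error of the sample mean and an approximate 95% confidence interval for a toy sample; the simulated data and the normal critical value of 1.96 are illustrative choices, not the only correct recipe.

set.seed(7)
x <- rnorm(500, mean = 100, sd = 15)   # a toy numeric sample for illustration
n <- length(x)

# Standard error of the sample mean: sample SD divided by the square root of n
se <- sd(x) / sqrt(n)

# Approximate 95% confidence interval using the normal critical value 1.96
c(mean(x) - 1.96 * se, mean(x) + 1.96 * se)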

Extending confidence intervals to indicator variables

Let’s now see how we might extend the logic of CIs to indicator variables. We have already encountered indicator variables while working with the British Election Study (BES) data in Exercise 1 and Assignment 1: In particular, we created the indicator variable female, which took the value 1 if the respondent was female and 0 if the respondent was male. Can we apply CIs to this sort of variable?

Think about this for a moment: In Assignment 1, many of you correctly noted that female is a (nominal) categorical variable, and accordingly, it seems strange to calculate its mean and standard deviation. But many of you also noted that the mean does supply us with some useful information about the variable: namely, the proportion of respondents in the data that are female. This is a good intuition. When we deal with indicator variables, we are – strictly speaking – interested in a proportion rather than an average or central tendency (which are captured by the arithmetic mean). Accordingly, we have to shift our thinking and terminology somewhat. But don’t be thrown by this shift! Although the next video introduces some new parameters (π) and estimators (p-hat) – as well as some math – the concepts themselves are fairly straightforward. For instance, you will find that the formula for calculating the standard deviation of the sample proportion (p-hat) initially looks very similar to the formula for calculating the standard deviation of a numeric variable.
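
To make the parallel concrete, here is a small sketch of how you might compute the sample proportion (p-hat) and its standard error in R. The female vector below is toy data standing in for the BES indicator, so it is purely illustrative.

# female is a 0/1 indicator (1 = female, 0 = male); these values are toy data
female <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)

n <- length(female)
p_hat <- mean(female)                      # the sample proportion

# Standard error of the sample proportion
se_p <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% confidence interval for the population proportion (pi)
c(p_hat - 1.96 * se_p, p_hat + 1.96 * se_p)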

With this in mind, watch the video on “Confidence Intervals: Extensions”.

Note that we did not read the sections from OIS that deal with the sample proportion. If you would like to do so, the relevant section is Section 3.3 (“Geometric distribution”) on pages 141-145. (The following section on the “Binomial distribution” is also useful.) We will read the section on the sampling distribution of the sample proportion next week (Section 4.5).

Practice questions

Now complete the practice questions, which can be downloaded at the “Files/Practice Questions” section at Canvas. You’ll note that the functions that you used to complete Exercise 2 will be very handy here. In fact, we are going to “prove” the logic of CIs in the very same way that we “proved” the CLT: experimentally, through simulation. The real payoff in using R to simulate our data and build CIs is that we can actually visualize what we mean when we interpret our CIs by saying that “we constructed this CI using a method that produces CIs that contain the true population parameter 95 of 100 times.” Let’s attach some real meaning to this phrase!
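
To preview what this looks like, here is a hedged sketch of the basic simulation: draw many samples from a known population, build a 95% CI from each, and count how often the intervals contain the true mean. The population, sample size, and number of repetitions are arbitrary choices for illustration, not necessarily the ones used in the practice questions.

set.seed(42)                  # for reproducibility
true_mean <- 50               # an arbitrary "true" population mean
n <- 100                      # size of each sample
reps <- 1000                  # number of simulated samples

covered <- logical(reps)
for (i in 1:reps) {
  samp <- rnorm(n, mean = true_mean, sd = 10)   # draw one sample
  se <- sd(samp) / sqrt(n)
  covered[i] <- (true_mean >= mean(samp) - 1.96 * se) &
                (true_mean <= mean(samp) + 1.96 * se)
}

# Should be close to 0.95: roughly 95 of every 100 CIs contain the true mean
mean(covered)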

Assignment 2

Once you have completed the practice questions, complete Assignment 2. You can download the RMD at the “Files/Exercises and Assignments” section at Canvas. Submit your completed PDF via Canvas by 5pm on Monday, 2/15. I repeat my standing admonition that you collaborate with your fellow students to complete the practice questions, exercises, and assignments. Of course, this does not mean copy-pasting answers or code; it does mean that you should talk about the questions, discuss your answers, and help each other to better understand both the concepts and code.

Module 4: Foundations for inference 1

Overview

Module 4 introduces foundations for statistical inference. In particular, it introduces the concepts of random variables, probability density functions, the law of large numbers, sampling distributions, and the central limit theorem. In addition to introducing these concepts, the Module aims to help you become familiar with their mechanics by applying them in R using simulated data.

Objectives

  • Define random variables and probability density functions (PDFs) as well as explain their usage in statistics
  • Become familiar with some common PDFs and the real-world processes that they model, or describe
  • Explain what the Law of Large Numbers tells us
  • Explain what the Central Limit Theorem says and why it matters

Exercise 2

Exercise 2 will ask you to explain the mechanics of the Central Limit Theorem using simulated data and plots. It is due at 5pm CST on Monday, 2/08. It can be downloaded at the “Files/Assignments” section at Canvas.

And now, the Module

We switch gears somewhat in this module. So far, we have focused on actual, observed data (samples). We now want to begin thinking about how we can use these data to draw inferences about the populations from which they are drawn. Toward that end, we are going to examine some foundational statistical concepts. We start with the concepts of random variables and probability density functions (PDFs).

Start by reading OIS Sections 3.1 and 3.2 on “The Normal Distribution.” Then watch the video on “Random Variables and PDFs.”

If you would like to go into these topics in greater detail, note that the Syllabus has some suggested but optional readings, which examine “Probability” and “Random Variables and Continuous Distributions”. Again, these readings are suggested but optional.

Now watch the video on “The Normal Distribution.”
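
If you would like to experiment alongside the video, the sketch below simulates draws from a standard normal distribution and overlays the theoretical PDF; the number of draws is an arbitrary choice.

set.seed(1)
draws <- rnorm(10000, mean = 0, sd = 1)   # simulated draws from N(0, 1)

# Histogram of the draws, scaled as a density
hist(draws, breaks = 50, freq = FALSE,
     main = "Simulated N(0, 1) draws", xlab = "x")

# Overlay the theoretical normal PDF
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, lwd = 2)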

The Module 4 practice questions (parts 1 and 2) will further familiarize you with PDFs by guiding you through an exploration of some especially common ones using simulated data and plots. In the meantime, before moving on, take a moment to answer a few fundamental conceptual questions:

  • Conceptually, what are random variables?
  • What is their relationship to the observed variables that we have been working with up until now?
  • And, similarly, what are PDFs?
  • What are some of the real-life, chance processes modelled by one or two of the PDFs from the video?

Now read Lane, Introductory Statistics, pp. 300-315. The PDF is available at the “Files/Readings” section at Canvas.

Now watch the parts 1 and 2 of the video on the “LLN and CLT”.

The best way to really grasp the LLN and CLT is to first become very familiar with their mechanics. We are going to do this in a couple of ways. First, we will spend some time working with a java-based simulation of the Central Limit Theorem at http://onlinestatbook.com/stat_sim/sampling_dist/. Second, in Exercise 2, we will explore the CLT using our own simulated data and graphics as well as teach the concept to the general public via a very brief article.
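
To give you a sense of what the Exercise 2 simulation might look like, here is a minimal sketch. The exponential population and the sample size are my own illustrative choices: the code draws many simple random samples, computes the mean of each, and plots the resulting sampling distribution.

set.seed(123)
reps <- 5000          # number of simple random samples (SRS)
n <- 30               # size of each sample (an illustrative choice)

# Population: an exponential distribution, which is clearly non-normal
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Despite the skewed population, the sampling distribution of the sample mean
# looks approximately normal, as the CLT predicts
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the sample mean", xlab = "Sample mean")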

Watch the “CLT demo” below, in which I briefly demonstrate the usage of the simulation at onlinestatbook.com.

Note: While working with the wonky population distribution in the video, at roughly 7 mins. 20 secs., I incorrectly say something like “even now the sampling distribution is beginning to look a bit like a normal distribution.” In fact, the opposite is true: With just 2 SRSs of size 10, the sampling distribution looks anything but normally distributed. The point is that it approximates the normal distribution as we take more and more SRSs. Sorry if that error caused confusion.

Now go to http://onlinestatbook.com/stat_sim/sampling_dist/. Spend some time playing with the simulation. In particular:

  • Play with parent populations with various distributions. Start with a normally distributed population and then move to a uniformly distributed one. Then draw some really wonky distributions of your own.
  • Play with different sample sizes (n). Start with small samples (e.g., n=5). Then take a bunch of samples and see what happens to the sampling distribution of your sample statistic. Do it again, this time increasing your sample size. What happens to the sampling distribution of your sample statistic as you change the sample size?
  • Start slow. Use the “animated” button to draw one sample at a time as you get started. Then, gradually speed up the sampling process, taking 5 samples at a time, and so forth. What happens to the mean and standard deviation of the sampling distribution as you complete more repetitions?
  • Compare sampling distributions. Set both of the lower two sets of axes to generate sampling distributions of the sample mean. But, designate different sample sizes for the two plots. Then, compare the two sampling distributions as you draw more and more samples. How does the sample size affect the resulting distribution?

After you have spent some time with the online simulation, complete the Module 4 practice questions. As in past Modules, I have split the practice questions into two parts in order to ease the knitting process.

Once you have completed the practice questions, be sure to complete Exercise 2 on the Central Limit Theorem. As always, please collaborate and deliberate with each other on Ed Discussions. I will continue to monitor Ed Discussions regularly to clarify, guide, and help to resolve unanswered questions.

Module 1: Logistics

Overview

Module 1 is mainly logistical. It aims to familiarize you with the course and Module format as well as get you up and running in R and RStudio. It will also introduce some useful coding best practices that you can (and should!) implement as we move forward.

Objectives

  • Review the Course Syllabus
  • Enroll in a weekly section
  • Complete the course Survey and Pre-Assessment
  • Install R, RStudio, and LaTeX onto your computer
  • Become familiar with the RStudio layout
  • Learn some basic operations in R and RStudio via practice questions

Course introduction

If you did not / could not attend our first Zoom meeting on Tuesday 1/12, go to the course Canvas page to download the Syllabus and access a video recording of our first meeting. The Syllabus is located at the “Files/Syllabus” section of the Canvas site. Shortly after our first Zoom meeting on Tuesday, I will upload a video recording of the meeting to “Files/Video.” Read the Syllabus carefully and watch the video.

Weekly sections

Starting next week (Week 2), you will attend a weekly, 40-minute section. These sections will meet in lieu of our regularly scheduled lectures. See the Syllabus for section times.

Before proceeding below, send an email to me with a list of three section times that you can attend. You may rank the times by preference. I will reply with your assigned section time by the end of the first week. You should then log onto Zoom at your assigned time starting in Week 2.

Course survey and pre-assessment

Now complete the Course Survey and Pre-assessment, which you can find here. Answer the questions as best as you can and without referring to any sources. If you don’t know an answer, just leave it blank.

Rest assured that the pre-assessment is not graded! It simply helps me to adjust the course somewhat according to your prior knowledge, skills and expressed aims.

Install R, RStudio, and LaTeX

Now install R, RStudio, and LaTeX onto your personal computer. Installation instructions are available at the “Files” section of the course Canvas site. If you are unable to install these applications after carefully reading the instructions, I will be available during our regularly scheduled lecture on Thursday 1/14 to help you troubleshoot installation issues. Please attend.

Getting started in R, RStudio, and Markdown

Once you are up and running in R and RStudio, watch the “Intro. to R” video below.

Now open RStudio and watch the brief “R Demonstration” video below, which will familiarize you with RStudio’s layout. Feel free to follow along with the video in RStudio, pausing and restarting the video as necessary.

Now go to the course Canvas page. Download the Markdown (RMD) file entitled “module 1 practice questions.” It is located at “Files/Practice Questions.” Open the file, read the instructions, and then work your way through it. When you have completed it, try to “knit” it to PDF. Examine the result.

Note that if the document does not knit, your file likely contains missing or broken code. Review your RMD file – particularly code chunks – for completeness and accuracy and try again. If it still does not work, try to identify the likely culprit and then seek help on that portion via Piazza.

Once you have completed the Practice Questions, read “Coding and Good Computing Practices” by Nagler (1995). Did you find yourself naturally implementing any of the author’s suggested practices as you completed the Module 1 Practice Questions? Which ones? Are there practices that you did not implement but which you believe will be useful going forward? (You might revise your Practice Questions RMD file and implement some of them if you have time.)

That is all for this first Module. Be sure to review the Objectives at the top of this post. That list will serve as a sort of checklist and thus help to ensure that you have grasped the material and completed any deliverables.