Module 7: Bivariate regression

Overview

In Module 7, we begin conducting bivariate analysis. Toward this end, the Module introduces various tools for examining linear relationships between variables and testing them for statistical significance. In particular, the Module moves from measures of joint fluctuation such as covariance and correlation to bivariate linear regression.

Objectives

  • Calculate covariance and correlation as well as explain what these measures capture conceptually.
  • Explain how the OLS regression line helps us model the relationship between two variables.
  • Perform bivariate OLS regression in R and interpret the regression output.

Assignment

Assignment 3 is due at 11:59pm CST on Monday, 3/01. The Assignment 3 RMD file can be downloaded at the “Files/Exercises and Assignments” section at Canvas. Be sure to download and review the file early in the week so that you can plan your time accordingly.

Module

The video lectures for this week kick off with a very brief review of some of the assumptions that power the Central Limit Theorem. Watch the video on “CLT assumptions revisited.” (Note that this title differs from the text shown on the title slide of my presentation.)

Let’s now turn to the central topic for this week’s Module: bivariate analysis. Note that so far, we have been performing what’s known as “univariate” analysis. This means that we have been examining single variables rather than relationships between variables. For instance, in Module 6, we focused primarily on the mean of a single variable such as annual income. (Recall our running Somersville example.) We then used hypothesis tests to assess whether some observed value of our sample mean provided evidence to refute some hypothesized value of the underlying population mean.

For the remainder of the quarter, we turn our attention to “bivariate” and “multivariate” analysis. That is, we will examine relationships between two or more variables. Ultimately, the basic procedures for such analysis are very similar to those we have used up until now: We will use sample statistics to estimate population parameters; we will calculate the typical error associated with our estimates; we will construct CIs; we will calculate test statistics; and we will perform hypothesis tests.

Measures of joint fluctuation

Start by reading Sections 3.6, 4.2.1, and 4.2.2 of Imai, K. (2017), Quantitative Social Science: An Introduction. The PDF is available at the “Files” section at Canvas.

Now watch the two videos on “Correlation” (Introduction and Significance Tests, respectively).
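To make these two measures concrete, here is a minimal sketch in R that calculates covariance and correlation by hand and then checks the results against R’s built-in functions. The data are a small, made-up example (the variable names `education` and `income` are my own assumptions, not taken from the reading or the videos), so treat this as an illustration rather than a template for the assignment.

```r
# Hypothetical toy data: years of education and annual income (in $1,000s)
education <- c(10, 12, 12, 14, 16, 16, 18, 20)
income    <- c(28, 35, 31, 42, 50, 47, 61, 66)

n <- length(education)

# Sample covariance: sum of the products of deviations from each mean,
# divided by n - 1
cov_by_hand <- sum((education - mean(education)) * (income - mean(income))) / (n - 1)

# Correlation: covariance rescaled by both standard deviations,
# which bounds it between -1 and 1
cor_by_hand <- cov_by_hand / (sd(education) * sd(income))

# Built-in equivalents
cov_builtin <- cov(education, income)
cor_builtin <- cor(education, income)

# Significance test for the correlation (H0: the population correlation is 0)
cor_test <- cor.test(education, income)

print(c(by_hand = cov_by_hand, builtin = cov_builtin))
print(c(by_hand = cor_by_hand, builtin = cor_builtin))
print(cor_test$p.value)
```

Note that covariance is expressed in the units of both variables multiplied together, which makes it hard to interpret on its own. Correlation removes the units, which is why we usually report it instead.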

Bivariate regression: Introduction

Now read OIS Sections 7.1–7.4. Then, watch the video series on “Bivariate regression.” Note that I have divided what would normally be a single lecture into a series of short videos organized by sub-topic.

Bivariate regression: The regression line

Bivariate regression: Calculating regression coefficients
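As a rough companion to this video, the sketch below calculates the bivariate OLS slope and intercept by hand and compares them with the output of `lm()`. The data are made up for illustration (the names `education` and `income` are assumptions), and the formulas used are the standard bivariate OLS results: the slope is the covariance of x and y divided by the variance of x, and the intercept forces the line through the point of means.

```r
# Hypothetical toy data: years of education and annual income (in $1,000s)
education <- c(10, 12, 12, 14, 16, 16, 18, 20)
income    <- c(28, 35, 31, 42, 50, 47, 61, 66)

# Slope: covariance of x and y divided by the variance of x
slope_by_hand <- cov(education, income) / var(education)

# Intercept: chosen so the line passes through (mean of x, mean of y)
intercept_by_hand <- mean(income) - slope_by_hand * mean(education)

# The same model estimated with lm()
fit <- lm(income ~ education)

print(c(intercept = intercept_by_hand, slope = slope_by_hand))
print(coef(fit))
```

The two sets of numbers should agree, which is a useful check that you understand what `lm()` is doing under the hood.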

Bivariate regression: Prediction, tests of significance, and interpretation
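The following sketch shows one way to carry out these three tasks in R. It uses made-up data (the names `education` and `income` and the value of 15 years of education are assumptions for illustration), so the numbers themselves mean nothing; the point is where to find each piece in the regression output.

```r
# Hypothetical toy data: years of education and annual income (in $1,000s)
education <- c(10, 12, 12, 14, 16, 16, 18, 20)
income    <- c(28, 35, 31, 42, 50, 47, 61, 66)

fit <- lm(income ~ education)

# Coefficient table: estimate, standard error, t value, and p-value per term
coef_table <- summary(fit)$coefficients
print(coef_table)

# p-value for the slope (H0: the population slope is 0)
slope_p <- coef_table["education", "Pr(>|t|)"]

# 95% confidence intervals for the intercept and the slope
print(confint(fit, level = 0.95))

# Predicted income for a hypothetical person with 15 years of education
predicted_income <- predict(fit, newdata = data.frame(education = 15))
print(predicted_income)
```

When interpreting the slope, read it as the expected change in the outcome associated with a one-unit increase in the predictor. Here, that would be the expected difference in income (in $1,000s) associated with one additional year of education.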

Once you have completed your reading and watched the video lectures above, download and complete the Module 7 practice questions. They can be downloaded at the “Files/Practice Questions” section at Canvas.

Finally, be sure to download and complete Assignment 3, which is located at the “Files/Exercises and Assignments” section at Canvas.

Module 6: Hypothesis testing

Overview

Module 6 builds on Modules 4 and 5 on “Foundations for inference” by examining (univariate) hypothesis testing. We examine the t-distribution as a means of accounting for uncertainty in our estimates when our sample size is small or the population standard deviation is unknown. We then examine hypothesis tests for one and two populations using the t-distribution, and we begin to conduct such tests in R. Finally, we read and analyze one especially prominent and recent use of hypothesis testing in US politics.

Objectives

  • Explain what a t-distribution is and why we use it.
  • Define null hypothesis, alternative hypothesis, and type 1 and 2 errors.
  • Explain and put to practice the steps in conducting hypothesis testing.
    • State a null and alternative hypothesis
    • Calculate appropriate test statistics
    • Calculate p-values
    • Compare p-values against different significance levels and interpret the results
  • Critically analyze real-life usages of hypothesis testing.

Assignment

Exercise 3 is due at 5pm CST on Monday, 2/22. The Exercise 3 RMD file is located at Canvas. Be sure to download and skim through the file early in the week so that you can plan your time accordingly.

Mid-course feedback

Please look for a Canvas announcement from me later in the week with a link to a Google Form that will allow you to provide me with anonymous, mid-course feedback. I am always looking for ways to improve my instruction and better meet your needs and aims, so please do take a moment to complete the Form once it is published.

Module

t-distribution

We’re going to begin by refining our grasp of standard errors and confidence intervals. In particular, we’re going to examine the t-distribution as a conservative alternative to the normal distribution that will allow us to account for additional uncertainty whenever our sample size is small or we estimate the population standard deviation using the sample standard deviation.

Now watch the video on the “t-distribution.”
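To see why the t-distribution is the more conservative choice, the short sketch below compares the critical values for a 95% confidence interval under the normal distribution and under t-distributions with different degrees of freedom. The particular degrees of freedom are arbitrary values I chose for illustration.

```r
# Critical value for a 95% CI under the standard normal distribution
z_crit <- qnorm(0.975)

# Critical values under t-distributions with a few illustrative
# degrees of freedom (df = n - 1 for a single sample mean)
df     <- c(4, 9, 29, 99)
t_crit <- qt(0.975, df = df)

comparison <- data.frame(df = df, t_crit = t_crit, z_crit = z_crit)
print(round(comparison, 3))
```

The t critical values are always larger than the normal critical value, which widens our confidence intervals. As the degrees of freedom grow, the gap shrinks and the t-distribution approaches the normal distribution.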

(Univariate) Hypothesis testing

Now read OIS Sections 4.3 through 4.5. Note that the Syllabus lists some additional sections of OIS that are suggested but optional.

After you have completed the reading, watch the video on “Hypothesis testing 1.”

Now watch the video on “Hypothesis testing 2.”

I realize that the videos for this module are longer than usual, but I wanted to be sure to walk through some examples step-by-step so that you can begin conducting your own hypothesis tests in the practice questions and exercise.
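As a compact reference, here is a sketch of a one-sample, two-sided t-test in R, first step-by-step and then with `t.test()`. The sample, the hypothesized mean of 50, and the significance level of 0.05 are all made-up values for illustration and are not drawn from the videos or the exercise.

```r
# Hypothetical sample of annual incomes (in $1,000s)
income <- c(52, 48, 61, 55, 47, 58, 63, 50, 57, 54)

# Step 1: state the hypotheses
# H0: mu = 50    HA: mu != 50
mu_0 <- 50

# Step 2: calculate the test statistic
n      <- length(income)
se     <- sd(income) / sqrt(n)          # estimated standard error of the mean
t_stat <- (mean(income) - mu_0) / se

# Step 3: calculate the two-sided p-value from a t-distribution with n - 1 df
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Step 4: compare the p-value against a significance level
alpha  <- 0.05
reject <- p_value < alpha

# The same test in one line
test <- t.test(income, mu = mu_0)

print(c(t_stat = t_stat, p_value = p_value))
print(reject)
print(test)
```

Remember that a decision to reject (or fail to reject) the null depends on the significance level you choose, so it is good practice to report the p-value itself alongside your conclusion.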

Hypothesis testing in the 2020 US presidential election

I am now going to ask you to read an especially prominent and recent usage of hypothesis testing in US politics. Specifically, we are going to read excerpts from the State of Texas v. the Commonwealth of Pennsylvania, the State of Georgia, and the State of Michigan. If you are attentive to US politics and followed the aftermath of the most recent presidential election, you are probably already familiar with the case.

I should state at the outset that I selected this reading for its relevance. (What could possibly be more relevant?) I recognize that the reading centers on a potentially sensitive political issue. So, as we read, analyze, and discuss these passages, I ask that we do so critically but cordially, and that we limit our discussion to what is relevant to the course objectives.

Now read State of Texas v. Pennsylvania, Georgia, and Michigan.

Here are some discussion questions to accompany your reading. Please try to answer the questions upon completing the reading, and be prepared to discuss them when we meet in Week 7:

  • What is the author’s main claim (i.e., the takeaway of the statistical test)?
  • Describe the nature of the statistical tests conducted by the author:
    • What are the null and alternative hypotheses for each test?
    • Given what you learned about hypothesis testing above, and assuming that the author’s math is correct, what are the results of the tests? (Be precise here.)
  • How do you interpret the results of the tests? How does the author interpret them? What do you think about this interpretation?

Practice Questions and Exercise

Once you have worked through the videos and reading, complete the Module 6 Practice Questions, which guide you through several hypothesis tests in R. Once you have finished the Practice Questions, be sure to complete Exercise 3.

Module 3: Know your data (and where it came from)!

Overview

Module 3 builds on the concepts surrounding data and measurement that we tackled last week by asking you to think critically about some fundamental questions: Where do your data come from? Who made them and why? What do they look like? And so on.

In addition, Module 3 asks you to grapple with how we can move toward making causal claims in the social sciences as well as analyze how scholars (i.e., Koch and Nicholson) have tried to overcome obstacles to causal inference.

Objectives

  • Explain the distinction between experimental and observational data; explain why random assignment to treatment can facilitate causal inference
  • Lay out obstacles to causal inference when using observational data as well as some strategies for (potentially) addressing them; analyze scholarship in light of these obstacles and strategies.
  • Develop additional tools to assist data exploration in R.

Assignment 1

Reminder: Assignment 1 is due by 5pm CST on Monday 2/1. I suggest that you download and preview Assignment 1 early in the week so that you can plan your time accordingly.

Where do your data come from?

Watch the brief video “Where do your data come from?” Like last week, the video will sometimes ask you to press “pause” and answer some “class questions.”

Now watch the video “Where do your data come from? (Part 2)”.

I mentioned in the videos that our main aim as social scientists is to make valid causal claims. But, as the videos also noted, there are a host of obstacles in our path, including bad data.

Read Martin, Thinking through Statistics, Chapter 2 (“Know your data”). As you read, pay close attention to different data problems and, in particular, how those data problems have at times ruined prominent research. What can we do to keep ourselves from ruin?

Toward causal inference

Now watch the final video on “Toward causal inference.”

The video suggests that we can sometimes move toward causal inference by moving from the level of general laws to the level of mechanisms. How does this play out in scholarship?

Read Koch and Nicholson (2016, pp. 932–938). As you read, pay close attention to the authors’ argument and discussion of their empirical setup.

(1) What do the authors argue? (You might even take some time to diagram their argument.) (2) How do the authors try to move from establishing mere correlation to identifying causation? (3) Why does the structure of their argument potentially help them toward this end?

Now examine the regression table from the authors’ aggregate analysis. (4) Why do the authors control for the variables that they do? (5) How compelling is the aggregate analysis for helping the authors to make a causal claim? (6) Whether you think it is compelling or not, what work is the aggregate analysis doing for the authors?

Practice questions

In the spirit of really getting to know our data, the practice questions for this week will introduce some techniques that will ease the data cleaning and exploration process. In particular, we will practice using loops with for() as well as the dplyr package to summarize and reshape our data.
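To give you a flavor of both tools before you open the practice questions, here is a small sketch. The data frame `survey` and its columns are made up for illustration. The dplyr portion runs only if the package is installed, and I call its functions with the `dplyr::` prefix so that the example does not depend on the package already being loaded.

```r
# Hypothetical toy data frame
survey <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  income = c(40, 55, 32, 47, 38),
  age    = c(34, 51, 29, 62, 45)
)

# A for() loop: calculate the mean of each numeric column, one at a time
numeric_cols <- c("income", "age")
col_means <- numeric(length(numeric_cols))   # empty container for the results
names(col_means) <- numeric_cols

for (col in numeric_cols) {
  col_means[col] <- mean(survey[[col]])
}
print(col_means)

# dplyr: summarize income by region (skipped if dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_region <- dplyr::summarise(
    dplyr::group_by(survey, region),
    mean_income = mean(income),
    n = dplyr::n()
  )
  print(by_region)
}
```

The loop is the more general tool, since it can repeat any block of code. For grouped summaries like the one above, dplyr is usually shorter and easier to read.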

Download the “Module 3 practice questions” RMD files (Parts 1 and 2) from Canvas. Note that the practice questions on loops include a couple of especially challenging questions (e.g., the de Montmort Problem). I suggest that you complete the simpler questions and then move on. You can return to the more challenging ones after you have completed Part 2 as well as Assignment 1.

Module 2: Cases, variables, and measurement

Overview

Module 2 asks you to engage some fundamental questions: What are data? Cases? Variables? It additionally asks you to grapple with critical issues surrounding measurement as well as consider how these issues play out in Daniel Treisman’s famous article on “The causes of corruption” (2005). Finally, the Module introduces important statistical and visual tools for exploring and describing variables.

Objectives

  • Define data, cases, and variables
  • Explain the qualities of “good” measurement; begin to analyze scholarship in light of these qualities
  • Grasp the intuitions behind common measures of central tendency and spread; become familiar with their notation and learn to calculate them in R
  • Explore data in R using summary functions, tables, and plots

Exercise 1

Download, complete, and submit Exercise 1 by 5pm CST on Monday, 1/25. The file is available at Canvas. I recommend that you preview the Exercise early in the week so that you can plan your time accordingly.

What is data?

Read OIS sections 1.1–1.5. Some of this material may be review, but don’t worry if it isn’t!

Now watch the brief video on “What is data?”

Variables and measurement

As the video mentions, we will dig into some actual data as we complete the practice questions. But first, watch the video on “Variables and measurement.” Note that the video will occasionally ask you to press “pause” and then spend some time answering some “class questions.” You don’t have to submit your answers to these questions. However, quickly jotting your answers down may be useful, as we will discuss some of the questions during our section meetings.

How do issues surrounding conceptual clarity, validity, and reliability play out in social science scholarship? In a moment, you will read “The causes of corruption,” by Daniel Treisman (2005). This is a famous article that uses linear regression to test common hypotheses about the causes of corruption worldwide. Before you read, take some time to answer some questions. (As above, I suggest that you jot your answers down somewhere): (1) What is corruption? Define it. (2) Imagine that you wanted to measure corruption levels across countries. How would you go about doing that? What kind of data would you look for? (3) What are some advantages of your approach? What are some of its disadvantages?

Now read the article.

Once you have finished reading, answer these questions: (1) Do you think that Transparency International (TI) Index scores are valid and reliable as a measure of corruption? (2) Treisman makes the case that they are. What does he argue to support the validity and reliability of his measure? (3) Are you persuaded? Why or why not?

Describing variables

Now read OIS sections 1.6 and 1.7. Then, watch the video on “Describing Variables: Measures of Central Tendency and Spread.”
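Here is a brief sketch of how these measures can be calculated in R. The `income` vector is made up for illustration, and I have deliberately included one very large value so that you can see how differently the mean and the median respond to an outlier.

```r
# Hypothetical annual incomes (in $1,000s), including one outlier
income <- c(28, 35, 31, 42, 50, 47, 61, 250)

# Measures of central tendency
income_mean   <- mean(income)
income_median <- median(income)

# Measures of spread
n <- length(income)
var_by_hand <- sum((income - income_mean)^2) / (n - 1)   # sample variance
income_var  <- var(income)                               # built-in equivalent
income_sd   <- sd(income)                                # square root of the variance
income_iqr  <- IQR(income)                               # interquartile range

print(c(mean = income_mean, median = income_median))
print(c(var_by_hand = var_by_hand, var = income_var, sd = income_sd, iqr = income_iqr))

# A quick overview of minimum, quartiles, mean, and maximum
print(summary(income))
```

Because the mean uses every value, the single outlier pulls it well above the median. The same is true of the standard deviation relative to the interquartile range.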

And now, watch the video on “Describing Variables: Tables and Plots.”
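The sketch below shows a few base R functions for tabulating and plotting variables. The data (`party` and `income`) are made up for illustration, and the plot titles and labels are just placeholders that you would replace with descriptions of your own variables.

```r
# Hypothetical data: party identification and annual income (in $1,000s)
party  <- c("Dem", "Rep", "Ind", "Dem", "Rep", "Dem", "Ind", "Dem")
income <- c(28, 35, 31, 42, 50, 47, 61, 66)

# Frequency table and table of proportions for a categorical variable
freq <- table(party)
prop <- prop.table(freq)
print(freq)
print(round(prop, 2))

# Bar plot for a categorical variable
barplot(freq, main = "Party identification", ylab = "Count")

# Histogram for a numeric variable
hist(income, main = "Distribution of income", xlab = "Income ($1,000s)")

# Box plots of a numeric variable, split by a categorical variable
boxplot(income ~ party, main = "Income by party", ylab = "Income ($1,000s)")
```

As a rule of thumb, tables and bar plots suit categorical variables, while histograms and box plots suit numeric variables.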

Let’s apply some of the concepts presented above! Download the “Module 2 practice questions” RMD from Canvas (“Files/Practice Questions”). Note that there are 2 parts to the practice questions. Complete the practice questions in RMD and then knit your files to PDF.

Remember: If you get stuck at any point… breathe. Coding can be frustrating at first, but we will work through it together. There are lots of ways to seek help:

  1. Use the “Help” tab in RStudio
  2. Internet search
  3. Post your question to Ed Discussion
  4. As a final option, email me directly or visit me during my office hours

As you seek help, try to specify the nature of the problem: Examine any warnings or error messages. What line of code seems to be the issue? Which function, specifically? (During knitting, Markdown will often tell you which line of code is stalling the knitting process.) If you are getting error messages, are you missing parentheses, commas, or quotations? (This happens to me all the time.) Answering these questions will help to ensure that you get the help you need.