Module 3: Theory, hypotheses, and arguments

Overview

Module 3 builds on Module 2 by examining theory, hypotheses, and arguments. We distinguish between these concepts but also reflect on how they work together within a final research report: In particular, we focus on the linkage between theory (our answer / logic) and hypotheses (the empirical implications of our theory). We examine some common pitfalls that crop up whenever we find ourselves working with imperfect data as well as some strategies for avoiding them.

Objectives

  • Differentiate theory, hypotheses, and arguments.
    • Explain how these concepts work together within a research report.
  • Identify common pitfalls in linking theory and hypotheses and explain how we can avoid them.
  • Analyze the linkage between theory and evidence as it crops up in Lessing and Willis.
  • Recall and practice the basics of cleaning and transforming data using dplyr.
  • Reflect on how to give substantive and constructive peer feedback during the week’s team meeting.

Due

Mon. @ 8pm CST. Write 1-2 paragraphs of feedback on each of your team members’ initial research designs. Post your comments as an attachment beneath your team’s sub-heading at Ed Discussion. If you are the first member of your team to post, please create a new thread.
Thu. / Fri. @ 8pm CST. Post your Q & A about the week’s reading at Ed Discussion.

Theory, hypotheses, and arguments

Before diving into the week’s concepts, let’s briefly review some of the concepts that we examined in Module 2 on “Research areas, topics, and questions.” Toward that end, watch the video on “Recap of Module 2.”

Now that we have examined research questions, let’s turn our attention to theories, hypotheses, and arguments. Before you watch the videos below on this topic, read the Chapter by Bryans et al. on “Explaining the Social World.” As you read, think about your answers to the following questions. Doing so will prepare you to better engage the video content and, hopefully, assist you as you progress in your research:

  • What is theory? What are hypotheses? How do these concepts go hand-in-hand?
    • Think about your own research proposal: Did it have a theory (or just hypotheses)? What is your theory?
  • How did you develop your theory? Did you mainly use induction or deduction?
  • What assumptions do you make in your theory?
  • If your theory is correct, what do you expect to observe in the world? In other words, what is the best evidence you could find that would tell you that your theory is correct?

Now watch the videos on “Theory, hypotheses, and arguments” and “Why theory?”

The next video aims to get you thinking critically about how we can effectively link theory to hypotheses: “If your theory is correct, what should we observe in the world?” Doing this well is not easy, in part because we rarely find perfect evidence in support of our theories. When this happens, it’s easy to fall into some common pitfalls rather than take the more appropriate step of gathering additional evidence or modifying our theory to ensure it is useful despite empirical limitations. The video examines some of these common pitfalls as well as some strategies for avoiding them. Be prepared to pause the video throughout, as it is meant to be a sort of exercise in “spotting the fallacy.”

You will now read “Legitimacy in Criminal Governance” by Lessing and Willis. I selected the article not only because it is one of the most interesting academic articles you will ever read but also because it exemplifies the meme that I presented in the video on “Why theory?” above. That is, Lessing and Willis help us to interpret/connect/make sense of some rich data that I think are extremely puzzling.

Before you read, think about your intuitions to the following questions:

  • How do you think criminal organizations (e.g., mafias and drug gangs) keep their members in line?
  • If your answer (theory) is correct, what would you expect to find if you could examine the internal workings of one of these criminal organizations?

Now read the article. When you have finished, watch the video on “Anatomy of an argument.”

R exercise on tidy data and dplyr review

If you are completing the R exercises each week, be sure to download the exercise for Module 3 from Canvas. This week’s exercise is meant to help you brush up on your data cleaning and transforming skills using the dplyr package. While this is less fun than the web scraping we did last week, it is probably more useful, as most of you will have to spend some time cleaning, transforming, and merging any data that you obtain for use in your final research report.
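
If you would like a quick warm-up before opening the exercise, here is a minimal sketch of the core dplyr verbs using a small, made-up data frame (all variable names here are hypothetical):

library(dplyr)

survey <- data.frame(id = 1:4,
                     age = c(23, 35, NA, 51),
                     region = c("north", "south", "south", "north"))

survey %>%
  filter(!is.na(age)) %>%            # clean: drop rows with missing age
  mutate(over_30 = age > 30) %>%     # transform: create a new variable
  group_by(region) %>%               # summarize within groups
  summarize(mean_age = mean(age))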

Module 2: Research areas, topics, and questions

Overview

Module 2 examines research questions: in particular, the attributes of a good research question, and a general process we can follow to move from a broad research area to a more specific topic and, finally, a focused question. The module will also ask you to reflect on these ideas as you draft your own proposed research design.

Objectives

  • Grasp the attributes of a good research question
  • Grasp the process for developing a good research question and begin to implement it
  • Examine the development of a research question by engaging a real-world example
  • Examine and practice web scraping in R

Due this week

  • Thu and Fri @ 8pm CST: Ed discussion Q & A.
  • Fri @ 8pm CST: Proposed research design. Submit via the Assignments section at Canvas. Also post as an attachment beneath your team’s heading at Ed Discussion. NOTE: You should read each other’s proposals and prepare 1-2 paragraphs of feedback for each one. You will be required to post your written feedback to Ed Discussion by Mon. 4/12 @ 8pm (Week 3).

Some guidance on research designs

Your proposed research design should be a roughly 1-page, single-spaced document. It should do three things: (1) lay out a research question and justify it; (2) delineate a brief answer to the question; and (3) discuss the sort of data you will need in order to answer the question and why.

The readings and videos for this week are meant to help you to develop your research question and design. So, if you can, try to complete the reading and watch the videos relatively early in the week.

I want to preview a couple of key points from the reading and videos here as a means of reassuring and guiding you. In particular:

  • Your research question and design will evolve. In fact, they will likely continue to evolve even as you conduct your analysis and write up your results toward the end of the quarter. (I am still constantly revising the design document for my own dissertation as it becomes a book!) This can be frustrating, but it is inevitable. You will be moving in and out of the literature and data surrounding your question in upcoming weeks, and in doing so, you will get a much better idea about what makes a “good” question as well as the sorts of questions you can examine given the available data.
  • For part 3, it is not necessary that you have specific data / datasets in mind. You can instead approach this as a thought experiment: Given your question and tentative answer, what are your ideal data and why? What sort of measures would you obtain? What would the units of analysis be? Would the data be cross-sectional, panel / time-series, or something else? Are the data likely to derive from an experiment, survey, or machine-learning algorithm, or are they likely to have been hand-coded by scholars? The more you think through these and related questions, the better. Thinking carefully about your ideal data now will help you to identify workable data later.

The Module

Attributes of good questions and developing your question

First read the chapter on “Beginning the research process” by Buttolph Johnson. As you read, think about your answers to the following questions. You don’t need to write your answers down, but thinking about them will help you to develop your own research question:

  • What are some attributes of a good research question?
  • How does one actually go about developing a good research question?
  • How does the literature review go hand in hand with the development of a good question?

Once you have completed the Buttolph Johnson chapter, watch the three videos below on “Research areas, topics, and questions”. Note that you will need to enlarge the videos in order to view them.

Anatomy of a research question

Now read the article by Albertus and Deming on “Branching out.” You should read the entire article. In your reading, you might try to implement some of the advice laid out by Dane’s chapter on “Reading and structuring research” in Module 1. In addition, in line with this week’s overarching topic, here are some questions to think about as you read. To the degree that there is time, we may discuss some of these questions during our team meetings this week:

  • What is the central question that the authors ask?
  • What have other scholars found in terms of answers to this question (or very similar questions)?
  • Given that the question is not really new, what contribution – if any – do you think that the authors make?
  • What sort of data would you want – in theory – in order to answer the question posed by the authors?
  • What data do the authors actually use, and how do they differ from the ideal data that you described above?
  • How do the authors justify the data that they use?

Once you have read the article, watch the video on “Anatomy of a research question.”

R exercise on web scraping

Remember that you are not required to complete and submit the weekly R exercise for a grade. They are a voluntary tool for introducing you to some operations in R that you may find useful this quarter and/or in the future. This week’s exercise introduces you to one procedure for scraping data from web pages written in HTML. I am not an expert on this topic, so any data scientists with experience in web scraping should feel free to post additional resources and advice at Ed Discussion.

Note that for this exercise, you will need to add SelectorGadget to your web browser. I use Chrome and have installed the SelectorGadget extension from Chrome’s Web Store (https://tinyurl.com/298y44yt). For other browsers, you can simply drag the SelectorGadget bookmarklet to your bookmarks toolbar. You can find the bookmarklet at selectorgadget.com.
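
To give you a feel for the workflow before you start, here is a minimal sketch of scraping with the rvest package. The URL and CSS selector below are placeholders; in the exercise you will use SelectorGadget to find the right selector for your target page:

library(rvest)

# Read the page, select elements with a CSS selector, extract their text
page <- read_html("https://example.com")   # placeholder URL
page %>%
  html_nodes("p") %>%    # CSS selector found with SelectorGadget
  html_text()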

I have included a very brief video below that introduces the selector tool as it applies to this week’s R exercise.

Module 1: Reading and structuring research

Overview

Module 1 begins by walking through the course Syllabus. It then kicks off our examination of the different components of good research design by laying out strategies for critically reading and structuring a quantitative social science research report. It then asks you to critically examine a summary report by the Chicago Million Dollar Blocks Project as well as use the Project’s data to create a basic plot using ggplot2.

Objectives

  • Logistics
    • Examine the course Syllabus
    • Enroll in weekly team meeting via email
  • Reflect on strategies for critically reading quantitative social science research
  • Describe the main components of a social science research report, their contents, and their purpose
  • Reflect on the linkage between theoretical claims and data analysis via examination of the Chicago Million Dollar Blocks Project.

Due

  • You should send an email to me with 3 times that you are available to regularly meet with a team of 3-5 students. See the Syllabus for details and available times. Email me by Thursday evening.
  • Ed Discussion Q&A are due at 8pm CST on Thursday (Q) and Friday (A) each week. This week, you only need to post one comment by Friday at 8pm. Your comment should center on the Chicago Million Dollar Blocks Project. See below.
  • The Module 1 exercise is due at 8pm CST on Friday.

Module 1

Syllabus

Watch the video on “Course Syllabus.” If you have questions about the Syllabus after watching the video, please post your question to Ed Discussion under “Course logistics” so that other students can view the answer.

Course Intro.

Course format

Assignments and grading

Final research report

Course policies

Course schedule

Reading and structuring research

How to read and structure research may seem relatively straightforward; it may even seem obvious. You are at Chicago, after all, in part because you are a good reader. You are also probably a decent writer and have some good intuition about how to do research. But quantitative social science research can be a strange form. It is thus critical that we become deeply familiar with the form — its structure, conventions, jargon, and so forth — so that we can read it critically and, ultimately, apply it to our own research. With this in mind, I encourage you to reflect on your answers to the following questions as you read the Dane and Buttolph Johnson chapters:

  • How does critical reading of quantitative research differ from reading, say, an ethnographic study?
  • What are the different components of a quantitative research report?
  • And more critically perhaps: What should each component do? What is its contribution to the report as a whole?

Now read “Reading a research report” by Dane and “The research report: An annotated example” by Buttolph Johnson.

The Chicago Million Dollar Blocks Project

In addition to examining and reflecting on the different components of quantitative social science research, we are going to examine lots of data and data visualizations this quarter. We will do so mainly via readings, brief videos, and weekly exercises in R. The aim here is to get you thinking about how to effectively support a theoretical claim using good data and detailed data analysis.

Go to the Chicago Million Dollar Blocks Project (https://chicagosmilliondollarblocks.com/). Carefully read the web page and examine the data presented there.

The Project begins to exemplify what you will do this quarter: In particular, it uses high-quality data to support a claim about a high-interest and pressing social topic. But there are also areas in which the “report” shown at the Project’s web page could be improved. In this vein, think about your answers to the following questions as you study the page:

  • What is the aim of the web page? What is it trying to do exactly?
  • What is a “war on neighborhoods”?
  • What evidence exists of a war on neighborhoods?
  • What sort of statistic would be clear and compelling evidence in support of the Project’s main claim (as you understand it)?
  • Are you persuaded? If not, how might you improve the web page to make it more compelling?

For this week’s Q&A, post one way that you might try to improve the “report” shown at the Project’s web page. Post your comment to the thread that I have created at Ed Discussion by Friday evening at 8pm CST. The thread is under “Reading Q&A / Chicago Million Dollar Blocks Project.” You are not required to post an additional reply this week, but you may do so if you wish.

Once you have scrutinized the Project web page and posted at Ed Discussion, complete the Module 1 exercise, which uses data from the Chicago Million Dollar Blocks Project. Download the exercise from the “Module 1” folder located at the “Files/Module 1” section of Canvas. Also download the corresponding dataset. Complete the RMD file, render it to PDF, and upload your PDF to the “Assignments” section of Canvas.
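
If you would like a quick reference for the plotting portion, here is a minimal ggplot2 sketch using made-up data (the variable names are hypothetical; the exercise supplies the real dataset and variables):

library(ggplot2)

df <- data.frame(block = 1:5,
                 cost_millions = c(1.2, 0.8, 2.5, 1.9, 0.4))

# A basic bar plot with labeled axes
ggplot(df, aes(x = block, y = cost_millions)) +
  geom_col() +
  labs(x = "Block", y = "Incarceration spending ($ millions)")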

Module 9: OLS Assumptions and Extensions

Overview

Module 9 introduces the idea that the OLS estimator is BLUE: It is the best linear unbiased estimator available. But this requires that some important assumptions hold. Module 9 thus lays out these assumptions as well as methods for checking for potential violations. Module 9 then lays out some common OLS extensions, including dummy and categorical independent variables and interaction terms.

Objectives

  • Explain what we mean when we say that the OLS estimator is BLUE.
  • Grasp the intuition behind the core OLS assumptions.
  • Examine and begin checking assumptions in R; explain what violations of different assumptions mean for statistical inference.
  • Incorporate and interpret dummy and categorical independent variables as well as interaction terms in linear regression.

Assignment

Assignment 4 is due by 11:59pm CST on Wednesday, 3/17. Be sure to download the Assignment 4 RMD and review it early in the week so that you can plan your time accordingly.

The Module

As you work your way through the video lectures, note that I again draw many examples from the “CASchools” dataset that is available upon installing and loading the “AER” package in R. You should feel free to follow along with these examples in R, pausing the video as necessary. To do so, start by running this code, which loads the required package and creates our main IV and DV:

# Load the data
require(AER)
data(CASchools)

# Dependent variable
CASchools$score <- (CASchools$read + CASchools$math)/2

# Independent variable
CASchools$STR <- CASchools$students/CASchools$teachers

# Our simple model
model <- lm(score ~ STR, data = CASchools)

OLS Assumptions: BLUE

First read OIS Section 8.3 on “Checking Model Assumptions Using Graphs.” Then watch the video lecture series on “OLS Assumptions.”

OLS Assumptions: The Core Assumptions

OLS Assumptions: Checking Assumptions
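
If you want to try checking assumptions yourself as you watch, a simple starting point is R’s built-in diagnostic plots, applied here to the simple CASchools model created above (a minimal sketch, not the full battery of checks from the videos):

# Four diagnostic plots for the model fit above: residuals vs. fitted,
# normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))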

OLS Extensions

We have so far mainly included numeric variables in our regression models. These are by far the easiest to interpret. But we might also wish to include other variable types in our models, including dummy and categorical variables. The main challenges here concern interpretation and procedure. In terms of interpretation, we can’t straightforwardly interpret the coefficient for a dummy independent variable as “a one-unit change in X is linked to a change in Y.” In terms of procedure, we need to be somewhat careful in our syntax whenever we run a regression that includes a categorical independent variable.
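
To make the procedural point concrete, here is a minimal sketch using the CASchools data loaded above; the 10 percent cutoff for the Hi_ELL dummy is my own illustrative assumption, and the videos define the variable precisely:

# Dummy: 1 if the district's share of English-language learners exceeds 10%
# (an illustrative cutoff)
CASchools$Hi_ELL <- ifelse(CASchools$english > 10, 1, 0)
summary(lm(score ~ STR + Hi_ELL, data = CASchools))

# Categorical: include a factor variable directly and R creates the dummy
# variables for you, omitting one category as the reference
summary(lm(score ~ STR + grades, data = CASchools))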

Start by reading Kellstedt and Whitten, the Fundamentals of Political Science Research, pp. 202-212. Then watch the brief video series on “OLS Extensions.”

Extensions: Dummy independent variables

As you watch the next video, note that there is an error on my last slide, which I neglected to correct during recording. Specifically, when you come to the last slide, note the final bullet point: STR should equal 0. Thus, the correct interpretation should read “On average, when STR = 0, we expect schools with Hi_ELL to have test scores around 692.361 – 19.533 = 672.8.”

Extensions: Categorical independent variables

Extensions: Interaction terms
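
And here is a minimal sketch of an interaction term, continuing the CASchools example and the illustrative Hi_ELL dummy defined above:

# The * operator includes both main effects and their interaction;
# the STR:Hi_ELL coefficient tells us how the STR slope differs when Hi_ELL = 1
summary(lm(score ~ STR * Hi_ELL, data = CASchools))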

Module 8: Multivariate regression

Now watch the video on “Omitted variable bias: The mechanics.” Note that this video uses maths to demonstrate the intuition that I laid out in the first video. The maths are limited to linear algebra, so I encourage you to follow along with the different steps as best as possible. I happen to think that this is one instance in which the maths do help us toward a better grasp of the concept.
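
If you would like to see the mechanics in action before (or after) the video, here is a minimal simulation sketch of omitted variable bias using entirely made-up data:

# True model: y depends on both x and a confounder z that is correlated with x
set.seed(1)
n <- 1000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)   # the true effect of x is 2

coef(lm(y ~ x))       # omitting z biases the estimate of x's effect upward
coef(lm(y ~ x + z))   # controlling for z recovers an estimate near 2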

Multivariate regression

Now that we have examined OVB, how can multivariate regression help to mitigate it? We examine that question in the next video. Note that in this video and those that follow, I will walk you through some examples using the CASchools data that are contained in the AER package in R. You can follow along. Simply use the code below to load the data as well as create our dependent and independent variables:

# Load the data
require(AER)
data(CASchools)

# Dependent variable
CASchools$score <- (CASchools$read + CASchools$math)/2

# Independent variable
CASchools$STR <- CASchools$students/CASchools$teachers

# Our simple model
model <- lm(score ~ STR, data = CASchools)

Now watch the videos on “Multivariate regression: The logic” and “Multivariate regression: Prediction”.
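
As a minimal sketch of what multivariate regression looks like in practice, we can add a control to the simple model above (the choice of control here is mine, for illustration):

# Add the district's share of English-language learners as a control
model_multi <- lm(score ~ STR + english, data = CASchools)
summary(model_multi)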

One of the last points I make in the video on “Prediction” is that we should avoid so-called “kitchen sink” or “garbage can” regression models that contain lots of control variables not rooted in good theory. So what criteria should guide our control strategy? Let’s see what other scholars suggest by way of an answer to this question.

First read Chapter 4 from John Martin’s book on “Thinking through statistics.” What strategies does the author suggest? What are some of the pitfalls that can crop up as we embark on a control strategy, and how do we avoid them?

How do Martin’s arguments play out in actual scholarship? Earlier in the quarter, we read the first portion of Koch and Nicholson’s article on “Death and Turnout.” You should now read the remainder of the article. Here are some questions to guide your reading and prepare you for the practice questions and exercise:

  • What is the authors’ dependent variable?
  • What is their main independent variable?
  • What is the expected direction of the relationship between IV and DV?
  • What are some variables that may confound this relationship?
  • What variables do the authors control for and why? (Focus on their discussion of Table 4 on p. 942).
  • What do you think about these controls? Do you buy the authors’ reasoning? Why or why not?
  • Finally, pay close attention to the authors’ interpretation of their results. Note the specific language they use to discuss statistical and substantive significance. (As above, focus on their discussion of Table 4).

Practice questions and exercise

Now complete the Module 8 practice questions. These are short and straightforward; I have done much of the coding for you as a means of walking you through running and presenting multivariate regression in R. But note that Exercise 4 builds directly from the practice questions, so be sure to work through them.

Once you have completed the practice questions, be sure to download and complete Exercise 4.

Module 7: Bivariate regression

Overview

In Module 7, we begin conducting bivariate analysis. Toward this end, the Module introduces various tools for examining linear relationships between variables and testing them for statistical significance. In particular, the Module moves from measures of joint fluctuation such as covariance and correlation to bivariate linear regression.

Objectives

  • Calculate covariance and correlation as well as explain what these measures capture conceptually.
  • Explain how the OLS regression line helps us model the relationship between two variables.
  • Perform bivariate OLS regression in R and interpret the regression output.

Assignment

Assignment 3 is due at 11:59pm CST on Monday, 3/01. The Assignment 3 RMD file can be downloaded at the “Files/Exercises and Assignments” section at Canvas. Be sure to download and review the file early in the week so that you can plan your time accordingly.

Module

The video lectures for this week kick off with a very brief review of some of the assumptions that power the Central Limit Theorem. Watch the video on “CLT assumptions revisited” (I note that this title differs from the text shown on the title slide of my presentation.)

Let’s now turn to the central topic for this week’s Module: bivariate analysis. Note that so far, we have been performing what’s known as “univariate” analysis. This means that we have been examining single variables rather than relationships between variables. For instance, in Module 6, we focused primarily on the mean of a single variable such as annual income. (Recall our running Somersville example.) We then used hypothesis tests to assess whether some observed value of our sample mean provided evidence to refute some hypothesized value of the underlying population mean.

For the remainder of the quarter, we turn our attention to “bivariate” and “multivariate” analysis. That is, we will examine relationships between two or more variables. Ultimately, the basic procedures for such analysis are very similar to those we have used up until now: We will use sample statistics to estimate population parameters; we will calculate the typical error associated with our estimates; we will construct CIs; we will calculate test statistics; and we will perform hypothesis tests.

Measures of joint fluctuation

Start by reading Imai, K. (2017). Quantitative Social Science: An Introduction. Read Sections 3.6, 4.2.1, and 4.2.2. The PDF is available at the “Files” section at Canvas.

Now watch the two videos on “Correlation” (Introduction and Significance Tests, respectively).
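
If you would like to follow along in R, here is a minimal sketch using the built-in mtcars data (my own choice of example data):

# Covariance and correlation between car weight and fuel efficiency
cov(mtcars$wt, mtcars$mpg)
cor(mtcars$wt, mtcars$mpg)

# Significance test for the correlation
cor.test(mtcars$wt, mtcars$mpg)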

Bivariate regression: Introduction

Now read OIS Sections 7.1–7.4. Then, watch the video series on “Bivariate regression.” Note that I have divided what would normally be a single lecture into a series of short videos organized by sub-topic.

Bivariate regression: The regression line

Bivariate regression: Calculating regression coefficients

Bivariate regression: Prediction, tests of significance, and interpretation
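
Here is a minimal sketch of the full bivariate workflow in R, again using the built-in mtcars data as a stand-in:

# Fit the model, inspect coefficients and significance tests, and predict
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                                 # coefficients, SEs, t-tests, R-squared
predict(fit, newdata = data.frame(wt = 3))   # predicted mpg at wt = 3 (3,000 lbs)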

Once you have completed your reading and watched the video lectures above, download and complete the Module 7 practice questions. They can be downloaded at the “Files/Practice Questions” section at Canvas.

Finally, be sure to download and complete Assignment 3, which is located at the “Files/Exercises and Assignments” section at Canvas.

Module 6: Hypothesis testing

Overview

Module 6 builds on Modules 4 and 5 on “Foundations for inference” by examining (univariate) hypothesis testing. We examine the t-distribution as a means of accounting for uncertainty in our estimates when our sample size is small or the population standard deviation is unknown. We then examine hypothesis tests for one and two populations using the t-distribution, and we begin to conduct such tests in R. Finally, we read and analyze one especially prominent and recent use of hypothesis testing in US politics.

Objectives

  • Explain what a t-distribution is and why we use it.
  • Define null hypothesis, alternative hypothesis, and type 1 and 2 errors.
  • Explain and put to practice the steps in conducting hypothesis testing.
    • State a null and alternative hypothesis
    • Calculate appropriate test statistics
    • Calculate p-values
    • Compare p-values against different significance levels and interpret the results
  • Critically analyze real-life usages of hypothesis testing.

Assignment

Exercise 3 is due at 5pm CST on Monday, 2/22. The Exercise 3 RMD file is located at Canvas. Be sure to download and skim through the file early in the week so that you can plan your time accordingly.

Mid-course feedback

Please look for a Canvas announcement from me later in the week with a link to a Google Form that will allow you to provide me with anonymous, mid-course feedback. I am always looking for ways to improve my instruction and better meet your needs and aims, so please do take a moment to complete the Form once it is published.

Module

t-distribution

We’re going to begin by refining our grasp of standard errors and confidence intervals. In particular, we’re going to examine the t-distribution as a conservative alternative to the normal distribution that will allow us to account for additional uncertainty whenever our sample size is small or we estimate the population standard deviation using the sample standard deviation.

Now watch the video on the “t-distribution.”
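
To see the “conservative” part concretely, here is a minimal sketch comparing critical values from the normal and t-distributions in R:

# 95% two-sided critical values
qnorm(0.975)         # normal: about 1.96
qt(0.975, df = 9)    # t with 9 degrees of freedom: about 2.26 (wider intervals)
qt(0.975, df = 99)   # approaches the normal as the sample size grows: about 1.98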

(Univariate) Hypothesis testing

Now read OIS Sections 4.3 through 4.5. Note that the Syllabus lists some additional sections of OIS that are suggested but optional.

After you have completed the reading, watch the video on “Hypothesis testing 1.”

Now watch the video on “Hypothesis testing 2.”

I realize that the videos for this module are longer than usual, but I wanted to be sure to walk through some examples step-by-step so that you can begin conducting your own hypothesis tests in the practice questions and exercise.
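
Here is a minimal sketch of a one-sample t-test in R, with simulated data and made-up values for illustration:

# H0: the population mean income is 50; HA: it is not 50
set.seed(42)
income <- rnorm(100, mean = 52, sd = 10)   # stand-in for a sample of incomes
t.test(income, mu = 50)   # reports the t statistic, df, p-value, and a 95% CI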

Hypothesis testing in the 2020 US presidential election

I am now going to ask you to read an especially prominent and recent usage of hypothesis testing in US politics. Specifically, we are going to read excerpts from the State of Texas v. the Commonwealth of Pennsylvania, the State of Georgia, and the State of Michigan. If you are attentive to US politics and followed the aftermath of the most recent presidential election, you are probably already familiar with the case.

I should state at the outset that I selected this reading for its relevance. (What could possibly be more relevant?) I recognize that the reading centers on a potentially sensitive political issue. So, as we read, analyze, and discuss these passages, I ask that we do so critically but cordially, and that we limit our discussion to what is relevant to the course objectives.

Now read State of Texas v. Pennsylvania, Georgia, and Michigan.

Here are some discussion questions to accompany your reading. Please try to answer the questions upon completing the reading, and be prepared to discuss them when we meet in Week 7:

  • What is the author’s main claim (i.e., the takeaway of the statistical test)?
  • Describe the nature of the statistical tests conducted by the author:
    • What are the null and alternative hypotheses for each test?
    • Given what you learned about hypothesis testing above, and assuming that the author’s math is correct, what are the results of the tests? (Be precise here.)
  • How do you interpret the results of the tests? How does the author interpret them? What do you think about this interpretation?

Practice Questions and Exercise

Once you have worked through the videos and reading, complete the Module 6 Practice Questions, which guide you through several hypothesis tests in R. Once you have finished the Practice Questions, be sure to complete Exercise 3.

Module 5: Foundations for inference 2

Overview

Module 5 builds directly from Module 4 by examining how we can measure and report the uncertainty surrounding our estimates of population parameters. In particular, it introduces the concepts of standard error and confidence intervals. It then aims to build your familiarity with these concepts through a set of practice questions that will help you simulate and visualize their core logic.

Objectives

  • Explain what standard error and confidence intervals (CIs) capture, conceptually
    • Link these concepts to the CLT and properties of the normal distribution
  • Calculate standard error and build CIs for numeric and indicator variables manually
  • Use R to calculate standard error, build CIs, and examine the mechanics of these concepts
    • Grasp the precise meaning of CIs by building and visualizing 100 CIs around sample means drawn from a simulated population

Assignment

Assignment 2 is due at 5pm CST on Monday, 2/15. The Assignment 2 RMD file is located at Canvas. Be sure to download and skim through the file early in the week so that you can plan your time accordingly.

Module

Foundations for inference

We cover less new conceptual terrain this week than we did in Module 4. This is partly because we want to focus in this module on linking the different concepts that we have examined so far. In particular, as we examine standard error and confidence intervals below, you should at times pause and try to answer one very important question: “How are these ideas linked to the CLT and/or properties of the Normal distribution?”

Because the CLT and normal distribution are so critical for this week’s module, it’s worth taking a moment to review them. Toward that end, watch the video on “Foundations for inference 2.”

Standard error and confidence intervals

Now read OIS, Sections 4.1 through 4.2 (pp. 169-179). Note that the Syllabus incorrectly states that you should also read Section 4.3. This is an error; we will read Section 4.3 next week when we begin performing hypothesis tests.

Now watch the videos on “Standard error” and “Confidence intervals”.
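
As a minimal sketch of building a CI “by hand” in R (simulated data; the population values are made up):

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

se <- sd(x) / sqrt(length(x))                   # standard error of the mean
mean(x) + c(-1, 1) * qt(0.975, df = 49) * se    # 95% confidence interval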

Extending confidence intervals to indicator variables

Let’s now see how we might extend the logic of CIs to indicator variables. We have already encountered indicator variables while working with the British Election Study (BES) data in Exercise 1 and Assignment 1: In particular, we created the indicator variable female, which took the value 1 if the respondent was female and 0 if the respondent was male. Can we apply CIs to this sort of variable?

Think about this for a moment: In Assignment 1, many of you correctly noted that female is a (nominal) categorical variable, and accordingly, it seems strange to calculate its mean and standard deviation. But many of you also noted that the mean does supply us with some useful information about the variable: namely, the proportion of respondents in the data that are female. This is a good intuition. When we deal with indicator variables, we are – strictly speaking – interested in a proportion rather than an average or central tendency (which are captured by the arithmetic mean). Accordingly, we have to shift our thinking and terminology somewhat. But don’t be thrown by this shift! Although the next video introduces some new parameters (π) and estimators (p-hat) – as well as some math – the concepts themselves are fairly straightforward. For instance, you will find that the formula for calculating the standard deviation of the sample proportion (p-hat) initially looks very similar to the formula for calculating the standard deviation of a numeric variable.

With this in mind, watch the video on “Confidence Intervals: Extensions”.
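
Here is a minimal sketch of the extension, with made-up numbers: a 95% CI for a sample proportion.

# Suppose 530 of 1,000 respondents are female
p_hat <- 530 / 1000
se <- sqrt(p_hat * (1 - p_hat) / 1000)   # standard error of the sample proportion
p_hat + c(-1, 1) * 1.96 * se             # 95% confidence interval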

Note that we did not read the sections from OIS that deal with the sample proportion. If you would like to do so, the relevant section is Section 3.3 (“Geometric distribution”) on pages 141-145. (The following section on the “Binomial distribution” is also useful.) We will read the section on the sampling distribution of the sample proportion next week (Section 4.5).

Practice questions

Now complete the practice questions, which can be downloaded at the “Files/Practice Questions” section at Canvas. You’ll note that the functions that you used to complete Exercise 2 will be very handy here. In fact, we are going to “prove” the logic of CIs in the very same way that we “proved” the CLT: experimentally, through simulation. The real payoff in using R to simulate our data and build CIs is that we can actually visualize what we mean when we interpret our CIs by saying that “we constructed this CI using a method that produces CIs that contain the true population parameter 95 of 100 times.” Let’s attach some real meaning to this phrase!
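
To preview the logic, here is a minimal simulation sketch using my own made-up population (the practice questions supply their own setup):

# Draw 100 samples from a N(10, 2) population; check how many 95% CIs
# contain the true mean of 10
set.seed(1)
covers <- replicate(100, {
  x <- rnorm(50, mean = 10, sd = 2)
  se <- sd(x) / sqrt(50)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = 49) * se
  ci[1] <= 10 && 10 <= ci[2]
})
mean(covers)   # should be close to 0.95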

Assignment 2

Once you have completed the practice questions, complete Assignment 2. You can download the RMD at the “Files/Exercises and Assignments” section at Canvas. Submit your completed PDF via Canvas by 5pm on Monday, 2/15. I repeat my standing admonition that you collaborate with your fellow students to complete the practice questions, exercises, and assignments. Of course, this does not mean copy-pasting answers or code; it does mean that you should talk about the questions, discuss your answers, and help each other to better understand both the concepts and code.

Module 4: Foundations for inference 1

Overview

Module 4 introduces foundations for statistical inference. In particular, it introduces the concepts of random variables, probability density functions, the law of large numbers, sampling distributions, and the central limit theorem. In addition to introducing these concepts, the Module aims to help you become familiar with their mechanics by applying them in R using simulated data.

Objectives

  • Define random variables and probability density functions (PDFs) as well as explain their usage in statistics
  • Become familiar with some common PDFs and the real-world processes that they model, or describe
  • Explain what the Law of Large Numbers tells us
  • Explain what the Central Limit Theorem says and why it matters

Exercise 2

Exercise 2 will ask you to explain the mechanics of the Central Limit Theorem using simulated data and plots. It is due at 5pm CST on Monday, 2/08. It can be downloaded at the “Files/Assignments” section at Canvas.

And now, the Module

We switch gears somewhat in this module. So far, we have focused on actual, observed data (samples). We now want to begin thinking about how we can use these data to draw inferences about the populations from which they are drawn. Toward that end, we are going to examine some foundational statistical concepts. We start with the concepts of random variables and probability density functions (PDFs).

Start by reading OIS Sections 3.1 and 3.2 on “The Normal Distribution.” Then watch the video on “Random Variables and PDFs.”

If you would like to go into these topics in greater detail, note that the Syllabus lists some suggested but optional readings, which examine “Probability” and “Random Variables and Continuous Distributions” in greater depth.

Now watch the video on “The Normal Distribution.”
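
Before moving on, here is a minimal sketch of the R functions for the normal distribution that you will see again in the practice questions:

dnorm(0)       # density of the standard normal at 0
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # the quantile function, the inverse of pnorm
rnorm(5)       # five random draws from the standard normal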

The Module 4 practice questions (parts 1 and 2) will further familiarize you with PDFs by guiding you through an exploration of some especially common ones using simulated data and plots. In the meantime, before moving on, take a moment to answer a few fundamental conceptual questions:

  • Conceptually, what are random variables?
  • What is their relationship to the observed variables that we have been working with up until now?
  • And, similarly, what are PDFs?
  • What are some of the real-life, chance processes modelled by one or two of the PDFs from the video?

Now read Lane, Introductory Statistics, pp. 300-315. The PDF is available at the “Files/Readings” section at Canvas.

Now watch parts 1 and 2 of the video on the “LLN and CLT”.

The best way to really grasp the LLN and CLT is to first become very familiar with their mechanics. We are going to do this in a couple of ways. First, we will spend some time working with a Java-based simulation of the Central Limit Theorem at http://onlinestatbook.com/stat_sim/sampling_dist/. Second, in Exercise 2, we will explore the CLT using our own simulated data and graphics as well as teach the concept to the general public via a very brief article.
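
As a minimal preview of the second approach, here is a sketch of the CLT through simulation, using my own choice of a skewed population:

# 1,000 sample means from a right-skewed exponential population (n = 30 each)
set.seed(1)
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))
hist(sample_means)   # roughly bell-shaped, centered near the population mean of 1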

Watch the “CLT demo” below, in which I briefly demonstrate the usage of the simulation at onlinestatbook.com.

Note: While working with the wonky population distribution in the video, at roughly 7 minutes 20 seconds, I incorrectly say something like “even now the sampling distribution is beginning to look a bit like a normal distribution.” In fact, the opposite is true: After just 2 SRSs of size 10, the sampling distribution looks anything but normally distributed. The point is that it approximates the normal distribution as we take more and more SRSs. Sorry if that error caused confusion.

Now go to http://onlinestatbook.com/stat_sim/sampling_dist/. Spend some time playing with the simulation. In particular:

  • Play with parent populations with various distributions. Start with a normally distributed population and then move to a uniformly distributed one. Then draw some really wonky distributions of your own.
  • Play with different sample sizes (n). Start with small samples (e.g., n=5). Then take a bunch of samples and see what happens to the sampling distribution of your sample statistic. Do it again, this time increasing your sample size. What happens to the sampling distribution of your sampling statistic as you change the sample size?
  • Start slow. Use the “animated” button to draw one sample at a time as you get started. Then, gradually speed up the sampling process, taking 5 samples at a time, and so forth. What happens to the mean and standard deviation of the sampling distribution as you complete more repetitions?
  • Compare sampling distributions. Set both of the lower two sets of axes to generate sampling distributions of the sample mean. But, designate different sample sizes for the two plots. Then, compare the two sampling distributions as you draw more and more samples. How does the sample size affect the resulting distribution?

After you have spent some time with the online simulation, complete the Module 4 practice questions. As in past Modules, I have split the practice questions into two parts in order to ease the knitting process.

Once you have completed the practice questions, be sure to complete Exercise 2 on the Central Limit Theorem. As always, please collaborate and deliberate with each other on Ed Discussions. I will continue to monitor Ed Discussions regularly to clarify, guide, and help to resolve unanswered questions.

Module 3: Know your data (and where it came from)!

Overview

Module 3 builds on the concepts surrounding data and measurement that we tackled last week by asking you to think critically about some fundamental questions: Where do your data come from? Who made them and why? What do they look like? And so on.

In addition, Module 3 asks you to grapple with how we can move toward making causal claims in the social sciences as well as analyze how scholars (i.e., Koch and Nicholson) have tried to overcome obstacles to causal inference.

Objectives

  • Explain the distinction between experimental and observational data; explain why random assignment to treatment can facilitate causal inference
  • Lay out obstacles to causal inference when using observational data as well as some strategies for (potentially) addressing them; analyze scholarship in light of these obstacles and strategies.
  • Develop additional tools to assist data exploration in R.

Assignment 1

Reminder: Assignment 1 is due by 5pm CST on Monday 2/1. I suggest that you download and preview Assignment 1 early in the week so that you can plan your time accordingly.

Where do your data come from?

Watch the brief video “Where do your data come from?” Like last week, the video will sometimes ask you to press “pause” and answer some “class questions.”

Now watch the video “Where do your data come from? (Part 2)”.

I mentioned in the videos that our main aim as social scientists is to make valid causal claims. But, as the videos also noted, there are a host of obstacles in our path, including bad data.

Read Martin, Thinking through Statistics, Chapter 2 (“Know your data”). As you read, pay close attention to different data problems and, in particular, how those data problems have at times ruined prominent research. What can we do to keep ourselves from ruin?

Toward causal inference

Now watch the final video on “Toward causal inference.”

The video suggests that we can sometimes move toward causal inference by moving from the level of general laws to the level of mechanisms. How does this play out in scholarship?

Read Koch and Nicholson (2016, pp. 932–938). As you read, pay close attention to the authors’ argument and discussion of their empirical setup.

(1) What do the authors argue? (You might even take some time to diagram their argument.) (2) How do the authors try to move from establishing mere correlation to identifying causation? (3) Why does the structure of their argument potentially help them toward this end?

Now examine the regression table from the authors’ aggregate analysis. (4) Why do the authors control for the variables that they do? (5) How compelling is the aggregate analysis for helping the authors to make a causal claim? (6) Whether you think it is compelling or not, what work is the aggregate analysis doing for the authors?

Practice questions

In the spirit of really getting to know our data, the practice questions for this week will introduce some techniques that will ease the data cleaning and exploration process. In particular, we will practice using loops with for() as well as the dplyr package to summarize and reshape our data.
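
As a minimal preview, here is a sketch of both techniques using a tiny, hypothetical data frame (the practice questions use their own data):

library(dplyr)

df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 2, 3, 4))

# dplyr: summarize by group
df %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))

# for(): loop over the columns of a data frame
for (col in names(df)) {
  print(class(df[[col]]))
}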

Download the Module 3 practice questions (parts 1 and 2) RMDs from Canvas. Note that there are two parts to the practice questions. Also note that the practice questions on loops have a couple of especially challenging questions (e.g., the de Montmort Problem). I suggest that you complete the simpler questions and then move on. You can return to the more challenging ones after you have completed Part 2 as well as Assignment 1.