Module 2: Summarizing data, numerically and visually

SOSC 13200-2


Module 2 asks you to engage some fundamental questions: What are data? Cases? Variables? It additionally asks you to grapple with critical issues surrounding measurement, as well as consider how these issues play out in common measures of human behavior and well-being. Finally, the Module draws on the data from Card & Krueger (1994) to explore some important statistical and visual tools for exploring and describing single variables.


  • Define data, cases, and variables
  • Explain the qualities of “good” measurement; explore how measurement issues play out in social science research.
  • Grasp the intuitions behind common measures of central tendency and spread; become familiar with their notation and learn to calculate them in R
  • Explore data in R using summary functions, tables, and plots

To Do

Download, complete, and submit Assignment 2 by 11:59pm on 1/24. The file will be available for download on Tuesday, 1/18. I recommend that you preview the Assignment shortly after it is posted so that you can plan your time accordingly.

…And Now, The Module

What are data?

Start by watching the video on “Data” below.

Variables and measurement

As the video mentions, we will dig into some actual data as we complete the practice questions. And, Assignment 2 is an opportunity to replicate portions of Card & Krueger’s own summary of their data. But first, watch the video on “Variables & Measurement.” Note that the video will occasionally ask you to press “pause” and then spend some time answering some “class questions.” You don’t have to submit your answers to these questions. However, quickly jotting your answers down may be useful, as we may circle back to some of the questions in our next class meeting.

How do the issues surrounding conceptual clarity, validity, and reliability play out in real life? In a moment, you will read “Measuring and Understanding Behavior, Welfare, and Poverty” by Nobel-prize winning economist Angus Deaton. Before you read, take a moment to answer these questions:

  • What measures of human welfare and poverty can you think of?
  • How good do you think these measures are?
  • What are some potential problems with the conceptual clarity, validity, and reliability of these measures?

Now read the article. As you read, think about the different examples that Deaton lays out. What sort of violation does each one exemplify (conceptual clarity, validity, or reliability)? And critically, what is at stake?

Summarizing variables numerically

Before watching the videos below, you should first read the assigned excerpts from the Verzani (Simple R) and Wickham (ggplot2) texts. These will be useful as you engage with, and in some cases follow along with, the videos.

You should also read Card & Krueger’s (1994) article on “Minimum Wages and Employment.” Many of you will remember the article from the fall quarter. As we revisit the article now, we are mainly concerned with getting a strong grasp of the authors’ data: What are the units? What sorts of variables do the data contain? How do the authors describe their data, numerically and visually? Accordingly, pay particular attention to the authors’ description of the dataset as well as any tables and figures that summarize the different variables.

Once you have completed the readings above, watch the videos on “Summarizing Variables Numerically” and “Summarizing Variables Visually.”
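As a preview of the summary functions the videos cover, here is a minimal sketch in R. The wage values below are invented for illustration; they are not Card & Krueger's actual data.

```r
# Hypothetical starting wages (in dollars); values invented for illustration
wages <- c(4.25, 4.25, 4.50, 4.75, 5.00, 5.25)

mean(wages)      # central tendency: arithmetic mean
median(wages)    # central tendency: middle value
sd(wages)        # spread: standard deviation
IQR(wages)       # spread: interquartile range
summary(wages)   # five-number summary plus the mean
table(wages)     # frequency table of each distinct value
```

Each of these functions takes a numeric vector, so you can apply them directly to any column of a data frame (e.g., `mean(df$wage)`).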

Let’s get some additional practice with summarizing data in R. On Monday evening, I will post a practice exercise for the week that will guide you through some basic operations that you will later apply on Assignment 2. The practice exercise will be located at Canvas under “Files/Practice Exercises.” Recall that although you should complete the practice exercise, you do not have to submit it to me for a grade. It is for your learning only.

Remember: If you get stuck at any point… breathe. Coding can be frustrating at first, but we will work through it together. There are lots of ways to seek help:

  1. Use the “Help” tab in RStudio
  2. Internet search
  3. Post your question to Ed Discussion
  4. As a final option, email me directly or visit me during my office hours

As you seek help, try to specify the nature of the problem: Examine any warnings or error messages. What line of code seems to be the issue? Which function, specifically? (During knitting, Markdown will often tell you which line of code is stalling the knitting process.) If you are getting error messages, are you missing parentheses, commas, or quotations? (This happens to me all the time.) Answering these questions will help to ensure that you get the help you need.
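For instance, a single missing comma or parenthesis is enough to stop R. Here is a tiny illustration (the broken lines are commented out so they won't halt your session):

```r
# mean(c(1 2, 3))    # missing comma: "Error: unexpected numeric constant"
# mean(c(1, 2, 3)    # missing ")": the console shows a "+" prompt, waiting for more input
mean(c(1, 2, 3))     # corrected version runs and returns 2
```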

Module 1: Course Intro. & Research Questions

SOSC 13200-2 (WIN22)


Module 1 has three main components: (1) We begin by recapitulating some key points from Tuesday’s course overview. (2) We then get to know R, RStudio, and Markdown via a couple of brief videos and a practice exercise. (3) Finally, we start thinking about the nature of research questions in the social sciences: What are the attributes of a good research question? How do we go about formulating one?


  • Become familiar with the RStudio interface and R’s basic functionality.
    • Create a class directory on your personal laptop (using an R Project, if you are up for it).
    • Install and load external packages such as the ggplot2 and here packages.
    • Load Excel-style datasets into R using functions such as read.table(), read_dta(), read.csv(), etc.
    • Begin to explore and manipulate datasets.
  • Lay out some attributes of good research questions in the social sciences, and how we can begin to assess them.
  • Explain the distinction between correlation and causation, and some of the assumptions that get us from the former to the latter.
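As a preview of the package and data-loading workflow described in the bullets above, here is a minimal sketch. The file and column names are hypothetical; in practice you would point read.csv() at the dataset you download for class.

```r
# One-time installs (run once): install.packages(c("ggplot2", "here", "haven"))
# Then load per session: library(ggplot2); library(here)

# For illustration, write a tiny CSV and read it back in;
# in practice, read the dataset you downloaded from Canvas.
write.csv(data.frame(store = 1:3, wage = c(4.25, 5.05, 4.75)),
          "demo.csv", row.names = FALSE)
df <- read.csv("demo.csv")   # read_dta() from the haven package works similarly for Stata files
str(df)                      # inspect the structure: 3 observations, 2 variables
unlink("demo.csv")           # clean up the demo file
```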

Tasks to complete in Week 1

  • Send an email to Professor Deming with your preferred meeting times for Week 2.
  • Download and install R, RStudio, and LaTex. Installation instructions are located at the “Files” section of Canvas.
  • Complete this Module.
  • Download, complete, knit, and submit Assignment 1 as a PDF to Canvas by 11:59pm on Monday 1/17.

And now, Module 1…

Course overview

If you were not able to attend our first meeting on 1/11, you should first go to the course Canvas page to download the course Syllabus. Read it. This will ensure that you do not miss any important logistical points.

Introduction to R, RStudio, and Markdown

If you have not already done so, download and install R, RStudio, and LaTex. I have posted detailed installation instructions to the “Files” section of Canvas. If you are still having trouble with installation after carefully following the instructions, please meet with me during our regularly scheduled meeting time on Thursday 1/13. Together, we will get you up and running.

Once you are up and running in R, watch the “Introduction to R” video below.

Let’s now get to know the RStudio interface. Start by watching the “R Demonstration” video below. 

We will work in R and RStudio extensively this quarter. In class, my lecture slides will regularly display R code, and I will often ask you to work in small teams to complete practice exercises in R. The aim of these exercises is to boost your programming skills. But more importantly, they will help you apply key statistical concepts and more deeply engage the research that we examine by allowing you to “get under the hood” of the data used by different authors.

Let’s start building our programming skills via a short exercise. Go to the “Files” section of Canvas. You will see a folder entitled “Practice Exercises.” Open it. Then, download the file entitled “module1_practice_exercise.” It is an R Markdown (RMD) file. Save it to your personal computer, open it in RStudio, and work through it. When you are finished, knit it to PDF format. You do not have to submit it to me.

When you get stuck on a practice exercise or assignment: First, breathe! R programming can be frustrating, especially at first. Then, do some googling to see if you can find a solution to the problem. After that, post your question to the class Ed Discussion site. (Be sure to review older posts to see if your question has already been answered.) Finally, note that you are welcome to complete exercises and assignments in teams of 3-4 students (in fact, I encourage it!). Just be sure to follow the guidelines that I have laid out in the Syllabus.

Research questions

At the end of this course, you will submit a brief report in which you lay out a research question and begin to assess it via analysis of some quantitative data. With this in mind, what makes a good research question? How do we go about formulating and, ultimately, assessing one?

The readings by King, Keohane, and Verba (1994) and Holland (1986) will help us begin answering these questions. We will discuss the readings when we meet in our small teams during Week 2, so be sure to jot down key points.

Let’s start with KKV (1994). This comes from what is perhaps the most famous (for some scholars, infamous) contemporary text on the nature of social science research. Before you read, think about your own answers to the two questions that I raised above: What makes a good research question? How would you go about formulating one?

Once you have laid out your own preliminary answers to these questions, read the chapter. Here are some discussion questions that you should try to answer. We will circle back to them during our small-team meetings in Week 2:

  • The authors say that the aim of social science is descriptive and causal inference. What is inference? What is causal inference?
  • What are some attributes of a good research question?
  • Based on your reading, how might the authors answer the second question above? How does this coincide with and/or differ from your own preliminary answer?

Once you have read KKV (1994) and thought about your answers to the questions above, we can move to Holland (1986). This is not an easy read, but it is a good one. In particular, it will help us think carefully about an important distinction between simple correlation and causation.  Before you read, consider how you think about this distinction: What is the difference between correlation and causation?

You should now read Holland (1986). As with KKV (1994), here are some discussion questions that you should try to answer:

  • What does Holland mean when he refers to causation? How does this differ from the way other authors (and philosophers) might use the term?
  • What is the Fundamental Problem of Causal Inference (FPCI)?
  • How can we overcome the FPCI, according to the author?
  • Do you see any problems with the solutions in practice?
  • How do correlation and causation differ? What might Holland say here?

Assignment 1

Download “assignment1” from Canvas. It is located at “Files/Assignments and Final Project.” It is in RMD format. Open Assignment 1 in RStudio and read the instructions carefully. Complete it, knit it to PDF, and submit it to Canvas. It is due at 11:59pm on Monday 1/17.

Recall from the video above that, although the practice exercises will guide you through some of the coding procedures that you will later use to complete assignments, they will not guide you through all of the required procedures. That is, completing the assignments will require some additional, independent learning of R. With this in mind, be sure to see my earlier note about how to go about learning R (i.e., googling, review of coding fora, Ed Discussion, and teaching and learning from each other).

SOSC 13100

Week 1: Thinking about social science concepts and asking good empirical questions.


We will first build on our discussion on Tuesday by revisiting some of its central questions while reading Mlodinow’s “Peering through the Eyes of Uncertainty.” We will then examine some of the challenges that crop up when studying many social concepts, and we will examine some solutions for overcoming them. Finally, we examine some attributes of good empirical questions and, ultimately, root them in Friedman’s theory of positivist social science.

How do we study random social events?

On Tuesday, we discussed three fundamental questions: (1) Why study random social events? (2) What do we mean by “random” event? And (3) How can we go about studying random social events? 

Read Mlodinow’s chapter on “Peering through the Window of Uncertainty.” The reading builds on our discussion surrounding these three questions. Accordingly, as you read, keep these questions in mind. In particular:

We saw that one reason to study random social events was to avoid policy and decision-making based on poor intuition. How do findings derived from empirical study diverge from intuition in the different examples laid out by Mlodinow? Can you think of other examples of such divergence?

How can we go about studying a random social event? Here, consider Roger Maris’ “fluke” breaking of Babe Ruth’s record for most home runs in a single season in 1961. How does Mlodinow go about “studying” this random event in the chapter? What sorts of questions is he able to tackle through his study?

Once you have completed the reading, watch the brief recap video on “How do we study random social events?”

NOTE: A clarification: At one point in the video, I lay out an example to illustrate the “flavor” of Bayesian statistics: namely, that probability can be viewed as subjective; probabilities are what we “feel” they are, and we constantly update our beliefs about probabilities based on observed data and our experience. The six-sided die (99,999 heads in a row) and Sally Clark examples begin to illustrate this idea. But note that the Sally Clark example is *not* inconsistent with the frequentist approach to probability, which views this illustration as a simple question of independence v. dependence of two events. 

Thinking about social science concepts

Social science investigates relationships between big concepts that are often hard to define and hotly contested. What is democracy, for instance? Is democracy competitive elections? Competitive elections plus basic freedoms? 

Watch the video below. The video briefly lays out some of the challenges of conceptualization in the social sciences. It then lays out two common approaches to these challenges. Keep both the challenges and approaches in mind as you complete your reading further below.

Each of the solutions presented in the video has drawbacks. The first solution, in particular, isn’t a solution at all! To do quantitative social science research, we need to be able to apply concepts across a large number of cases. What are some problems with the second solution?

You will now read Fearon and Laitin’s paper on “Ordinary Language.” The authors grapple with the tricky issues raised in the video as they relate to the concepts of ethnicity and ethnic violence. The authors then lay out a third solution: rooting concepts in an analysis of their meaning. As you read, here are some questions to think about:

  • What do the authors say about the quantoid and interpretivist responses to confusion surrounding social science concepts? What do you think of the authors’ arguments here?
  • What is the third solution/response proposed by the authors? How does it differ, in particular, from the quantoid response? Do you see any problems with the authors’ approach?

Asking good empirical questions 

In SSI, we will focus on empirical questions. In the last video, we examine what this means exactly, and we begin thinking about some of the attributes of good empirical questions. 

In perhaps the most famous piece on social science methodology of the 20th century, Chicago economist Milton Friedman tackled some of the points that I raised in the video above. In particular, Friedman—as you will soon read for yourself—drew a strong distinction between normative and positivist theory. 

As you read, let’s push beyond the video content just presented. In particular, think about Friedman’s “theory of theory”: According to him, what should good social science be doing, such that we should be asking the sorts of questions that I presented in the video above?


We’re taking our discussion online this week. Specifically, after watching the videos and completing the readings, you should go to the course Canvas page and then to the Discussion Board section. You should then write a short (~1–2 paragraphs) response to one of the questions that I have posted there. There are three questions (all containing “Week 1” in their title), but you only need to answer one. This is meant to be low-stakes in that I will be grading for completion. But, for the sake of practicing good written communication, your response should give a clear answer to the question, and you should support your answer with specific reasons. Your reasons can be drawn from the readings and/or lectures, and you should also feel free to draw inspiration from others’ posts. 

Module 6: What kind of data analysis? (Part 2)


Module 6 is the second part of a 2-part series on data analysis. Whereas Module 5 focused mainly on assessing the credibility and relevance of data, Module 6 focuses on developing an analysis plan. We examine how a good analysis plan is linked to argument as well as how this plays out in the reading on Queens by Dube and Harish. We then examine some key considerations in developing an analysis plan.


  • Explain how data analysis is linked to argument.
  • Examine the three main “links” of an analysis plan and how they crop up in social science research.
  • Map out an analysis plan based on your argument, focal relationship variables’ type, and data structure.
  • Use your own data to create 1-2 visualizations that motivate your question or support your argument.


  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.

Module 6

Review of Module 5

To get started, watch the video on “Module 5 review.”

Analysis plans as argument

Whereas Module 5 focused on assessing the credibility and relevance of our data, Module 6 focuses on analysis plans: how to conceptualize them, what they look like in practice, and how we can go about developing them.

A running theme throughout the Module is that good data analysis is an extension of an argument: It should be tightly linked to the mechanisms that we lay out in our argument. The next video introduces this idea.

Now watch the video on “Analysis plans as argument.”

Anatomy of a plan: “Queens”

How do these three “links” play out in practice? What do they look like in scholarship? In a moment, I will ask you to read “Queens”, by Dube and Harish. I’ve selected this article for two reasons. For one, it tackles a fascinating question: Are states ruled by women less prone to conflict than those ruled by men? In addition, the article nicely exemplifies the three links that we examined above. Accordingly, as you read, try to answer the following questions:

  • What are the focal relationship variables?
  • What are the mechanisms that link the variables?
  • What are some observable implications of the mechanisms? Can you think of some that the authors do not address?
  • How is the analysis linked to the mechanisms?

We will tackle some of these questions in the video below, but you should try to answer them for yourself before watching. Now read “Queens” by Dube and Harish. When you have finished reading, watch the video on “Anatomy of a plan: Queens.”

Developing your analysis plan

So far, we have examined how data analysis is linked to argument. But there are a couple of other considerations that we should make as we develop our analysis plans. We briefly examine these considerations in the final video. As you watch, think about your own data in some detail: What type of variables are in your focal relationship? What is the overall structure of your data? What are the units of analysis?

The reason to keep your answers to these questions in mind is that, ultimately, they should inform your analysis. For instance, in Module 5, Papachristos et al. used so-called Poisson regressions rather than OLS because their dependent variable was a “count” variable that took the value 0 for many observations. Meanwhile, in Module 2, Albertus and Deming used fixed effects regression analysis because they were using panel data and worried that omitted country-level variables might otherwise bias their coefficient estimates.
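To give a flavor of what these two model choices look like in R, here is a hedged sketch using simulated data. The variable names are invented, not the authors' actual datasets: a Poisson regression for a count outcome, and a fixed effects regression via unit and year dummies.

```r
set.seed(1)
# Simulated panel: a count outcome and a continuous predictor (names hypothetical)
d <- data.frame(crimes       = rpois(200, lambda = 2),
                coffee_shops = rnorm(200))
d$country <- rep(letters[1:20], each = 10)   # 20 units
d$year    <- rep(2001:2010, times = 20)      # 10 periods

# Poisson regression: appropriate for count outcomes with many zeros
m_pois <- glm(crimes ~ coffee_shops, family = poisson, data = d)

# Fixed effects via dummies: each country (and year) gets its own intercept,
# absorbing time-invariant country-level confounders
m_fe <- lm(crimes ~ coffee_shops + factor(country) + factor(year), data = d)
```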

In this vein, upon watching the video, it may be worth reviewing your notes on the readings from past Modules. In particular: What sort of analysis did different authors perform? How does their analysis plan seem to be shaped by data and variable considerations?

Now watch the video on “Developing your analysis plan.”

Stata and R Exercise: Visualizing your data

The Module 6 R exercise is intended to get you thinking about the first 1-2 pieces of your final data analysis, which will very likely consist of some kind of table, plot, map, or other visualization. If you are a Stata user, you should complete the exercise in Stata. Stata has a user-friendly interface for generating simple plots such as histograms, scatterplots, and barplots. Note that in order to complete the exercise, you will need to have your data in hand. You do not have to submit your completed exercise.

Before completing the exercise, let’s reflect on the properties of good data visualization by reading “Aesthetics and technique in data graphical design” by Tufte. Go ahead and read the chapter.

When you have finished reading, follow the instructions that I have pasted below. If you are completing the Module exercises in R, note that these instructions are shown in the RMD file:

Exercise instructions

Be sure to read Tufte’s chapter on “Aesthetics and technique in data graphical design” before you complete this exercise. Think about Tufte’s advice about how to create an impactful graphic and try to implement it below. In particular, label your graphics, use nice colors, and tell a story.

A: Univariate description

Create two visualizations of the univariate distribution of your two main variables – that is, your focal relationship variables. Think of this as presenting your data to your readers. Be sure to consider your variables’ types and select your visualization type accordingly (e.g., don’t create a histogram for an indicator variable; use a barplot instead).
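For example, here is a minimal ggplot2 sketch with simulated data and hypothetical variable names: a histogram for a continuous variable and a barplot for a 0/1 indicator.

```r
library(ggplot2)
set.seed(3)
# Hypothetical data: a continuous variable and a 0/1 indicator
d <- data.frame(wage  = rnorm(200, mean = 5, sd = 0.5),
                union = rbinom(200, 1, 0.3))

p_hist <- ggplot(d, aes(wage)) +
  geom_histogram(bins = 20) +
  labs(title = "Distribution of wages", x = "Wage ($)", y = "Count")

p_bar <- ggplot(d, aes(factor(union))) +
  geom_bar() +
  labs(title = "Union membership", x = "Union member (0/1)", y = "Count")
```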

B: Create a bivariate graph

Create a visualization that begins to capture your theory about your focal relationship. That is, create a visualization of the association between your two main variables of interest. Think of this step as presenting your story (argument) to your readers.
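One common starting point (again sketched with simulated data and hypothetical names) is a scatterplot of the two focal variables with a fitted line overlaid:

```r
library(ggplot2)
set.seed(4)
# Simulated focal relationship: y depends positively on x
d <- data.frame(x = rnorm(100))
d$y <- 1 + 0.5 * d$x + rnorm(100, sd = 0.3)

p_biv <- ggplot(d, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +   # overlay a linear fit with a confidence band
  labs(title = "Focal relationship (simulated)",
       x = "Explanatory variable", y = "Outcome variable")
```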


Using your data, create the ugliest and most useless graphic you can imagine. What makes it bad / useless?

Module 5: What kind of data analysis? (Part 1)


Module 5 examines data and data analysis. We examine how to evaluate the relevance and credibility of secondary data for our research as well as justify our data for readers. Module 5 also introduces the concept of focal relationships and examines how we can use focal relationships to guide the development of our data analysis plan. We examine how using focal relationships to guide our data analysis plays out in “More coffee, less crime” by Papachristos et al. Finally, the Module introduces regression analysis with panel data.


  • Evaluate the relevance and credibility of secondary data for your independent research; justify your data for readers.
  • Explain the concept of a focal relationship and how it should guide our data analysis (and all other aspects of our research).
  • Grasp the intuition behind fixed effects (FE) regression when analyzing panel data. Implement a simple FE regression in R and/or Stata.


  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.


Recap. of Module 4

Before we dive into Module 5 material, watch the video on “Recap. of Module 4.”

Evaluating and justifying your data

Most of you will analyze secondary data this quarter. Secondary data are data that you do not personally collect via surveys, interviews, and so forth. They are instead collected by someone else for some other study, and you are simply re-purposing them for your own study. You must therefore scrutinize the data: Are they credible and relevant for your purposes? What are their limitations? You must also answer these questions for your readers in your final research report.

The first video examines how to go about performing this sort of assessment. Watch the video on “Assessing and justifying your data.”

Using focal relationships to develop your analysis plan

Once you have obtained some (relevant and credible) data, you can now begin developing your analysis plan. How should you go about doing this? In the video below, I advocate an approach that centers on so-called “focal relationships.”

However, before you watch the video, you should read “More coffee, less crime” by Papachristos et al. Pay close attention, in particular, to the authors’ analysis. Here are some questions to guide you toward that end:

  • The authors’ analysis proceeds in several steps. How would you describe these different steps?
  • How do the authors familiarize readers with their data (i.e., trends and patterns)?
  • What kind of regression analysis do the authors perform? Read the details and write them down.
  • Why do the authors select this sort of regression? How do they justify their model selection?

Now watch the video on “Developing an analysis plan.”

Fixed effects regression using panel data

In “More coffee, less crime,” the authors analyze panel data. Panel data are basically data in which we have observations on our units over time. Many of you will analyze some type of panel data this quarter, and this raises some unique analytical opportunities and issues. I have therefore created a brief video on how to run so-called “fixed effects” regression when using panel data.

In the video, I use mainly R, but I also include some code for running fixed effects regression in Stata. I note at the outset that the example in the video is drawn from “Econometrics in R,” an excellent online resource for anyone who is interested. Here is a link to the example from which I have drawn.

Watch the video on “Intro. to panel data”. NOTE: On one slide, I incorrectly label the IV and DV. Throughout the regression examples, the correct DV is rate, which is the number of traffic deaths per 10,000 people in a given state-year. The IV is beertax, which is the tax on a case of beer (adjusted for 1988 dollars).
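As a hedged sketch of the fixed effects setup from the video: the data below are simulated (the real example uses the beer-tax data from “Econometrics in R”), but the variable names rate and beertax match the video. One simple way to get the “within” estimate is to include state and year dummies.

```r
set.seed(2)
# Simulated state-year panel with the video's variable names
traffic <- expand.grid(state = letters[1:5], year = 1982:1988)
traffic$beertax <- runif(nrow(traffic))
traffic$rate    <- 2 - 0.5 * traffic$beertax + rnorm(nrow(traffic), sd = 0.2)

# Fixed effects via dummies: each state (and year) gets its own intercept
fe <- lm(rate ~ beertax + factor(state) + factor(year), data = traffic)
coef(fe)["beertax"]   # the within-state estimate of the beer-tax effect

# Equivalent "within" estimator using the plm package (if installed):
# library(plm)
# plm(rate ~ beertax, data = traffic, index = c("state", "year"), model = "within")
```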

Module 5 R Exercise: Maps in R

This week’s R exercise centers on map making. We have of course just seen some really excellent examples of how maps can help to support a claim in the Papachristos et al. reading. Download the RMD file from Canvas along with the accompanying .xlsx data on chicago_crime. Complete the RMD and knit your file to PDF. Note that in order to complete the exercise, you must first complete the first portion of the Module 4 R exercise. Specifically, you should complete all steps up to the merger and then save your cleaned GDP per capita data as a .csv file. You will load this dataset into R as part of this week’s exercise.
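To give a flavor of the exercise, here is a minimal hedged sketch of a state outline map in R. It uses ggplot2’s map_data(), which additionally requires the maps package; the crime-point layer and its column names are hypothetical stand-ins for the chicago_crime data.

```r
library(ggplot2)   # map_data() also requires the maps package to be installed
il <- map_data("state", region = "illinois")
ggplot(il, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "black") +
  coord_quickmap() +   # keeps the aspect ratio roughly correct for maps
  labs(title = "Illinois outline")
# To add point data, layer on something like (column names hypothetical):
# geom_point(data = chicago_crime, aes(longitude, latitude), inherit.aes = FALSE)
```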

Module 4: Concepts and measures


Module 4 reviews some key ideas surrounding concepts and measures. In particular, we lay out some attributes of good concepts and measures and examine how these attributes play out in real-world research. We then examine best-practices for translating our concepts into measures. Finally, we dissect “Good cop, bad cop” by Michelle Pautz as a means of examining what these best-practices actually “look like” in practice.


  • Lay out the attributes of good concepts and measures in the social sciences.
  • Learn some strategies for translating latent concepts into observable measures and begin to implement them in a revised research design.
  • Critically analyze social science scholarship for the clarity, validity, and reliability of its concepts and measures.
  • Practice giving and receiving critical and constructive peer feedback during team meetings.
  • Reflect on “tidy data” and brush up on your data tidying skills via the Module 4 R exercise.


  • Thu. / Fri. @ 8pm CST. Post your Q & A about the week’s reading at Ed Discussion.
  • Fri. @ 8pm CST. Submit your revised research design. You should submit your design to Canvas as well as post it as an attachment to your team’s thread at Ed Discussion. I have provided some guidelines for the revised design below. Note that you should read your team members’ revised designs over the weekend and then post written comments for each one by 8pm on Mon. of Week 5.

Revised research design proposal

Your revised research design should be 1-2 pages long, single-spaced. It should clearly lay out (1) a research question and its justification, (2) an answer to the question, (3) the data that you will analyze and why, and (4) an analysis plan and its justification.

In general, your revised design should be more focused and detailed than your initial submission, and it should incorporate relevant feedback from me and your peers. You should also think about what you have learned about good research questions, theory, and concepts and measurement in Modules 1 through 4; you should incorporate these lessons where applicable. Finally, you should also be learning the state of the art surrounding your topic. This should be reflected in how you motivate your research question and in your elucidation of your answer.

I will be marking for completeness, clarity, and thoughtfulness. It is not necessary that you anticipate and address every problem and/or objection. Much of that is still ahead of you. But it does mean – as I mention above – demonstrating that your ideas are developing and becoming more focused, and that you are being attentive to feedback and the course material.

The module

Theories, hypotheses, and arguments review

Before we dive into the material surrounding concepts and measures, watch the two short videos below. The first addresses a couple of key points to keep in mind as you revise your research design. The second reviews key points from Module 3 on “Theories, hypotheses, and arguments.”

Attributes of concepts and measures & Translating concepts to measures

Module 4 focuses on concepts and measures – specifically, attributes of good concepts and measures as well as how we can actually go about translating latent concepts into observed measures.

To get started, read the Chapter by Bernhard Miller on “Making measures capture concepts.” When you have finished reading, watch the videos on “Attributes of concepts and measures” and “Translating concepts to measures.”

Anatomy of a measure: “Cops on Film” by Michelle Pautz

You should now read “Cops on Film” by Michelle Pautz. Given this week’s topic, you should focus on the concepts laid out in the article as well as how Pautz goes about measuring them. In particular, think about your answers to the following questions:

  • What is the puzzle / question, and how does Pautz motivate it?
  • What concept does Pautz aim to measure?
  • How clear is the concept? How does Pautz define it?
  • How closely does the measure match the concept?
  • Would you recognize a good / bad cop if you saw one in film?

We will examine some of these questions in the video, but the video is really meant to be a continuation of the thought experiment that I raised at the end of the last video. That is, I try to show how thinking about the theoretical scale of our latent concept can help us generate and/or locate a more valid measure of the concept.

Now watch the video on “Anatomy of a measure.”

R exercise on Tidy data

This Module’s R exercise (along with the Module 3 exercise) may be the most useful. The reason is that data cleaning and transforming comprise the bulk of data analysis. And many of you will soon be doing (and may already be doing) lots of data tidying as you obtain, merge, and transform data for your own research projects. Accordingly, I highly encourage you to complete the exercise if you can – the skills and syntax you develop will be useful later on. And even if you are unfamiliar with R, reflecting on what we want our data to look like, as well as the steps we can take to convert them into that form, will pay dividends later on.
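As a small taste of what “tidying” means in practice, here is a sketch using the tidyr package with hypothetical wide-format GDP data: one column per year gets reshaped into one row per country-year.

```r
library(tidyr)
# Hypothetical "wide" data: one GDP column per year
wide <- data.frame(country  = c("A", "B"),
                   gdp_2019 = c(1.1, 2.0),
                   gdp_2020 = c(1.0, 1.9))

# Tidy ("long") form: one row per country-year observation
long <- pivot_longer(wide,
                     cols         = starts_with("gdp_"),
                     names_to     = "year",
                     names_prefix = "gdp_",
                     values_to    = "gdp")
long   # 4 rows: 2 countries x 2 years
```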

Download the Module 4 R Exercise from Canvas and complete it. You are not required to submit it for a grade.
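To give you a feel for what the exercise covers, here is a minimal sketch of tidying data with the tidyr package. The toy data frame below is hypothetical (it is not the exercise data): it reshapes one-column-per-year “wide” data into “long” form, where each row is a single case-year observation.

```r
# A minimal tidying sketch using tidyr (hypothetical toy data)
library(tidyr)

wide <- data.frame(
  country  = c("A", "B"),
  gdp_2019 = c(1.2, 3.4),
  gdp_2020 = c(1.1, 3.0)
)

# Pivot the two gdp_* columns into (year, gdp) pairs
long <- pivot_longer(
  wide,
  cols         = starts_with("gdp_"),
  names_to     = "year",
  names_prefix = "gdp_",
  values_to    = "gdp"
)

long   # one row per country-year
```

The long form is what most modeling and plotting functions in R expect, which is why the exercise emphasizes it.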

Module 2: Research areas, topics, and questions


Module 2 examines research questions: in particular, the attributes of a good research question, and a general process we can follow to move from a broad research area to a more specific topic and, finally, a focused question. The module will ask you to reflect on research questions as they crop up in real social science research. Finally, the week’s R exercise will guide you as you learn about and practice web scraping.


  • Grasp the attributes of a good research question
  • Grasp the process for developing a good research question and begin to implement it
  • Examine the development of a research question by engaging a real-world example
  • Examine and practice web scraping in R

Due this week

  • Thu and Fri @ 8pm CST: Ed discussion Q & A.
  • Fri @ 8pm CST: Proposed research design. Submit via the Assignments section at Canvas. Also post as an attachment beneath your team’s heading at Ed Discussion. NOTE: You should read each other’s proposals and prepare 1-2 paragraphs of feedback for each one. You will be required to post your written feedback to Ed Discussion by Mon. 4/12 @ 8pm (Week 3).

Some guidance on research designs

Your proposed research design should be a roughly 1-page, single-spaced document. It should do three things: (1) lay out a research question and justify it; (2) delineate a brief answer to the question; and (3) discuss the sort of data you will need in order to answer the question and why.

The readings and videos for this week are meant to help you to develop your research question and design. So, if you can, try to complete the reading and watch the videos relatively early in the week.

I want to preview a couple of key points from the reading and videos here as means of reassuring and guiding you. In particular:

  • Your research question and design will evolve. In fact, it will likely continue to evolve even as you conduct your analysis and write up your results toward the end of the quarter. (I am still constantly revising the design document for my own dissertation as it becomes a book!) This can be frustrating, but it is inevitable. You will be moving in and out of the literature and data surrounding your question in upcoming weeks, and in doing so, you will get a much better idea about what makes a “good” question as well as the sorts of questions you can examine given the available data.
  • For part 3, it is not necessary that you have specific data / datasets in mind. You can instead approach this as a thought experiment: Given your question and tentative answer, what are your ideal data and why? What sort of measures would you obtain? What would the units of analysis be? Would the data be cross-sectional, panel / time-series, or something else? Are the data likely to derive from an experiment, survey, or machine-learning algorithm, or are they likely to have been hand-coded by scholars? The more you think through these and related questions, the better. Thinking carefully about your ideal data now will help you to identify workable data later.

The Module

Attributes of good questions and developing your question

First read the chapter on “Beginning the research process” by Buttolph Johnson. As you read, think about your answers to the following questions. You don’t need to write your answers down, but thinking about them will help you to develop your own research question:

  • What are some attributes of a good research question?
  • How does one actually go about developing a good research question?
  • How does the literature review go hand in hand with the development of a good question?

Once you have completed the Buttolph Johnson chapter, watch the three videos below on “Research areas, topics, and questions”. Note that you will need to enlarge the videos in order to view them.

Anatomy of a research question

Now read the article by Albertus and Deming on “Branching out.” You should read the entire article. In your reading, you might try to implement some of the advice laid out by Dane’s chapter on “Reading and structuring research” in Module 1. In addition, in line with this week’s overarching topic, here are some questions to think about as you read. To the degree that there is time, we may discuss some of these questions during our team meetings this week:

  • What is the central question that the authors ask?
  • What have other scholars found in terms of answers to this question (or very similar questions)?
  • Given that the question is not really new, what contribution – if any – do you think that the authors make?
  • What sort of data would you want – in theory – in order to answer the question posed by the authors?
  • What data do the authors actually use and how does it differ from the ideal data that you described above?
  • How do the authors justify the data that they use?

Once you have read the article, watch the video on “Anatomy of a research question.”

R exercise on web scraping

Remember that you are not required to complete and submit the weekly R exercise for a grade. They are a voluntary tool for introducing you to some operations in R that you may find useful this quarter and/or in the future. This week’s exercise introduces you to one procedure for scraping data from web pages written in HTML. I am not an expert on this topic, so any data scientists with experience in web scraping should feel free to post additional resources and advice at Ed Discussion.

Note that for this exercise, you will need to add SelectorGadget to your web browser. I use Chrome and have installed the SelectorGadget extension from Chrome’s Web Store. For other browsers, you can simply drag the SelectorGadget bookmarklet to your bookmarks toolbar; the bookmarklet is available on the SelectorGadget website.

I have included a very brief video below that introduces the selector tool as it applies to this week’s R exercise.
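To preview the workflow, here is a minimal scraping sketch using the rvest package. The URL and the CSS selector (".headline") are placeholders – SelectorGadget is the tool you will use to find the right selector for your actual page.

```r
# A minimal web-scraping sketch with rvest (hypothetical URL and selector)
library(rvest)

page  <- read_html("https://example.com")    # placeholder page
nodes <- html_elements(page, ".headline")    # selector found via SelectorGadget
html_text2(nodes)                            # extract the nodes' text
```

The general pattern – read the page, select nodes, extract text or attributes – carries over to most HTML scraping tasks you will encounter in the exercise.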

Module 8: Multivariate regression


Module 8 builds directly from the concepts that we examined in Module 7 on bivariate regression. It shows how omitted variable bias can undermine our regression results and demonstrates how multivariate regression, when well done, can help to mitigate bias. It lays out the logic of multivariate regression and helps students to begin running, interpreting, and using multivariate regression for prediction.


  • Explain what omitted variable bias is as well as how multivariate regression can help us to mitigate it.
  • Build and run multivariate regression models in R.
    • Explain the logic behind your IV, DV, and controls.
    • Interpret regression results.
    • Present regression results in publishable tables.
    • Use a regression model for prediction.


Exercise 4 is due by 11:59pm CST on 3/08. Be sure to download the RMD for Exercise 4 from Canvas and review it early on in the week so that you can plan your time accordingly.

Module: Multivariate regression

Omitted variable bias

First, read OIS Sections 8.1 and 8.2 on “Multiple Regression”. Then watch the video series on “Multivariate Regression.” As with last week’s Module, I have divided what would normally be a single lecture into a series of short videos on different subtopics within multivariate regression.

Begin by watching the video on “Omitted variable bias: Intuition.”

Now watch the video on “Omitted variable bias: The mechanics.” Note that this video uses math to demonstrate the intuition that I laid out in the first video. The math is limited to linear algebra, so I encourage you to follow along with the different steps as best as possible. I happen to think that this is one instance in which the math does help us toward a better grasp of the concept.
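If you want to see the mechanics in action, here is a small simulation of omitted variable bias (my own sketch, not an example from the video). The variable x2 is correlated with x1 and affects y; omitting it from the regression biases the estimated coefficient on x1.

```r
# Simulating omitted variable bias (hypothetical data)
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)             # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true effect of x1 is 2

coef(lm(y ~ x1))        # omits x2: slope on x1 is biased upward
coef(lm(y ~ x1 + x2))   # controls for x2: slope on x1 is near the true value
```

In this setup the biased slope lands near 2 + 3 × 0.8 = 4.4, which is exactly what the bias formula in the video predicts: the true coefficient plus the omitted variable’s effect scaled by its relationship with the included regressor.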

Multivariate regression

Now that we have examined OVB, how can multivariate regression help to mitigate it? We examine that question in the next video. Note that in this video and those that follow, I will walk you through some examples using the CASchools data contained in the AER package in R. You can follow along: simply use the code below to install and load the data as well as create our dependent and independent variables:

# Load the data (install the AER package first if needed: install.packages("AER"))
library(AER)
data(CASchools)

# Dependent variable
CASchools$score <- (CASchools$read + CASchools$math)/2

# Independent variable
CASchools$STR <- CASchools$students/CASchools$teachers

# Our simple model
model <- lm(score ~ STR, data = CASchools)

Now watch the videos on “Multivariate regression: The logic” and “Multivariate regression: Prediction”.
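To extend the simple model above to a multivariate one, we can add a control. This is a sketch of my own (the video may use different controls): the CASchools data include english, the percentage of English learners in a district, which plausibly confounds the relationship between student-teacher ratios and test scores.

```r
# Multivariate extension of the simple model above
library(AER)
data(CASchools)
CASchools$score <- (CASchools$read + CASchools$math) / 2
CASchools$STR   <- CASchools$students / CASchools$teachers

multi_model <- lm(score ~ STR + english, data = CASchools)
summary(multi_model)

# Prediction for a hypothetical district: 20 students per teacher,
# 15 percent English learners
predict(multi_model, newdata = data.frame(STR = 20, english = 15))
```

Note how the coefficient on STR changes once english is included – that shift is the omitted variable bias at work.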

One of the last points I make in the video on “Prediction” is that we should avoid so-called “kitchen sink” or “garbage can” regression models that contain lots of control variables not rooted in good theory. So what criteria should guide our control strategy? Let’s see what other scholars suggest.

First read Chapter 4 from John Martin’s book on “Thinking through statistics.” What strategies does the author suggest? What pitfalls can crop up as we embark on a control strategy, and how do we avoid them?

How do Martin’s arguments play out in actual scholarship? Earlier in the quarter, we read the first portion of Koch and Nicholson’s article on “Death and Turnout.” You should now read the remainder of the article. Here are some questions to guide your reading and prepare you for the practice questions and exercise:

  • What is the authors’ dependent variable?
  • What is their main independent variable?
  • What is the expected direction of the relationship between IV and DV?
  • What are some variables that may confound this relationship?
  • What variables do the authors control for and why? (Focus on their discussion of Table 4 on p. 942).
  • What do you think about these controls? Do you buy the authors’ reasoning? Why or why not?
  • Finally, pay close attention to the authors’ interpretation of their results. Note the specific language they use to discuss statistical and substantive significance. (As above, focus on their discussion of Table 4).

Practice questions and exercise

Now complete the Module 8 practice questions. These are short and straightforward; I have done much of the coding for you as a means of walking you through running and presenting multivariate regression in R. But note that Exercise 4 builds directly from the practice questions, so be sure to work through them.

Once you have completed the practice questions, be sure to download and complete Exercise 4.

Module 7: Bivariate regression


In Module 7, we begin conducting bivariate analysis. Toward this end, the Module introduces various tools for examining linear relationships between variables and testing them for statistical significance. In particular, the Module moves from measures of joint fluctuation such as covariance and correlation to bivariate linear regression.


  • Calculate covariance and correlation as well as explain what these measures capture conceptually.
  • Explain how the OLS regression line helps us model the relationship between two variables.
  • Perform bivariate OLS regression in R and interpret the regression output.


Assignment 3 is due at 11:59pm CST on Monday, 3/01. The Assignment 3 RMD file can be downloaded at the “Files/Exercises and Assignments” section at Canvas. Be sure to download and review the file early in the week so that you can plan your time accordingly.


The video lectures for this week kick off with a very brief review of some of the assumptions that power the Central Limit Theorem. Watch the video on “CLT assumptions revisited” (I note that this title differs from the text shown on the title slide of my presentation.)

Let’s now turn to the central topic for this week’s Module: bivariate analysis. Note that so far, we have been performing what’s known as “univariate” analysis. This means that we have been examining single variables rather than relationships between variables. For instance, in Module 6, we focused primarily on the mean of a single variable such as annual income. (Recall our running Somersville example.) We then used hypothesis tests to assess whether some observed value of our sample mean provided evidence to refute some hypothesized value of the underlying population mean.

For the remainder of the quarter, we turn our attention to “bivariate” and “multivariate” analysis. That is, we will examine relationships between two or more variables. Ultimately, the basic procedures for such analysis are very similar to those we have used up until now: We will use sample statistics to estimate population parameters; we will calculate the typical error associated with our estimates; we will construct CIs; we will calculate test statistics; and we will perform hypothesis tests.

Measures of joint fluctuation

Start by reading Imai, K. (2017). Quantitative Social Science: An Introduction. Read Sections 3.6, 4.2.1, and 4.2.2. The PDF is available at the “Files” section at Canvas.

Now watch the two videos on “Correlation” (Introduction and Significance Tests, respectively).
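You can compute the measures from the videos directly in R. Here is a sketch using the built-in mtcars data (the videos may use different examples):

```r
# Covariance, correlation, and a significance test (built-in mtcars data)
data(mtcars)

cov(mtcars$wt, mtcars$mpg)   # covariance: the sign gives the direction only
cor(mtcars$wt, mtcars$mpg)   # correlation: unit-free, bounded by -1 and 1

# Test whether the correlation is statistically distinguishable from zero
cor.test(mtcars$wt, mtcars$mpg)
```

The contrast between cov() and cor() is worth dwelling on: covariance depends on the variables’ units, which is why correlation is usually the more interpretable summary.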

Bivariate regression: Introduction

Now read OIS Sections 7.1–7.4. Then, watch the video series on “Bivariate regression.” Note that I have divided what would normally be a single lecture into a series of short videos organized by sub-topic.

Bivariate regression: The regression line

Bivariate regression: Calculating regression coefficients

Bivariate regression: Prediction, tests of significance, and interpretation

Once you have completed your reading and watched the video lectures above, download and complete the Module 7 practice questions. They can be downloaded at the “Files/Practice Questions” section at Canvas.

Finally, be sure to download and complete Assignment 3, which is located at the “Files/Exercises and Assignments” section at Canvas.

Module 6: Hypothesis testing


Module 6 builds on Modules 4 and 5 on “Foundations for inference” by examining (univariate) hypothesis testing. We examine the t-distribution as a means of accounting for uncertainty in our estimates when our sample size is small or the population standard deviation is unknown. We then examine hypothesis tests for one and two populations using the t-distribution, and we begin to conduct such tests in R. Finally, we read and analyze one especially prominent and recent use of hypothesis testing in US politics.


  • Explain what a t-distribution is and why we use it.
  • Define null hypothesis, alternative hypothesis, and type 1 and 2 errors.
  • Explain and put to practice the steps in conducting hypothesis testing.
    • State a null and alternative hypothesis
    • Calculate appropriate tests statistics
    • Calculate p-values
    • Compare p-values against different significance levels and interpret the results
  • Critically analyze real-life usages of hypothesis testing.


Exercise 3 is due at 5pm CST on Monday, 2/22. The Exercise 3 RMD file is located at Canvas. Be sure to download and skim through the file early in the week so that you can plan your time accordingly.

Mid-course feedback

Please look for a Canvas announcement from me later in the week with a link to a Google Form that will allow you to provide me with anonymous, mid-course feedback. I am always looking for ways to improve my instruction and better meet your needs and aims, so please do take a moment to complete the Form once it is published.



We’re going to begin by refining our grasp of standard errors and confidence intervals. In particular, we’re going to examine the t-distribution as a conservative alternative to the normal distribution that will allow us to account for additional uncertainty whenever our sample size is small or we estimate the population standard deviation using the sample standard deviation.

Now watch the video on the “t-distribution.”
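A quick way to see the video’s point for yourself is to compare critical values from the two distributions in R (my own illustration, not from the video). The t critical value is larger at small sample sizes – building in extra uncertainty – and approaches the normal value as the degrees of freedom grow.

```r
# 95% two-sided critical values: normal vs. t at various degrees of freedom
qnorm(0.975)           # normal: about 1.96
qt(0.975, df = 5)      # small sample: about 2.57 -- wider intervals
qt(0.975, df = 30)     # about 2.04
qt(0.975, df = 1000)   # nearly indistinguishable from the normal
```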

(Univariate) Hypothesis testing

Now read OIS Sections 4.3 through 4.5. Note that the Syllabus lists some additional sections of OIS that are suggested but optional.

After you have completed the reading, watch the video on “Hypothesis testing 1.”

Now watch the video on “Hypothesis testing 2.”

I realize that the videos for this module are longer than usual, but I wanted to be sure to walk through some examples step-by-step so that you can begin conducting your own hypothesis tests in the practice questions and exercise.
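The steps from the videos map directly onto R’s t.test() function. Here is a sketch with simulated data (not the course example) showing a one-sample test and a two-sample comparison:

```r
# Hypothesis tests with t.test() (simulated, hypothetical data)
set.seed(1)
income <- rnorm(100, mean = 52, sd = 10)   # hypothetical sample

# One-sample test of H0: mu = 50 (two-sided by default)
t.test(income, mu = 50)

# Two-sample test comparing the means of two groups
group2 <- rnorm(100, mean = 49, sd = 10)
t.test(income, group2)
```

The output reports the test statistic, degrees of freedom, p-value, and a confidence interval – exactly the quantities you will compute and interpret in the practice questions.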

Hypothesis testing in the 2020 US presidential election

I am now going to ask you to read an especially prominent and recent use of hypothesis testing in US politics. Specifically, we are going to read excerpts from the State of Texas v. the Commonwealth of Pennsylvania, the State of Georgia, and the State of Michigan. If you are attentive to US politics and followed the aftermath of the most recent presidential election, you may already be somewhat familiar with this case.

I should state at the outset that I selected this reading for its relevance. (Indeed, what could possibly be more relevant? Here is a recent legal case that uses hypothesis tests to substantiate claims about fraud in a presidential race!) I realize that the reading centers on a sensitive political issue. Accordingly, as we analyze and discuss the text, let’s remember to do so critically but cordially. And, let’s be sure to limit our discussion to what is relevant to the course objectives.

Now read State of Texas v. Pennsylvania, Georgia, and Michigan.

Here are some discussion questions to accompany your reading. Please try to answer the questions upon completing the reading, and be prepared to discuss them when we meet in Week 7:

  • What is the author’s main claim (i.e., the takeaway of the statistical test)?
  • Describe the nature of the statistical tests conducted by the author:
    • What are the null and alternative hypotheses for each test?
    • Given what you learned about hypothesis testing above, and assuming that the author’s math is correct, what are the results of the tests? (Be precise here.)
  • How do you interpret the results of the tests? How does the author interpret them? What do you think about this interpretation?

Practice Questions and Exercise

Once you have worked through the videos and reading, complete the Module 6 Practice Questions, which guide you through several hypothesis tests in R. Once you have finished the Practice Questions, be sure to complete Exercise 3.