Module 3: Know your data (and where it came from)!


Module 3 builds on the concepts surrounding data and measurement that we tackled last week by asking you to think critically about some fundamental questions: Where do your data come from? Who made them and why? What do they look like? And so on.

In addition, Module 3 asks you to grapple with how we can move toward making causal claims in the social sciences as well as analyze how scholars (i.e., Koch and Nicholson) have tried to overcome obstacles to causal inference.


  • Explain the distinction between experimental and observational data; explain why random assignment to treatment can facilitate causal inference
  • Lay out obstacles to causal inference when using observational data as well as some strategies for (potentially) addressing them; analyze scholarship in light of these obstacles and strategies
  • Develop additional tools to assist data exploration in R
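
To preview the first objective, here is a minimal simulation sketch in R. It is not taken from the videos or readings, and every number in it is made up: the true treatment effect (2), the strength of the unobserved confounder `u` (3), and the sample size are all assumptions chosen for illustration. The sketch compares a simple difference in means when units select into treatment with the same comparison when treatment is assigned by coin flip.

```r
set.seed(42)                 # make the simulation reproducible
n <- 10000                   # hypothetical sample size
true_effect <- 2             # assumed true effect of treatment on the outcome
u <- rnorm(n)                # unobserved confounder (made up for illustration)

# Observational setting: units with higher u are more likely to be treated
treat_obs <- as.numeric(u + rnorm(n) > 0)
y_obs <- true_effect * treat_obs + 3 * u + rnorm(n)

# Experimental setting: treatment is assigned by coin flip, independent of u
treat_exp <- rbinom(n, 1, 0.5)
y_exp <- true_effect * treat_exp + 3 * u + rnorm(n)

# Difference in mean outcomes between treated and untreated units
diff_in_means <- function(y, treat) {
  mean(y[treat == 1]) - mean(y[treat == 0])
}

est_obs <- diff_in_means(y_obs, treat_obs)
est_exp <- diff_in_means(y_exp, treat_exp)

est_obs   # biased upward, because treated units also tend to have higher u
est_exp   # close to the assumed true effect
```

Under these assumptions, the observational comparison mixes the effect of treatment with the effect of `u`, while random assignment breaks the link between `u` and treatment.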

Assignment 1

Reminder: Assignment 1 is due by 5pm CST on Monday, 2/1. I suggest that you download and preview Assignment 1 early in the week so that you can plan your time accordingly.

Where do your data come from?

Watch the brief video “Where do your data come from?” Like last week, the video will sometimes ask you to press “pause” and answer some “class questions.”

Now watch the video “Where do your data come from? (Part 2)”.

I mentioned in the videos that our main aim as social scientists is to make valid causal claims. But, as the videos also noted, there are a host of obstacles in our path, including bad data.

Read Martin, Thinking through Statistics, Chapter 2 (“Know your data”). As you read, pay close attention to different data problems and, in particular, how those data problems have at times ruined prominent research. What can we do to keep ourselves from ruin?

Toward causal inference

Now watch the final video on “Toward causal inference.”

The video suggests that we can sometimes move toward causal inference by moving from the level of general laws to the level of mechanisms. How does this play out in scholarship?

Read Koch and Nicholson (2016, pp. 932–938). As you read, pay close attention to the authors’ argument and discussion of their empirical setup.

(1) What do the authors argue? (You might even take some time to diagram their argument.) (2) How do the authors try to move from establishing mere correlation to identifying causation? (3) Why does the structure of their argument potentially help them toward this end?

Now examine the regression table from the authors’ aggregate analysis. (4) Why do the authors control for the variables that they do? (5) How compelling is the aggregate analysis for helping the authors to make a causal claim? (6) Whether you think it is compelling or not, what work is the aggregate analysis doing for the authors?

Practice questions

In the spirit of really getting to know our data, the practice questions for this week will introduce some techniques that will ease the data cleaning and exploration process. In particular, we will practice using loops with for() as well as the dplyr package to summarize and reshape our data.
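
As a rough preview (a sketch, not part of the practice questions), the code below computes a group mean two ways: with a for() loop and with dplyr's `group_by()` and `summarise()`. The data frame `toy` and its columns `region` and `score` are hypothetical, invented for this example. The dplyr step only runs if the package is installed, and the sketch covers summarizing only, not reshaping.

```r
# Hypothetical toy data: six made-up observations in three regions
toy <- data.frame(
  region = c("A", "A", "B", "B", "C", "C"),
  score  = c(2, 4, 5, 7, 1, 3)
)

# Approach 1: a for() loop that fills in one group mean per iteration
regions <- unique(toy$region)
loop_means <- numeric(length(regions))   # empty container for the results
names(loop_means) <- regions

for (r in regions) {
  loop_means[r] <- mean(toy$score[toy$region == r])
}
loop_means

# Approach 2: the same summary with dplyr (skipped if dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_means <- dplyr::summarise(
    dplyr::group_by(toy, region),
    mean_score = mean(score)
  )
  print(dplyr_means)
}
```

Both approaches should give the same group means. The loop makes each step explicit, while the dplyr version is usually shorter and easier to read once you have several grouping variables.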

Download the “Module 3 practice questions” RMD files (Parts 1 and 2) from Canvas. Note that there are two parts to the practice questions. Also note that the practice questions on loops include a couple of especially challenging questions (e.g., the de Montmort Problem). I suggest that you complete the simpler questions and then move on. You can return to the more challenging ones after you have completed Part 2 as well as Assignment 1.

Module 2: Cases, variables, and measurement


Module 2 asks you to engage with some fundamental questions: What are data? Cases? Variables? It additionally asks you to grapple with critical issues surrounding measurement and to consider how these issues play out in Daniel Treisman’s famous article “The causes of corruption” (2005). Finally, the module introduces important statistical and visual tools for exploring and describing variables.


  • Define data, cases, and variables
  • Explain the qualities of “good” measurement; begin to analyze scholarship in light of these qualities
  • Grasp the intuitions behind common measures of central tendency and spread; become familiar with their notation and learn to calculate them in R
  • Explore data in R using summary functions, tables, and plots

Exercise 1

Download, complete, and submit Exercise 1 by 5pm CST on Monday, 1/25. The file is available at Canvas. I recommend that you preview the Exercise early in the week so that you can plan your time accordingly.

What is data?

Read OIS sections 1.1–1.5. Some of this material may be review, but don’t worry if it isn’t!

Now watch the brief video on “What is data?”

Variables and measurement

As the video mentions, we will dig into some actual data as we complete the practice questions. But first, watch the video on “Variables and measurement.” Note that the video will occasionally ask you to press “pause” and then spend some time answering some “class questions.” You don’t have to submit your answers to these questions. However, quickly jotting your answers down may be useful, as we will discuss some of the questions during our section meetings.

How do issues surrounding conceptual clarity, validity, and reliability play out in social science scholarship? In a moment, you will read “The causes of corruption,” by Daniel Treisman (2005). This is a famous article that uses linear regression to test common hypotheses about the causes of corruption worldwide. Before you read, take some time to answer a few questions. (As above, I suggest that you jot your answers down somewhere.) (1) What is corruption? Define it. (2) Imagine that you wanted to measure corruption levels across countries. How would you go about doing that? What kind of data would you look for? (3) What are some advantages of your approach? What are some of its disadvantages?

Now read the article.

Once you have finished reading, answer these questions: (1) Do you think that Transparency International (TI) Index scores are valid and reliable as a measure of corruption? (2) Treisman makes the case that they are. What does he argue to support the validity and reliability of his measure? (3) Are you persuaded? Why or why not?

Describing variables

Now read OIS sections 1.6 and 1.7. Then, watch the video on “Describing Variables: Measures of Central Tendency and Spread.”
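
If you would like to try these measures right away, here is a minimal sketch in R. The vector `x` is made up for illustration, and the "by hand" variance uses the usual sample formula with n − 1 in the denominator, which is also what R's `var()` computes.

```r
# Hypothetical data with one unusually large value
x <- c(2, 4, 4, 5, 7, 9, 25)

# Measures of central tendency
mean(x)      # arithmetic mean; pulled upward by the large value
median(x)    # middle value; less sensitive to the large value

# Measures of spread
var(x)       # sample variance
sd(x)        # sample standard deviation (square root of the variance)
IQR(x)       # interquartile range
range(x)     # minimum and maximum

# Sample variance by hand: sum of squared deviations divided by (n - 1)
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
manual_var
```

Comparing `mean(x)` with `median(x)` here is one quick way to see how a single extreme value can shift the mean.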

And now, watch the video on “Describing Variables: Tables and Plots.”
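
The sketch below shows a few base R functions that are commonly used for tables and plots. It is only an illustration: the vectors `regime` and `scores` are made up, and the practice questions may ask you to use different functions or data.

```r
# Hypothetical categorical variable: regime type for six made-up countries
regime <- c("democracy", "autocracy", "democracy",
            "hybrid", "democracy", "autocracy")

# Frequency table and table of proportions
counts <- table(regime)
counts
props <- prop.table(counts)
props

# Bar plot for a categorical variable
barplot(counts, main = "Regime types (made-up data)", ylab = "Count")

# Hypothetical numeric variable, shown as a histogram
scores <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.5, 6.1, 7.3, 8.0, 9.4)
hist(scores, main = "Scores (made-up data)", xlab = "Score")
```

Tables and bar plots suit categorical variables, while histograms suit numeric ones.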

Let’s apply some of the concepts presented above! Download the “module 2 practice questions” RMD from Canvas (“Files/Practice Questions”). Note that there are 2 parts to the practice questions. Complete the practice questions in RMD and then knit your files to PDF.

Remember: If you get stuck at any point… breathe. Coding can be frustrating at first, but we will work through it together. There are lots of ways to seek help:

  1. Use the “Help” tab in RStudio
  2. Internet search
  3. Post your question to Ed Discussion
  4. As a final option, email me directly or visit me during my office hours

As you seek help, try to specify the nature of the problem: Examine any warnings or error messages. What line of code seems to be the issue? Which function, specifically? (During knitting, Markdown will often tell you which line of code is stalling the knitting process.) If you are getting error messages, are you missing parentheses, commas, or quotation marks? (This happens to me all the time.) Answering these questions will help to ensure that you get the help you need.