Module 2: Summarizing data, numerically and visually

SOSC 13200-2


Module 2 asks you to engage with some fundamental questions: What are data? Cases? Variables? It additionally asks you to grapple with critical issues surrounding measurement, as well as consider how these issues play out in common measures of human behavior and well-being. Finally, the Module draws on the data from Card & Krueger (1994) to explore some important statistical and visual tools for exploring and describing single variables.


  • Define data, cases, and variables
  • Explain the qualities of “good” measurement; explore how measurement issues play out in social science research.
  • Grasp the intuitions behind common measures of central tendency and spread; become familiar with their notation and learn to calculate them in R
  • Explore data in R using summary functions, tables, and plots
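
As a preview of the last two objectives, here is a minimal sketch of common summary functions in base R. The wage and chain values are invented for illustration (they are not Card & Krueger's data), and the variable names are placeholders.

```r
# Hypothetical data: starting wages (in dollars) at ten made-up restaurants.
# These values are invented for illustration; they are NOT from Card & Krueger.
wage  <- c(4.25, 4.25, 4.50, 4.75, 5.00, 5.05, 5.05, 5.05, 5.25, 5.50)
chain <- c("A", "B", "A", "C", "B", "A", "A", "C", "B", "A")

# Measures of central tendency
wage_mean   <- mean(wage)
wage_median <- median(wage)

# Measures of spread
wage_var <- var(wage)   # sample variance (divides by n - 1)
wage_sd  <- sd(wage)    # sample standard deviation (square root of the variance)
wage_iqr <- IQR(wage)   # interquartile range

# A quick numerical overview of one variable
print(summary(wage))

# A frequency table for a categorical variable
chain_counts <- table(chain)
print(chain_counts)
```

Note that var() and sd() compute the sample versions of these quantities, which divide by n - 1 rather than n.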

To Do

Download, complete, and submit Assignment 2 by 11:59pm on 1/24. The file will be available for download on Tuesday, 1/18. I recommend that you preview the Assignment shortly after it is posted so that you can plan your time accordingly.

…And Now, The Module

What are data?

Start by watching the video on “Data” below.

Variables and measurement

As the video mentions, we will dig into some actual data as we complete the practice questions. And, Assignment 2 is an opportunity to replicate portions of Card & Krueger’s own summary of their data. But first, watch the video on “Variables & Measurement.” Note that the video will occasionally ask you to press “pause” and then spend some time answering some “class questions.” You don’t have to submit your answers to these questions. However, quickly jotting your answers down may be useful, as we may circle back to some of the questions in our next class meeting.

How do the issues surrounding conceptual clarity, validity, and reliability play out in real life? In a moment, you will read “Measuring and Understanding Behavior, Welfare, and Poverty” by Nobel Prize-winning economist Angus Deaton. Before you read, take a moment to answer these questions:

  • What measures of human welfare and poverty can you think of?
  • How good do you think these measures are?
  • What are some potential problems with the conceptual clarity, validity, and reliability of these measures?

Now read the article. As you read, think about the different examples that Deaton lays out. What sort of violation does each one exemplify (conceptual clarity, validity, or reliability)? And critically, what is at stake?

Summarizing variables numerically

Before watching the videos below, you should first read the assigned excerpts from the Verzani (Simple R) and Wickham (ggplot2) texts. These will be useful as you engage, and in some cases, follow along with the videos.

You should also read Card & Krueger’s (1994) article on “Minimum Wages and Employment.” Many of you will remember the article from the fall quarter. As we revisit the article now, we are mainly concerned with getting a strong grasp of the authors’ data: What are the units? What sorts of variables do the data contain? How do the authors describe their data, numerically and visually? Accordingly, pay particular attention to the authors’ description of the dataset as well as any tables and figures that summarize the different variables.

Once you have completed the readings above, watch the videos on “Summarizing Variables Numerically” and “Summarizing Variables Visually.”

Let’s get some additional practice with summarizing data in R. On Monday evening, I will post a practice exercise for the week that will guide you through some basic operations that you will later apply on Assignment 2. The practice exercise will be located at Canvas under “Files/Practice Exercises.” Recall that although you should complete the practice exercise, you do not have to submit it to me for a grade. It is for your learning only.

Remember: If you get stuck at any point… breathe. Coding can be frustrating at first, but we will work through it together. There are lots of ways to seek help:

  1. Use the “Help” tab in RStudio
  2. Internet search
  3. Post your question to Ed Discussion
  4. As a final option, email me directly or visit me during my office hours

As you seek help, try to specify the nature of the problem: Examine any warnings or error messages. What line of code seems to be the issue? Which function, specifically? (During knitting, Markdown will often tell you which line of code is stalling the knitting process.) If you are getting error messages, are you missing parentheses, commas, or quotations? (This happens to me all the time.) Answering these questions will help to ensure that you get the help you need.

Module 1: Course Intro. & Research Questions

SOSC 13200-2 (WIN22)


Module 1 has three main components: (1) We begin by recapitulating some key points from Tuesday’s course overview. (2) We then get to know R, RStudio, and Markdown via a couple of brief videos and a practice exercise. (3) Finally, we start thinking about the nature of research questions in the social sciences: What are the attributes of a good research question? How do we go about formulating one?


  • Become familiar with the RStudio interface and R’s basic functionality.
    • Create a class directory on your personal laptop (using an R Project, if you are up for it).
    • Install and load external packages such as the ggplot2 and here packages.
    • Load Excel-style datasets into R using functions such as read.table(), read_dta(), read.csv(), etc.
    • Begin to explore and manipulate datasets.
  • Lay out some attributes of good research questions in the social sciences, and how we can begin to assess them.
  • Explain the distinction between correlation and causation, and some of the assumptions that get us from the former to the latter.
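
The R-related objectives above can be sketched roughly as follows. This is only an illustration: the toy dataset, its column names, and the file name toy_data.csv are made up, and the package lines are commented out so that the sketch runs even before the packages are installed.

```r
# Installing is done once per machine; loading is done once per R session.
# install.packages(c("ggplot2", "here"))
# library(ggplot2)
# library(here)

# Write a tiny, made-up dataset to a temporary CSV file so that this
# example is self-contained. (In practice, you would download a real file.)
toy <- data.frame(
  id        = 1:4,
  state     = c("NJ", "NJ", "PA", "PA"),
  employees = c(20, 15.5, 30, 22)
)
csv_path <- file.path(tempdir(), "toy_data.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Load the dataset into R
dat <- read.csv(csv_path)

# Begin to explore it
str(dat)          # variable names and types
print(head(dat))  # first few rows
print(nrow(dat))  # number of cases

# Begin to manipulate it
nj <- dat[dat$state == "NJ", ]     # keep only the New Jersey rows
dat$large <- dat$employees > 20    # create a new TRUE/FALSE variable
```

The same pattern applies to other formats: read.table() is also part of base R, whereas read_dta() comes from the haven package, which must be installed and loaded first.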

Tasks to complete in Week 1

  • Send an email to Professor Deming with your preferred meeting times for Week 2.
  • Download and install R, RStudio, and LaTeX. Installation instructions are located at the “Files” section of Canvas.
  • Complete this Module.
  • Download, complete, knit, and submit Assignment 1 as a PDF to Canvas by 11:59pm on Monday 1/17.

And now, Module 1…

Course overview

If you were not able to attend our first meeting on 1/11, you should first go to the course Canvas page to download the course Syllabus. Read it. This will ensure that you do not miss any important logistical points.

Introduction to R, RStudio, and Markdown

If you have not already done so, download and install R, RStudio, and LaTeX. I have posted detailed installation instructions to the “Files” section of Canvas. If you are still having trouble with installation after carefully following the instructions, please meet with me during our regularly scheduled meeting time on Thursday 1/13. Together, we will get you up and running.

Once you are up and running in R, watch the “Introduction to R” video below.

Let’s now get to know the RStudio interface. Start by watching the “R Demonstration” video below. 

We will work in R and RStudio extensively this quarter. In class, my lecture slides will regularly display R code, and I will often ask you to work in small teams to complete practice exercises in R. The aim of these exercises is to boost your programming skills. But more importantly, they will help you apply key statistical concepts and more deeply engage the research that we examine by allowing you to “get under the hood” of the data used by different authors.

Let’s start building our programming skills via a short exercise. Go to the “Files” section of Canvas. You will see a folder entitled “Practice Exercises.” Open it. Then, download the file entitled “module1_practice_exercise.” It is an R Markdown (RMD) file. Save it to your personal computer, open it in RStudio, and work through it. When you are finished, knit it to PDF format. You do not have to submit it to me.

When you get stuck on a practice exercise or assignment: First, breathe! R programming can be frustrating, especially at first. Then, do some googling to see if you can find a solution to the problem. After that, post your question to the class Ed Discussion site. (Be sure to review older posts to see if your question has already been answered.) Finally, note that you are welcome to complete exercises and assignments in teams of 3-4 students (in fact, I encourage it!). Just be sure to follow the guidelines that I have laid out in the Syllabus.

Research questions

At the end of this course, you will submit a brief report in which you lay out a research question and begin to assess it via analysis of some quantitative data. With this in mind, what makes a good research question? How do we go about formulating and, ultimately, assessing one?

The readings by King, Keohane, and Verba (1994) and Holland (1986) will help us begin answering these questions. We will discuss the readings when we meet in our small teams during Week 2, so be sure to jot down key points.

Let’s start with KKV (1994). This comes from what is perhaps the most famous (for some scholars, infamous) contemporary text on the nature of social science research. Before you read, think about your own answers to the two questions that I raised above: What makes a good research question? How would you go about formulating one?

Once you have laid out your own preliminary answers to these questions, read the chapter. Here are some discussion questions that you should try to answer. We will circle back to them during our small-team meetings in Week 2:

  • The authors say that the aim of social science is descriptive and causal inference. What is inference? What is causal inference?
  • What are some attributes of a good research question?
  • Based on your reading, how might the authors answer the second question above? How does this coincide with and/or differ from your own preliminary answer?

Once you have read KKV (1994) and thought about your answers to the questions above, we can move to Holland (1986). This is not an easy read, but it is a good one. In particular, it will help us think carefully about an important distinction between simple correlation and causation. Before you read, consider how you think about this distinction: What is the difference between correlation and causation?

You should now read Holland (1986). As with KKV (1994), here are some discussion questions that you should try to answer:

  • What does Holland mean when he refers to causation? How does this differ from the way other authors (and philosophers) might use the term?
  • What is the Fundamental Problem of Causal Inference (FPCI)?
  • How can we overcome the FPCI, according to the author?
  • Do you see any problems with the solutions in practice?
  • How do correlation and causation differ? What might Holland say here?

Assignment 1

Download “assignment1” from Canvas. It is located at “Files/Assignments and Final Project.” It is in RMD format. Open Assignment 1 in RStudio and read the instructions carefully. Complete it, knit it to PDF, and submit it to Canvas. It is due at 11:59pm on Monday 1/17.

Recall from the video above that, although practice exercises will guide you through some of the coding procedures that you will later use to complete assignments, they will not guide you through all of the required procedures. That is, completing the assignments will require some additional, independent learning of R. With this in mind, be sure to see my earlier note about how to go about learning R (i.e., googling, review of coding fora, Ed Discussion, and teaching and learning from each other).

SOSC 13100

Week 1: Thinking about social science concepts and asking good empirical questions.


We will first build on our discussion on Tuesday by revisiting some of the central questions from our discussion while reading Mlodinow’s “Peering through the Eyepiece of Randomness.” We will then examine some of the challenges that crop up when studying many social concepts, and we will examine some solutions for overcoming them. Finally, we examine some attributes of good empirical questions and, ultimately, root them in Friedman’s theory of positivist social science.

How do we study random social events?

On Tuesday, we discussed three fundamental questions: (1) Why study random social events? (2) What do we mean by “random” event? And (3) How can we go about studying random social events? 

Read Mlodinow’s chapter on “Peering through the Eyepiece of Randomness.” The reading builds on our discussion surrounding these three questions. Accordingly, as you read, keep these questions in mind. In particular:

We saw that one reason to study random social events was to avoid policy and decision-making based on poor intuition. How do findings derived from empirical study diverge from intuition in the different examples laid out by Mlodinow? Can you think of other examples of such divergence?

How can we go about studying a random social event? Here, consider Roger Maris’ “fluke” breaking of Babe Ruth’s record for most home runs in a single season in 1961. How does Mlodinow go about “studying” this random event in the chapter? What sorts of questions is he able to tackle through his study?

Once you have completed the reading, watch the brief recap video on “How do we study random social events?”

NOTE: At one point in the video, I lay out an example to illustrate the “flavor” of Bayesian statistics: namely, that probability can be viewed as subjective; probabilities are what we “feel” they are, and we constantly update our beliefs about probabilities based on observed data and our experience. The six-sided die (99,999 heads in a row) and Sally Clark examples begin to illustrate this idea. But note that the Sally Clark example is *not* inconsistent with the frequentist approach to probability, which views this illustration as a simple question of independence versus dependence of two events.
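
To see why independence versus dependence matters, here is a small arithmetic sketch in R. The probabilities are hypothetical numbers chosen only to make the arithmetic easy; they are not figures from the video or from any real case.

```r
# Hypothetical probabilities, invented purely for illustration.
p_first              <- 1 / 1000  # P(A): a rare event occurs once
p_second_given_first <- 1 / 100   # P(B | A): it occurs again, given that it occurred once

# If the two events were independent, P(B | A) would equal P(B) = P(A),
# so the joint probability would be P(A) * P(A).
p_both_if_independent <- p_first * p_first

# If the events are dependent, the multiplication rule uses P(B | A) instead.
p_both_if_dependent <- p_first * p_second_given_first

# How badly does wrongly assuming independence understate the joint probability?
understatement <- p_both_if_dependent / p_both_if_independent
print(understatement)
```

Under these made-up numbers, assuming independence understates the joint probability by a factor of 10. The general point is that P(A and B) = P(A) × P(B) holds only when the two events are independent.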

Thinking about social science concepts

Social science investigates relationships between big concepts that are often hard to define and hotly contested. What is democracy, for instance? Is democracy competitive elections? Competitive elections plus basic freedoms? 

Watch the video below. The video briefly lays out some of the challenges of conceptualization in the social sciences. It then lays out two common approaches to these challenges. Keep both the challenges and approaches in mind as you complete your reading further below.

Each of the solutions presented in the video has drawbacks. The first solution, in particular, isn’t a solution at all! To do quantitative social science research, we need to be able to apply concepts across a large number of cases. What are some problems with the second solution?

You will now read Fearon and Laitin’s paper on “Ordinary Language.” The authors grapple with the tricky issues raised in the video as they relate to the concepts of ethnicity and ethnic violence. The authors then lay out a third solution: rooting concepts in an analysis of their meaning. As you read, here are some questions to think about:

  • What do the authors say about the quantoid and interpretivist responses to confusion surrounding social science concepts? What do you think of the authors’ arguments here?
  • What is the third solution/response proposed by the authors? How does it differ, in particular, from the quantoid response? Do you see any problems with the authors’ approach?

Asking good empirical questions 

In SSI, we will focus on empirical questions. In the last video, we examine what this means exactly, and we begin thinking about some of the attributes of good empirical questions. 

In perhaps the most famous piece on social science methodology of the 20th century, Chicago economist Milton Friedman tackled some of the points that I raised in the video above. In particular, Friedman—as you will soon read for yourself—drew a strong distinction between normative and positivist theory. 

As you read, let’s push beyond the video content just presented. In particular, think about Friedman’s “theory of theory”: According to him, what should good social science be doing, such that we should be asking the sorts of questions that I presented in the video above?


We’re taking our discussion online this week. Specifically, after watching the videos and completing the readings, you should go to the course Canvas page and then to the Discussion Board section. You should then write a short (~1–2 paragraphs) response to one of the questions that I have posted there. There are three questions (all containing “Week 1” in their title), but you only need to answer one. This is meant to be low-stakes in that I will be grading for completion. But, for the sake of practicing good written communication, your response should give a clear answer to the question, and you should support your answer with specific reasons. Your reasons can be drawn from the readings and/or lectures, and you should also feel free to draw inspiration from others’ posts. 

Module 8: Writing up


Module 8 brings us full circle: We return to the overall structure of a research report and examine each component in detail. Specifically, we examine the purpose of each component as well as some best-practices for writing each one. In doing so, we examine how these best-practices actually play out in the article on “Judicial Empathy” by Glynn and Sen. The R exercise then invites you to explore and generate some visual tools that we have not yet covered, including interactive plots in HTML.


  • Lay out the different components of a social science research report.
    • Explain the purpose of each component as well as some best-practices for writing each one.
  • Explain how Glynn and Sen effectively tackle the different report components in their article on “Judicial Empathy.”
  • Generate a range of visuals for data description, including waffle plots, tree maps, and interactive plots.


  • Thu/Fri @ 8pm. Module 8 Q & A.
  • Fri @ 8pm. Written feedback on report drafts presented during the week. (See my announcement at Canvas, which contains a Workshop Schedule.) Your feedback should be constructive and address substantive and empirical concerns. No length is established, but in general, each draft will merit around 2 pages of single-spaced feedback. Post your feedback to your team page at Ed Discussion.

The Module

Overview and Review of Module 7

As I mention above, this week’s Module brings us full circle in that it re-examines the overall structure of a social science research report, which we first examined in Module 1. This week’s Module builds on Module 1 by examining each component of the research report in detail, laying out the specific purpose of each component as well as some best-practices to keep in mind as you write each one.

Start by watching the video on “Video series overview and Module 7 review.”

Writing up, step by step

Before we dig into the different components of the research report in the video series, you should read the chapter on “The challenge of writing up” by O’Leary and the article on “Judicial empathy” by Glynn and Sen. As you do so, focus on the different components of the research report: What work does each one do?

For the Glynn and Sen article specifically: Try to identify the different components of the research report. Then, try to answer these two fundamental questions:

  • What actually goes into a particular component?
  • What work is a particular component doing?

Here are some of the key components to examine and think about as you read Glynn and Sen. We will examine these as well as a few others in the videos:

  • Introduction
  • Argument and literature review
  • Data and Methods
  • Findings and Discussion
  • Conclusion

Once you have completed the readings and answered the questions above, watch the video series on “Writing up, step by step.” I have divided the series into several smaller components as a means of grouping ideas as well as keeping each video relatively brief.

Title, abstract, and introduction

Literature review and argument

Data description and method (analysis plan)

Findings and discussion

Conclusion and references and Wrap up

Module 8 R exercise on “More plots”

This week’s R exercise invites you to explore and generate some plots that we have not yet covered, including waffle plots, tree maps, and basic interactive plots. Note that for this Module, you should knit your completed RMD to HTML rather than PDF. Download the Module 8 Exercise RMD from Canvas. Complete it and then knit your completed file. You do not need to submit your work.

Module 7: The Literature Review


Module 7 examines literature reviews: What they are, what they are meant to do, and how we can go about generating them. The Module also explores the importance of surveying existing literature in order to conduct good research more broadly. Indeed, existing literature is fundamental for developing effective questions, arguments, and analysis plans. The weekly R exercise then invites you to reflect on the question of how we can go about visualizing uncertainty and variability in our data analysis.


  • Internalize the importance of existing literature for virtually every aspect of research.
  • Lay out the purpose of a literature review and describe the basic process for writing an effective one.
  • Examine two approaches to writing an effective literature review as they crop up in actual research.
  • Use statistical software to visualize uncertainty and variability in your own data.


  • Fri @ 8pm CST: 15-page draft of the research report. See guidelines below. Note that over the weekend, you should read drafts that will be workshopped in Week 8. I will disseminate a workshop schedule toward the end of the week.
  • Thu/Fri @ 8pm CST: Module 7 Q & A

Draft of the research report

Your draft should be roughly 15 pages. It should be double-spaced and use a 12-point serif font. Your draft should lay out (1) your research question and its motivation; (2) a brief literature review; (3) your argument; (4) your data and analysis plan with justification; and (5) two plots, tables, or other visualizations that motivate your question and/or support your argument.

Your draft should be complete and demonstrate thought and effort. That is, whereas this is meant to be a draft that you will revise and augment between now and the end of the quarter, it should nevertheless reflect that you have been thinking about and refining your question, argument, analysis, and grasp of the relevant literature during these past 7 weeks. In addition, your draft should not come across as having been hastily written; you should therefore revise your draft for clarity, precision, and concision before submitting it.

The Module

Review of Module 6

Module 7 focuses on the literature review. As with previous Modules, however, it begins with an overview of the Module videos as well as a very brief review of Module 6 on “Analysis Plans”. Watch the video on “Module 6 Review.”

Uses of Literature

The next video examines the role played by existing literature in different components of research. We have of course already examined this idea to a degree. For instance, in our discussion of research questions, we underscored the importance of surveying prior studies for generating and refining our questions. The video builds on this intuition by examining how we should be incorporating prior work at different points in the research process.

Watch the video on “Uses of Literature.”

The Literature Review

The next video is the real meat of the Module in that it lays out the literature review: What it is, what it is meant to do, and how we can go about writing one effectively. We also examine two different approaches to writing a literature review and what these approaches look like in actual scholarship. Toward this end, you should start by reading the articles by Schilbach on “Alcohol and Self Control” and Kocher et al. on “Rabbit in the Hat.”

Both of these articles are fascinating, so you should read them in their entirety. But, given this week’s topic, you should focus on how the authors engage existing literature on their topic in order to build their own arguments. Accordingly, here are some questions that you should try to answer as you read:

  • What is the focal relationship in each piece (i.e., What is the main IV and DV)?
  • How do the authors organize their engagement of the literature? Do they seem to focus on literature surrounding their IV, or their DV?
  • How do the authors group different citations? Why do you think that they grouped them in the way that they did?
  • Find one or two citations in each piece. What does the citation do for the author? What point does it help them make?

Now read Schilbach and Kocher et al. Once you have finished reading, watch the video on “The Literature Review.”


I should note that Zotero is an excellent resource for organizing citations and generating bibliographies. If you have not used Zotero before, I highly recommend it. A very brief introduction to using Zotero is available at the UC Library website.

R Exercise on Depicting Uncertainty

This week’s R exercise asks you to reflect on uncertainty: How to think about it, and how to report it visually. You will work with your own data to generate some tables and plots that aim to capture uncertainty and variability in your data. You are likely to include some of these in your final research report, so I highly encourage everyone to complete the exercise in R or Stata.

First read the very brief article by Hullman on “Confronting Unknowns.” Once you have read, proceed to the Module 7 R Exercise, which is available for download at Canvas. If you are using Stata, I have copy-pasted a portion of the Exercise instructions below:


The Module 7 RMD invites you to reflect on uncertainty: How to think about it, and how to capture it visually. I have therefore asked you to begin by reading Hullman’s brief article on “Confronting Unknowns.”

Critically, we often confuse uncertainty and variability. Uncertainty means that we do not know a particular quantity. For instance, we may not know with precision the proportion of voters that will vote Democrat in an upcoming election. By contrast, variability means that a particular variable can take on a range of different values. For instance, each time we conduct a survey, a different proportion of respondents will state that they plan to vote Democrat in an upcoming election.
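
A short simulation can make the distinction concrete. The sketch below is illustrative only: the “true” proportion, the survey size, and the number of surveys are assumptions that I have made up, and the confidence interval uses the usual normal approximation for a proportion.

```r
set.seed(13200)

# Assumed values, invented for illustration.
true_p    <- 0.52   # "true" share planning to vote Democrat (unknown in practice)
n         <- 1000   # respondents per survey
n_surveys <- 500    # number of simulated surveys

# Variability: each simulated survey yields a different sample proportion.
p_hats <- rbinom(n_surveys, size = n, prob = true_p) / n
print(range(p_hats))  # the spread of results across surveys
print(sd(p_hats))     # how much the survey results vary

# Uncertainty: in practice we observe one survey and do not know true_p.
# We report our estimate along with an approximate 95% confidence interval.
p_one <- p_hats[1]
se    <- sqrt(p_one * (1 - p_one) / n)
ci    <- c(lower = p_one - 1.96 * se, upper = p_one + 1.96 * se)
print(ci)
```

A histogram of p_hats would depict the variability across surveys, whereas a single point estimate with an error bar spanning ci would depict the uncertainty around one estimate.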

As with last week’s RMD, you will work with your own data. Start by cleaning your data as necessary. Then proceed below.

In this section, use your own data to create a plot or other graphic that depicts some quantity of interest AND the uncertainty OR variability around it. For example, your quantity of interest could be something along the lines of:

  • Average life expectancy across sub-Saharan Africa.
  • Average perceived level of inequality across countries.
  • Average decrease in enrollment across colleges and universities under Covid-19.
  • Proportion of Divvy users who own annual memberships (or month-to-month memberships).
  • Average decrease in expected educational achievement for each additional child added to a family.

You can choose from a wide range of ways for visually depicting these quantities. Here are some of these ways. Note that whereas I have included some potentially useful resources for generating these graphs in R, Stata users may need to use an internet search in order to find comparable resources:

Module 6: What kind of data analysis? (Part 2)


Module 6 is the second part of a 2-part series on data analysis. Whereas Module 5 focused mainly on assessing the credibility and relevance of data, Module 6 focuses on developing an analysis plan. We examine how a good analysis plan is linked to argument as well as how this plays out in the reading on Queens by Dube and Harish. We then examine some key considerations in developing an analysis plan.


  • Explain how data analysis is linked to argument.
  • Examine the three main “links” of an analysis plan and how they crop up in social science research.
  • Map out an analysis plan based on your argument, focal relationship variables’ type, and data structure.
  • Use your own data to create 1-2 visualizations that motivate your question or support your argument.


  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.

Module 6

Review of Module 5

To get started, watch the video on “Module 5 review.”

Analysis plans as argument

Whereas Module 5 focused on assessing the credibility and relevance of our data, Module 6 focuses on analysis plans: How to conceptualize them, what they look like in practice, and how we can go about developing them.

A running theme throughout the Module is that good data analysis is an extension of an argument: It should be tightly linked to the mechanisms that we lay out in our argument. The next video introduces this idea.

Now watch the video on “Analysis plans as argument.”

Anatomy of a plan: “Queens”

How do these three “links” play out in practice? What do they look like in scholarship? In a moment, I will ask you to read “Queens”, by Dube and Harish. I’ve selected this article for two reasons. For one, it tackles a fascinating question: Are states ruled by women less prone to conflict than those ruled by men? In addition, the article nicely exemplifies the three links that we examined above. Accordingly, as you read, try to answer the following questions:

  • What are the focal relationship variables?
  • What are the mechanisms that link the variables?
  • What are some observable implications of the mechanisms? Can you think of some that the authors do not address?
  • How is the analysis linked to the mechanisms?

We will tackle some of these questions in the video below, but you should try to answer them for yourself before watching. Now read “Queens” by Dube and Harish. When you have finished reading, watch the video on “Anatomy of a plan: Queens.”

Developing your analysis plan

So far, we have examined how data analysis is linked to argument. But there are a couple of other considerations that we should make as we develop our analysis plans. We briefly examine these considerations in the final video. As you watch, think about your own data in some detail: What type of variables are in your focal relationship? What is the overall structure of your data? What are the units of analysis?

The reason to keep your answers to these questions in mind is that, ultimately, they should inform your analysis. For instance, in Module 5, Papachristos et al. used so-called Poisson regressions rather than OLS because their dependent variable was a “count” variable that took the value 0 for many observations. Meanwhile, in Module 2, Albertus and Deming used fixed effects regression analysis because they were using panel data and worried that omitted country-level variables might otherwise bias their coefficient estimates.
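
To illustrate the first point, here is a minimal sketch that compares OLS and Poisson regression on a simulated count outcome. The data are simulated under assumptions that I have made up (they are not the data from Papachristos et al.), and the variable names are placeholders.

```r
set.seed(5)

# Simulated data: a count outcome that equals 0 for many observations.
# The coefficients -0.5 and 0.8 are arbitrary values chosen for illustration.
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(-0.5 + 0.8 * x))

# Share of observations with a count of zero
print(mean(y == 0))

# OLS treats the outcome as continuous and unbounded.
ols <- lm(y ~ x)

# Poisson regression models the log of the expected count.
pois <- glm(y ~ x, family = poisson(link = "log"))

print(coef(ols))
print(coef(pois))

# OLS can predict counts below zero; Poisson predictions are always positive.
print(min(fitted(ols)))
print(min(fitted(pois)))
```

Note that the two sets of coefficients are on different scales: the Poisson coefficients describe changes in the log of the expected count, so they should not be compared directly to the OLS coefficients.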

In this vein, upon watching the video, it may be worth reviewing your notes on the readings from past Modules. In particular: What sort of analysis did different authors perform? How does their analysis plan seem to be shaped by data and variable considerations?

Now watch the video on “Developing your analysis plan.”

Stata and R Exercise: Visualizing your data

The Module 6 R exercise is intended to get you thinking about the first 1-2 pieces of your final data analysis, which will very likely consist of some kind of table, plot, map, or other visualization. If you are a Stata user, you should complete the exercise in Stata. Stata has a user-friendly interface for generating simple plots such as histograms, scatterplots, and barplots. Note that in order to complete the exercise, you will need to have your data in hand. You do not have to submit your completed exercise.

Before completing the exercise, let’s reflect on the properties of good data visualization by reading “Aesthetics and technique in data graphical design” by Tufte. Go ahead and read the chapter.

When you have finished reading, follow the instructions that I have pasted below. If you are completing the Module exercises in R, note that these instructions are shown in the RMD file:

Exercise instructions

Be sure to read Tufte’s chapter on “Aesthetics and technique in data graphical design” before you complete this exercise. Think about Tufte’s advice about how to create an impactful graphic and try to implement it below. In particular, label your graphics, use nice colors, and tell a story.

A: Univariate description

Create two visualizations of the univariate distribution of your two main variables – that is, your focal relationship variables. Think of this as presenting your data to your readers. Be sure to consider your variables’ types and select your visualization type accordingly (e.g., don’t create a histogram for an indicator variable; use a barplot instead).
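As a starting point, here is a minimal sketch in base R of the histogram-versus-barplot distinction. The variables income and treated are simulated placeholders (assumptions for illustration); swap in your own focal variables.

```r
# Hypothetical data standing in for your own focal variables
set.seed(1)
income  <- rlnorm(200, meanlog = 10, sdlog = 0.5)  # continuous variable
treated <- rbinom(200, size = 1, prob = 0.4)       # indicator (0/1) variable

# Continuous variable: histogram with a title and axis label
h <- hist(income,
          breaks = 20,
          main = "Distribution of income (simulated)",
          xlab = "Income (dollars)",
          col = "steelblue", border = "white")

# Indicator variable: tabulate first, then draw a barplot of the counts
counts <- table(factor(treated, levels = c(0, 1),
                       labels = c("Not treated", "Treated")))
barplot(counts,
        main = "Treatment status (simulated)",
        ylab = "Number of cases",
        col = "darkorange", border = NA)
```

Labeling the factor levels before plotting means the bars read "Not treated" and "Treated" rather than 0 and 1, which is in the spirit of Tufte's advice to label your graphics.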

B: Create a bivariate graph

Create a visualization that begins to capture your theory about your focal relationship. That is, create a visualization of the association between your two main variables of interest. Think of this step as presenting your story (argument) to your readers.
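One simple way to begin, sketched here in base R with simulated placeholder variables x and y (assumptions for illustration, not real data), is a scatterplot with a fitted line. This suits two continuous variables; other variable types may call for a different kind of plot.

```r
# Hypothetical focal relationship: x is the independent variable,
# y is the dependent variable
set.seed(2)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100)

# Scatterplot with informative labels
plot(x, y,
     pch = 19, col = "steelblue",
     main = "Focal relationship (simulated data)",
     xlab = "Independent variable (x)",
     ylab = "Dependent variable (y)")

# Overlay the least-squares line to summarize the association
fit <- lm(y ~ x)
abline(fit, col = "darkorange", lwd = 2)

# Correlation between the two variables
cor(x, y)
```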


Using your data, create the ugliest and most useless graphic you can imagine. What makes it bad / useless?

Module 5: What kind of data analysis? (Part 1)


Module 5 examines data and data analysis. We examine how to evaluate the relevance and credibility of secondary data for our research as well as justify our data for readers. Module 5 also introduces the concept of focal relationships and examines how we can use focal relationships to guide the development of our data analysis plan. We examine how using focal relationships to guide our data analysis plays out in “More coffee, less crime” by Papachristos et al. Finally, the Module introduces regression analysis with panel data.


  • Evaluate the relevance and credibility of secondary data for your independent research; justify your data for readers.
  • Explain the concept of a focal relationship and how it should guide our data analysis (and all other aspects of our research).
  • Grasp the intuition behind fixed effects (FE) regression when analyzing panel data. Implement a simple FE regression in R and/or Stata.


  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.


Recap. of Module 4

Before we dive into Module 5 material, watch the video on “Recap. of Module 4.”

Evaluating and justifying your data

Most of you will analyze secondary data this quarter. Secondary data are data that you do not personally collect via surveys, interviews, and so forth. They are instead collected by someone else for some other study, and you are simply re-purposing them for your own study. You must therefore scrutinize the data: Are they credible and relevant for your purposes? What are their limitations? You must also answer these questions for your readers in your final research report.

The first video examines how to go about performing this sort of assessment. Watch the video on “Assessing and justifying your data.”

Using focal relationships to develop your analysis plan

Once you have obtained some (relevant and credible) data, you can now begin developing your analysis plan. How should you go about doing this? In the video below, I advocate an approach that centers on so-called “focal relationships.”

However, before you watch the video, you should read “More coffee, less crime” by Papachristos et al. Pay close attention, in particular, to the authors’ analysis. Here are some questions to guide you toward that end:

  • The authors’ analysis proceeds in several steps. How would you describe these different steps?
  • How do the authors familiarize readers with their data (i.e., trends and patterns)?
  • What kind of regression analysis do the authors perform? Read the details and write them down.
  • Why do the authors select this sort of regression? How do they justify their model selection?

Now watch the video on “Developing an analysis plan.”

Fixed effects regression using panel data

In “More coffee, less crime,” the authors analyze panel data. Panel data are data in which we observe the same units repeatedly over time. Many of you will analyze some type of panel data this quarter, and this raises some unique analytical opportunities and issues. I have therefore created a brief video on how to run so-called “fixed effects” regression when using panel data.

In the video, I use mainly R, but I also include some code for running fixed effects regression in Stata. I note at the outset that the example in the video is drawn from “Econometrics in R,” which is an excellent online resource for anyone who is interested. Here is a link ( to the example from which I have drawn.

Watch the video on “Intro. to panel data”. NOTE: On one slide, I incorrectly label the IV and DV. Throughout the regression examples, the correct DV is rate, which is the number of traffic deaths per 10,000 people in a given state-year. The IV is beertax, which is the tax on a case of beer (in 1988 dollars).
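For a rough sense of what fixed effects regression does, here is a minimal sketch in base R. It borrows the variable names rate and beertax from the video, but the data are simulated (an assumption for illustration), not the data used in the video. It shows that including a dummy for each state gives the same slope as regressing state-demeaned rate on state-demeaned beertax.

```r
# Simulated (hypothetical) state-year panel; NOT the data from the video
set.seed(3)
n_states <- 20
n_years  <- 7
panel <- expand.grid(state = factor(1:n_states), year = 1:n_years)
idx   <- as.integer(panel$state)

# Unobserved state-level trait that affects both beertax and rate
alpha <- rnorm(n_states)
panel$beertax <- 0.5 + 0.3 * alpha[idx] + runif(nrow(panel), 0, 0.5)
panel$rate    <- 2 + alpha[idx] - 0.6 * panel$beertax +
  rnorm(nrow(panel), sd = 0.2)

# Pooled OLS ignores the state-level trait
pooled <- lm(rate ~ beertax, data = panel)

# Fixed effects via one dummy variable per state
fe_dummy <- lm(rate ~ beertax + state, data = panel)

# Fixed effects via the "within" transformation: subtract state means
panel$rate_dm    <- panel$rate    - ave(panel$rate,    panel$state)
panel$beertax_dm <- panel$beertax - ave(panel$beertax, panel$state)
fe_within <- lm(rate_dm ~ beertax_dm - 1, data = panel)

coef(pooled)["beertax"]        # pulled away from -0.6 by the omitted trait
coef(fe_dummy)["beertax"]      # close to the assumed true value of -0.6
coef(fe_within)["beertax_dm"]  # same slope as the dummy-variable version
```

The two fixed effects slopes match, but the standard errors from the demeaned regression are not adjusted for the estimated state means, so dedicated panel routines are generally preferable for inference.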

Module 5 R Exercise: Maps in R

This week’s R exercise centers on map making. We have of course just seen some really excellent examples of how maps can help to support a claim in the Papachristos et al. reading. Download the RMD file from Canvas along with the accompanying .xlsx data on chicago_crime. Complete the RMD and knit your file to PDF. Note that in order to complete the exercise, you must first complete the first portion of the Module 4 R exercise. Specifically, you should complete all steps up to the merger and then save your cleaned GDP per capita data as a .csv file. You will load this dataset into R as part of this week’s exercise.

Module 4: Concepts and measures


Module 4 reviews some key ideas surrounding concepts and measures. In particular, we lay out some attributes of good concepts and measures and examine how these attributes play out in real-world research. We then examine best practices for translating our concepts into measures. Finally, we dissect “Good cop, bad cop” by Michelle Pautz as a means of examining what these best practices actually “look like” in practice.


  • Lay out the attributes of good concepts and measures in the social sciences.
  • Learn some strategies for translating latent concepts into observable measures and begin to implement them in a revised research design.
  • Critically analyze social science scholarship for the clarity, validity, and reliability of their concepts and measures.
  • Practice giving and receiving critical and constructive peer feedback during team meetings.
  • Reflect on “tidy data” and brush up on your data tidying skills via the Module 4 R exercise.


  • Thu. / Fri. @ 8pm CST. Post your Q & A about the week’s reading at Ed Discussion.
  • Fri. @ 8pm CST. Submit your revised research design. You should submit your design to Canvas as well as post it as an attachment to your team’s thread at Ed Discussion. I have provided some guidelines for the revised design below. Note that you should read your team members’ revised designs over the weekend and then post written comments for each one by 8pm on Mon. of Week 5.

Revised research design proposal

Your revised research design should be 1-2 pages long, single-spaced. It should clearly lay out (1) a research question and its justification, (2) an answer to the question, (3) the data that you will analyze and why, and (4) an analysis plan and its justification.

In general, your revised design should be more focused and detailed than your initial submission, and it should incorporate relevant feedback from me and your peers. You should also think about what you have learned about good research questions, theory, and concepts and measurement in Modules 1 through 4; you should incorporate these lessons where applicable. Finally, you should also be learning the state of the art surrounding your topic. This should be reflected in how you motivate your research question and in your elucidation of your answer.

I will be marking for completeness, clarity, and thoughtfulness. It is not necessary that you anticipate and address every problem and/or objection. Much of that is still ahead of you. But it does mean – as I mention above – demonstrating that your ideas are developing and becoming more focused, and that you are being attentive to feedback and the course material.

The module

Theories, hypotheses, and arguments review

Before we dive into the material surrounding concepts and measures, watch the two short videos below. The first addresses a couple of key points to keep in mind as you revise your research design. The second reviews key points from Module 3 on “Theories, hypotheses, and arguments.”

Attributes of concepts and measures & Translating concepts to measures

Module 4 focuses on concepts and measures – specifically, attributes of good concepts and measures as well as how we can actually go about translating latent concepts into observed measures.

To get started, read the Chapter by Bernhard Miller on “Making measures capture concepts.” When you have finished reading, watch the videos on “Attributes of concepts and measures” and “Translating concepts to measures.”

Anatomy of a measure: “Cops on Film” by Michelle Pautz

You should now read “Cops on Film” by Michelle Pautz. Given this week’s topic, you should focus on the concepts laid out in the article as well as how Pautz goes about measuring them. In particular, think about your answers to the following questions:

  • What is the puzzle / question, and how does Pautz motivate it?
  • What concept does Pautz aim to measure?
  • How clear is the concept? How does Pautz define it?
  • How closely does the measure match the concept?
  • Would you recognize a good / bad cop if you saw one in film?

We will examine some of these questions in the video, but the video is really meant to be a continuation of the thought experiment that I raised at the end of the last video. That is, I try to show how thinking about the theoretical scale of our latent concept can help us generate and/or locate a more valid measure of the concept.

Now watch the video on “Anatomy of a measure.”

R exercise on Tidy data

This Module’s R exercise (along with the Module 3 exercise) may be the most useful. The reason is that data cleaning and transforming comprise the bulk of data analysis. And many of you will soon be doing (and may already be doing) lots of data tidying as you obtain, merge, and transform data for your own research projects. Accordingly, I highly encourage you to complete the exercise if you can – the skills and syntax you develop will be useful later on. And even if you are unfamiliar with R, reflecting on what we want our data to look like as well as the steps we can take to convert them into that form will pay dividends later on.
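To illustrate what "tidy" means in practice, here is a minimal sketch using base R's reshape() on a tiny made-up dataset (the countries and GDP values are hypothetical). It converts a "wide" table with one column per year into a "long" table with one row per country-year. The exercise itself may use different tools; tidyr's pivot_longer() does the same job in the tidyverse.

```r
# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(
  country  = c("A", "B"),
  gdp_2019 = c(100, 200),
  gdp_2020 = c(110, 190)
)

# Reshape to long (tidy) form: one row per country-year
long <- reshape(
  wide,
  direction = "long",
  varying   = c("gdp_2019", "gdp_2020"),  # columns to stack
  v.names   = "gdp",                      # name of the stacked value column
  timevar   = "year",                     # name of the new time column
  times     = c(2019, 2020),              # values for the time column
  idvar     = "country"
)

# Sort by country and year, and drop the auto-generated row names
long <- long[order(long$country, long$year), ]
rownames(long) <- NULL
long
```

In the long form, each variable (country, year, gdp) has its own column and each observation has its own row, which is the layout that most plotting and regression functions expect.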

Download the Module 4 R Exercise from Canvas and complete it. You are not required to submit it for a grade.

Module 3: Theory, hypotheses, and arguments


Module 3 builds on Module 2 by examining theory, hypotheses, and arguments. We distinguish between these concepts but also reflect on how they work together within a final research report: In particular, we focus on the linkage between theory (our answer / logic) and hypotheses (the empirical implications of our theory). We examine some common pitfalls that crop up whenever we find ourselves working with imperfect data as well as some strategies for avoiding them.


  • Differentiate theory, hypotheses, and arguments.
    • Explain how these concepts work together within a research report.
  • Identify common pitfalls in linking theory and hypotheses and explain how we can avoid them.
  • Analyze the linkage between theory and evidence as it crops up in Lessing and Willis.
  • Recall and practice the basics of cleaning and transforming data using dplyr.
  • Reflect on how to give substantive and constructive peer feedback during the week’s team meeting.


  • Mon. @ 8pm CST. Write 1-2 paragraphs of feedback on each of your team members’ initial research designs. Post your comments as an attachment beneath your team’s sub-heading at Ed Discussion. If you are the first member of your team to post, please create a new thread.
  • Thu. / Fri. @ 8pm CST. Post your Q & A about the week’s reading at Ed Discussion.

Theory, hypotheses, and arguments

Before diving into the week’s concepts, let’s briefly review some of the concepts that we examined in Module 2 on “Research areas, topics, and questions.” Toward that end, watch the video on “Recap of Module 2.”

Now that we have examined research questions, let’s turn our attention to theories, hypotheses, and arguments. Before you watch the videos below on this topic, read the Chapter by Bryans et al. on “Explaining the Social World.” As you read, think about your answers to the following questions. Doing so will prepare you to better engage the video content and, hopefully, assist you as you progress in your research:

  • What is theory? What are hypotheses? How do these concepts go hand-in-hand?
    • Think about your own research proposal: Did it have a theory (or just hypotheses)? What is your theory?
  • How did you develop your theory? Did you mainly use induction or deduction?
  • What assumptions do you make in your theory?
    • If your theory is correct, what do you expect to observe in the world? In other words, what is the best evidence you could find that would tell you that your theory is correct?

Now watch the videos on “Theory, hypotheses, and arguments” and “Why theory?”

The next video aims to get you thinking critically about how we can effectively link theory to hypotheses: “If your theory is correct, what should we observe in the world?” Doing this well is not easy, in part because we rarely find perfect evidence in support of our theories.  When this happens, it’s easy to step into some common pitfalls rather than take the more appropriate step of gathering additional evidence or modifying our theory to ensure it is useful despite empirical limitations. The next video examines some of these common pitfalls as well as some strategies for avoiding them. Be prepared to pause the video throughout, as this video is meant to be a sort of exercise in “spotting the fallacy.”

You will now read “Legitimacy in Criminal Governance” by Lessing and Willis. I selected the article not only because it is one of the most interesting academic articles you will ever read but also because it exemplifies the meme that I presented in the video on “Why theory?” above. That is, Lessing and Willis help us to interpret/connect/make sense of some rich data that I think are extremely puzzling.

Before you read, think about your intuitions regarding the following questions:

  • How do you think criminal organizations (e.g., mafias and drug gangs) keep their members in line?
  • If your answer (theory) is correct, what would you expect to find if you could examine the internal working of one of these criminal organizations?

Now read the article. When you have finished, watch the video on “Anatomy of an argument.”

R exercise on tidy data and dplyr review

If you are completing the R exercises each week, be sure to download the exercise for Module 3 from Canvas. This week’s exercise is meant to help you brush up on your data cleaning and transforming skills using the dplyr package. While this is less fun than the webscraping we did last week, it is probably more useful, as most of you will have to spend some time cleaning, transforming, and merging any data that you obtain for use in your final research report.

Module 2: Research areas, topics, and questions


Module 2 examines research questions: in particular, the attributes of a good research question, and a general process we can follow to move from a broad research area to a more specific topic and, finally, a focused question. The module will ask you to reflect on research questions as they crop up in real social science research. Finally, the week’s R exercise will guide you as you learn about and practice web scraping.


  • Grasp the attributes of a good research question
  • Grasp the process for developing a good research question and begin to implement it
  • Examine the development of a research question by engaging a real-world example
  • Examine and practice web scraping in R

Due this week

  • Thu and Fri @ 8pm CST: Ed discussion Q & A.
  • Fri @ 8pm CST: Proposed research design. Submit via the Assignments section at Canvas. Also post as an attachment beneath your team’s heading at Ed Discussion. NOTE: You should read each other’s proposals and prepare 1-2 paragraphs of feedback for each one. You will be required to post your written feedback to Ed Discussion by Mon. 4/12 @ 8pm (Week 3).

Some guidance on research designs

Your proposed research design should be a roughly 1-page, single-spaced document. It should do three things: (1) lay out a research question and justify it; (2) delineate a brief answer to the question; and (3) discuss the sort of data you will need in order to answer the question and why.

The readings and videos for this week are meant to help you to develop your research question and design. So, if you can, try to complete the reading and watch the videos relatively early in the week.

I want to preview a couple of key points from the reading and videos here as a means of reassuring and guiding you. In particular:

  • Your research question and design will evolve. In fact, it will likely continue to evolve even as you conduct your analysis and write up your results toward the end of the quarter. (I am still constantly revising the design document for my own dissertation as it becomes a book!) This can be frustrating, but it is inevitable. You will be moving in and out of the literature and data surrounding your question in upcoming weeks, and in doing so, you will get a much better idea about what makes a “good” question as well as the sorts of questions you can examine given the available data.
  • For part 3, it is not necessary that you have specific data / datasets in mind. You can instead approach this as a thought experiment: Given your question and tentative answer, what are your ideal data and why? What sort of measures would you obtain? What would the units of analysis be? Would the data be cross-sectional, panel / time-series, or something else? Are the data likely to derive from an experiment, survey, or machine-learning algorithm, or are they likely to have been hand-coded by scholars? The more you think through these and related questions, the better. Thinking carefully about your ideal data now will help you to identify workable data later.

The Module

Attributes of good questions and developing your question

First read the chapter on “Beginning the research process” by Buttolph Johnson. As you read, think about your answers to the following questions. You don’t need to write your answers down, but thinking about them will help you to develop your own research question:

  • What are some attributes of a good research question?
  • How does one actually go about developing a good research question?
  • How does the literature review go hand in hand with the development of a good question?

Once you have completed the Buttolph Johnson chapter, watch the three videos below on “Research areas, topics, and questions”. Note that you will need to enlarge the videos in order to view them.

Anatomy of a research question

Now read the article by Albertus and Deming on “Branching out.” You should read the entire article. In your reading, you might try to implement some of the advice laid out by Dane’s chapter on “Reading and structuring research” in Module 1. In addition, in line with this week’s overarching topic, here are some questions to think about as you read. To the degree that there is time, we may discuss some of these questions during our team meetings this week:

  • What is the central question that the authors ask?
  • What have other scholars found in terms of answers to this question (or very similar questions)?
  • Given that the question is not really new, what contribution – if any – do you think that the authors make?
  • What sort of data would you want – in theory – in order to answer the question posed by the authors?
  • What data do the authors actually use and how does it differ from the ideal data that you described above?
  • How do the authors justify the data that they use?

Once you have read the article, watch the video on “Anatomy of a research question.”

R exercise on web scraping

Remember that you are not required to complete and submit the weekly R exercises for a grade. They are a voluntary tool for introducing you to some operations in R that you may find useful this quarter and/or in the future. This week’s exercise introduces you to one procedure for scraping data from web pages written in HTML. I am not an expert on this topic, so any data scientists with experience in web scraping should feel free to post additional resources and advice at Ed Discussion.

Note that for this exercise, you will need to add SelectorGadget to your web browser. I use Chrome and have installed the SelectorGadget extension from Chrome’s Web Store ( For other browsers, you can simply drag the SelectorGadget bookmarklet to your bookmarks toolbar. You can find the bookmarklet at

I have included a very brief video below that introduces the selector tool as it applies to this week’s R exercise.