Module 6: What kind of data analysis? (Part 2)

Overview

Module 6 is the second part of a 2-part series on data analysis. Whereas Module 5 focused mainly on assessing the credibility and relevance of data, Module 6 focuses on developing an analysis plan. We examine how a good analysis plan is linked to argument as well as how this plays out in the reading on Queens by Dube and Harish. We then examine some key considerations in developing an analysis plan.

Objectives

  • Explain how data analysis is linked to argument.
  • Examine the three main “links” of an analysis plan and how they crop up in social science research.
  • Map out an analysis plan based on your argument, focal relationship variables’ type, and data structure.
  • Use your own data to create 1-2 visualizations that motivate your question or support your argument.

Due

  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.

Module 6

Review of Module 5

To get started, watch the video on “Module 5 review.”

Analysis plans as argument

Whereas Module 5 focused on assessing the credibility and relevance of our data, Module 6 focuses on analysis plans: How to conceptualize them, what they look like in practice, and how we can go about developing them.

A running theme throughout the Module is that good data analysis is an extension of an argument: It should be tightly linked to the mechanisms that we lay out in our argument. The next video introduces this idea.

Now watch the video on “Analysis plans as argument.”

Anatomy of a plan: “Queens”

How do these three “links” play out in practice? What do they look like in scholarship? In a moment, I will ask you to read “Queens”, by Dube and Harish. I’ve selected this article for two reasons. For one, it tackles a fascinating question: Are states ruled by women less prone to conflict than those ruled by men? In addition, the article nicely exemplifies the three links that we examined above. Accordingly, as you read, try to answer the following questions:

  • What are the focal relationship variables?
  • What are the mechanisms that link the variables?
  • What are some observable implications of the mechanisms? Can you think of some that the authors do not address?
  • How is the analysis linked to the mechanisms?

We will tackle some of these questions in the video below, but you should try to answer them for yourself before watching. Now read “Queens” by Dube and Harish. When you have finished reading, watch the video on “Anatomy of a plan: Queens.”

Developing your analysis plan

So far, we have examined how data analysis is linked to argument. But there are a couple of other considerations to keep in mind as we develop our analysis plans. We briefly examine these considerations in the final video. As you watch, think about your own data in some detail: What type of variables are in your focal relationship? What is the overall structure of your data? What are the units of analysis?

The reason to keep your answers to these questions in mind is that, ultimately, they should inform your analysis. For instance, in Module 5, Papachristos et al. used so-called Poisson regressions rather than OLS because their dependent variable was a “count” variable that took the value 0 for many observations. Meanwhile, in Module 2, Albertus and Deming used fixed effects regression analysis because they were using panel data and worried that omitted country-level variables might otherwise bias their coefficient estimates.
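To make the Poisson example concrete, here is a minimal sketch using simulated, hypothetical data (the variable names `crimes` and `cafes` are invented for illustration and are not the authors' actual data) of why a count dependent variable with many zeros calls for Poisson regression rather than OLS:

```r
# Hypothetical illustration: a count DV (e.g., crimes per area-year) with
# many zeros, modeled with OLS versus Poisson regression
set.seed(42)
dat <- data.frame(
  crimes = rpois(200, lambda = 1),        # count variable; many values are 0
  cafes  = runif(200, min = 0, max = 10)  # hypothetical IV
)

# OLS treats the DV as continuous and unbounded
ols_fit <- lm(crimes ~ cafes, data = dat)

# Poisson regression models the DV as a non-negative count
pois_fit <- glm(crimes ~ cafes, family = poisson, data = dat)

summary(pois_fit)
```

Notice that the only change needed in R is swapping `lm()` for `glm()` with `family = poisson`; the judgment call is recognizing that your DV's type warrants it.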

In this vein, upon watching the video, it may be worth reviewing your notes on the readings from past Modules. In particular: What sort of analysis did different authors perform? How does their analysis plan seem to be shaped by data and variable considerations?

Now watch the video on “Developing your analysis plan.”

Stata and R Exercise: Visualizing your data

The Module 6 R exercise is intended to get you thinking about the first 1-2 pieces of your final data analysis, which will very likely consist of some kind of table, plot, map, or other visualization. If you are a Stata user, you should complete the exercise in Stata. Stata has a user-friendly interface for generating simple plots such as histograms, scatterplots, and barplots. Note that in order to complete the exercise, you will need to have your data in hand. You do not have to submit your completed exercise.

Before completing the exercise, let’s reflect on the properties of good data visualization by reading “Aesthetics and technique in data graphical design” by Tufte. Go ahead and read the chapter.

When you have finished reading, follow the instructions that I have pasted below. If you are completing the Module exercises in R, note that these instructions are shown in the RMD file:

Exercise instructions

Be sure to read Tufte’s chapter on “Aesthetics and technique in data graphical design” before you complete this exercise. Think about Tufte’s advice about how to create an impactful graphic and try to implement it below. In particular, label your graphics, use nice colors, and tell a story.

A: Univariate description

Create two visualizations of the univariate distribution of your two main variables – that is, your focal relationship variables. Think of this as presenting your data to your readers. Be sure to consider your variables’ types and select your visualization type accordingly (e.g., don’t create a histogram for an indicator variable; use a barplot instead).
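As a sketch of what this might look like, the following uses simulated, hypothetical variables (a continuous `income` variable and a 0/1 `treated` indicator, both invented for illustration) to show one appropriate base-R visualization per variable type:

```r
# Hypothetical variables: one continuous, one indicator (0/1)
set.seed(1)
income  <- rnorm(500, mean = 50000, sd = 12000)
treated <- rbinom(500, size = 1, prob = 0.4)

# Continuous variable: a histogram is appropriate
hist(income,
     main = "Distribution of annual income",
     xlab = "Annual income (USD)",
     col  = "steelblue")

# Indicator variable: a barplot of category counts, not a histogram
barplot(table(treated),
        names.arg = c("Control", "Treated"),
        main = "Treatment status",
        col  = "darkorange")
```

Remember Tufte's advice as you adapt this to your own data: label your axes, pick readable colors, and make sure each graphic tells a story.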

B: Create a bivariate graph

Create a visualization that begins to capture your theory about your focal relationship. That is, create a visualization of the association between your two main variables of interest. Think of this step as presenting your story (argument) to your readers.

Extra:

Using your data, create the ugliest and most useless graphic you can imagine. What makes it bad / useless?

Module 5: What kind of data analysis? (Part 1)

Overview

Module 5 examines data and data analysis. We examine how to evaluate the relevance and credibility of secondary data for our research as well as justify our data for readers. Module 5 also introduces the concept of focal relationships and examines how we can use focal relationships to guide the development of our data analysis plan. We examine how using focal relationships to guide our data analysis plays out in “More coffee, less crime” by Papachristos et al. Finally, the Module introduces regression analysis with panel data.

Objectives

  • Evaluate the relevance and credibility of secondary data for your independent research; justify your data for readers.
  • Explain the concept of a focal relationship and how it should guide our data analysis (and all other aspects of our research).
  • Grasp the intuition behind fixed effects (FE) regression when analyzing panel data. Implement a simple FE regression in R and/or Stata.

Due

  • Thu/Fri @ 8pm. Post your Q & A about the week’s reading at Ed Discussion.

Module

Recap. of Module 4

Before we dive into Module 5 material, watch the video on “Recap. of Module 4.”

Evaluating and justifying your data

Most of you will analyze secondary data this quarter. Secondary data are data that you do not personally collect via surveys, interviews, and so forth. They are instead collected by someone else for some other study, and you are simply re-purposing them for your own study. You must therefore scrutinize the data: Are they credible and relevant for your purposes? What are their limitations? You must also answer these questions for your readers in your final research report.

The first video examines how to go about performing this sort of assessment. Watch the video on “Assessing and justifying your data.”

Using focal relationships to develop your analysis plan

Once you have obtained some (relevant and credible) data, you can now begin developing your analysis plan. How should you go about doing this? In the video below, I advocate an approach that centers on so-called “focal relationships.”

However, before you watch the video, you should read “More coffee, less crime” by Papachristos et al. Pay close attention, in particular, to the authors’ analysis. Here are some questions to guide you toward that end:

  • The authors’ analysis proceeds in several steps. How would you describe these different steps?
  • How do the authors familiarize readers with their data (i.e., trends and patterns)?
  • What kind of regression analysis do the authors perform? Read the details and write them down.
  • Why do the authors select this sort of regression? How do they justify their model selection?

Now watch the video on “Developing an analysis plan.”

Fixed effects regression using panel data

In “More coffee, less crime,” the authors analyze panel data. Panel data are data in which we observe the same units repeatedly over time. Many of you will analyze some type of panel data this quarter, and this raises some unique analytical opportunities and issues. I have therefore created a brief video on how to run so-called “fixed effects” regression when using panel data.

In the video, I mainly use R, but I also include some code for running fixed effects regression in Stata. I note at the outset that the example in the video is drawn from “Econometrics in R,” which is an excellent online resource for anyone who is interested. Here is a link (https://www.econometrics-with-r.org/10-1-panel-data.html) to the example from which I have drawn.

Watch the video on “Intro. to panel data”. NOTE: On one slide, I incorrectly label the IV and DV. Throughout the regression examples, the correct DV is rate, which is the number of traffic deaths per 10,000 people in a given state-year. The IV is beertax, which is the tax on a case of beer (adjusted for 1988 dollars).
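The example above can be sketched in R using the Fatalities data from the AER package, which is the dataset used in the linked “Econometrics with R” chapter. The specification below is my own minimal sketch, not necessarily the exact one in the video:

```r
# Fixed effects sketch with the Fatalities panel data (AER package)
library(AER)
data(Fatalities)

# DV: traffic deaths per 10,000 people in a given state-year
Fatalities$rate <- Fatalities$fatal / Fatalities$pop * 10000

# Pooled OLS ignores unobserved, time-invariant state differences
pooled <- lm(rate ~ beertax, data = Fatalities)

# State and year fixed effects: factor() generates the dummy variables,
# absorbing time-invariant state traits and nationwide shocks
fe <- lm(rate ~ beertax + factor(state) + factor(year), data = Fatalities)

coef(summary(fe))["beertax", ]
```

If you work in Stata, the analogous approach would be along the lines of `xtset state year` followed by `xtreg rate beertax i.year, fe`.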

Module 5 R Exercise: Maps in R

This week’s R exercise centers on map making. We have of course just seen some really excellent examples of how maps can help to support a claim in the Papachristos et al. reading. Download the RMD file from Canvas along with the accompanying .xlsx data on chicago_crime. Complete the RMD and knit your file to PDF. Note that in order to complete the exercise, you must first complete the first portion of the Module 4 R exercise. Specifically, you should complete all steps up to the merger and then save your cleaned GDP per capita data as a .csv file. You will load this dataset into R as part of this week’s exercise.

Module 4: Concepts and measures

Overview

Module 4 reviews some key ideas surrounding concepts and measures. In particular, we lay out some attributes of good concepts and measures and examine how these attributes play out in real-world research. We then examine best-practices for translating our concepts into measures. Finally, we dissect “Good cop, bad cop” by Michelle Pautz as a means of examining what these best-practices actually “look like” in practice.

Objectives

  • Lay out the attributes of good concepts and measures in the social sciences.
  • Learn some strategies for translating latent concepts into observable measures and begin to implement them in a revised research design.
  • Critically analyze social science scholarship for the clarity, validity, and reliability of its concepts and measures.
  • Practice giving and receiving critical and constructive peer feedback during team meetings.
  • Reflect on “tidy data” and brush up on your data tidying skills via the Module 4 R exercise.

Due

  • Thu. / Fri. @ 8pm CST. Post your Q & A about the week’s reading at Ed Discussion.
  • Fri. @ 8pm CST. Submit your revised research design. You should submit your design to Canvas as well as post it as an attachment to your team’s thread at Ed Discussion. I have provided some guidelines for the revised design below. Note that you should read your team members’ revised designs over the weekend and then post written comments for each one by 8pm on Mon. of Week 5.

Revised research design proposal

Your revised research design should be 1-2 pages long, single-spaced. It should clearly lay out (1) a research question and its justification, (2) an answer to the question, (3) the data that you will analyze and why, and (4) an analysis plan and its justification.

In general, your revised design should be more focused and detailed than your initial submission, and it should incorporate relevant feedback from me and your peers. You should also think about what you have learned about good research questions, theory, and concepts and measurement in Modules 1 through 4; you should incorporate these lessons where applicable. Finally, you should also be learning the state of the art surrounding your topic. This should be reflected in how you motivate your research question and in your elucidation of your answer.

I will be marking for completeness, clarity, and thoughtfulness. It is not necessary that you anticipate and address every problem and/or objection. Much of that is still ahead of you. But it does mean – as I mention above – demonstrating that your ideas are developing and becoming more focused, and that you are being attentive to feedback and the course material.

The module

Theories, hypotheses, and arguments review

Before we dive into the material surrounding concepts and measures, watch the two short videos below. The first addresses a couple of key points to keep in mind as you revise your research design. The second reviews key points from Module 3 on “Theories, hypotheses, and arguments.”

Attributes of concepts and measures & Translating concepts to measures

Module 4 focuses on concepts and measures – specifically, attributes of good concepts and measures as well as how we can actually go about translating latent concepts into observed measures.

To get started, read the Chapter by Bernhard Miller on “Making measures capture concepts.” When you have finished reading, watch the videos on “Attributes of concepts and measures” and “Translating concepts to measures.”

Anatomy of a measure: “Cops on Film” by Michelle Pautz

You should now read “Cops on Film” by Michelle Pautz. Given this week’s topic, you should focus on the concepts laid out in the article as well as how Pautz goes about measuring them. In particular, think about your answers to the following questions:

  • What is the puzzle / question, and how does Pautz motivate it?
  • What concept does Pautz aim to measure?
  • How clear is the concept? How does Pautz define it?
  • How closely does the measure match the concept?
  • Would you recognize a good / bad cop if you saw one in film?

We will examine some of these questions in the video, but the video is really meant to be a continuation of the thought experiment that I raised at the end of the last video. That is, I try to show how thinking about the theoretical scale of our latent concept can help us generate and/or locate a more valid measure of the concept.

Now watch the video on “Anatomy of a measure.”

R exercise on Tidy data

This Module’s R exercise (along with the Module 3 exercise) may be the most useful. The reason is that data cleaning and transforming comprise the bulk of data analysis. And many of you will soon be doing (and may already be doing) lots of data tidying as you obtain, merge, and transform data for your own research projects. Accordingly, I highly encourage you to complete the exercise if you can – the skills and syntax you develop will be useful later on. And even if you are unfamiliar with R, reflecting on what we want our data to look like as well as the steps we can take to convert them into that form will pay dividends later on.

Download the Module 4 R Exercise from Canvas and complete it. You are not required to submit it for a grade.

Module 2: Research areas, topics, and questions

Overview

Module 2 examines research questions: in particular, the attributes of a good research question, and a general process we can follow to move from a broad research area to a more specific topic and, finally, a focused question. The Module will also ask you to begin developing a research question of your own as you draft your proposed research design.

Objectives

  • Grasp the attributes of a good research question
  • Grasp the process for developing a good research question and begin to implement it
  • Examine the development of a research question by engaging a real-world example
  • Examine and practice web scraping in R

Due this week

  • Thu and Fri @ 8pm CST: Ed discussion Q & A.
  • Fri @ 8pm CST: Proposed research design. Submit via the Assignments section at Canvas. Also post as an attachment beneath your team’s heading at Ed Discussion. NOTE: You should read each other’s proposals and prepare 1-2 paragraphs of feedback for each one. You will be required to post your written feedback to Ed Discussion by Mon. 4/12 @ 8pm (Week 3).

Some guidance on research designs

Your proposed research design should be a roughly 1-page, single-spaced document. It should do three things: (1) lay out a research question and justify it; (2) delineate a brief answer to the question; and (3) discuss the sort of data you will need in order to answer the question and why.

The readings and videos for this week are meant to help you to develop your research question and design. So, if you can, try to complete the reading and watch the videos relatively early in the week.

I want to preview a couple of key points from the reading and videos here as means of reassuring and guiding you. In particular:

  • Your research question and design will evolve. In fact, it will likely continue to evolve even as you conduct your analysis and write up your results toward the end of the quarter. (I am still constantly revising the design document for my own dissertation as it becomes a book!) This can be frustrating, but it is inevitable. You will be moving in and out of the literature and data surrounding your question in upcoming weeks, and in doing so, you will get a much better idea about what makes a “good” question as well as the sorts of questions you can examine given the available data.
  • For part 3, it is not necessary that you have specific data / datasets in mind. You can instead approach this as a thought experiment: Given your question and tentative answer, what are your ideal data and why? What sort of measures would you obtain? What would the units of analysis be? Would the data be cross-sectional, panel / time-series, or something else? Are the data likely to derive from an experiment, survey, or machine-learning algorithm, or are they likely to have been hand-coded by scholars? The more you think through these and related questions, the better. Thinking carefully about your ideal data now will help you to identify workable data later.

The Module

Attributes of good questions and developing your question

First read the chapter on “Beginning the research process” by Buttolph Johnson. As you read, think about your answers to the following questions. You don’t need to write your answers down, but thinking about them will help you to develop your own research question:

  • What are some attributes of a good research question?
  • How does one actually go about developing a good research question?
  • How does the literature review go hand in hand with the development of a good question?

Once you have completed the Buttolph Johnson chapter, watch the three videos below on “Research areas, topics, and questions”. Note that you will need to enlarge the videos in order to view them.

Anatomy of a research question

Now read the article by Albertus and Deming on “Branching out.” You should read the entire article. In your reading, you might try to implement some of the advice laid out by Dane’s chapter on “Reading and structuring research” in Module 1. In addition, in line with this week’s overarching topic, here are some questions to think about as you read. To the degree that there is time, we may discuss some of these questions during our team meetings this week:

  • What is the central question that the authors ask?
  • What have other scholars found in terms of answers to this question (or very similar questions)?
  • Given that the question is not really new, what contribution – if any – do you think that the authors make?
  • What sort of data would you want – in theory – in order to answer the question posed by the authors?
  • What data do the authors actually use and how does it differ from the ideal data that you described above?
  • How do the authors justify the data that they use?

Once you have read the article, watch the video on “Anatomy of a research question.”

R exercise on web scraping

Remember that you are not required to complete and submit the weekly R exercises for a grade. They are a voluntary tool for introducing you to some operations in R that you may find useful this quarter and/or in the future. This week’s exercise introduces you to one procedure for scraping data from web pages written in HTML. I am not an expert on this topic, so any data scientists with experience in web scraping should feel free to post additional resources and advice at Ed Discussion.

Note that for this exercise, you will need to add SelectorGadget to your web browser. I use Chrome and have installed the SelectorGadget extension from Chrome’s Web Store (https://tinyurl.com/298y44yt). For other browsers, you can simply drag the SelectorGadget bookmarklet to your bookmarks toolbar. You can find the bookmarklet at selectorgadget.com.

I have included a very brief video below that introduces the selector tool as it applies to this week’s R exercise.

Module 8: Multivariate regression

Overview

Module 8 builds directly from the concepts that we examined in Module 7 on bivariate regression. It shows how omitted variable bias can undermine our regression results and demonstrates how multivariate regression, when well done, can help to mitigate bias. It lays out the logic of multivariate regression and helps students to begin running, interpreting, and using multivariate regression for prediction.

Objectives

  • Explain what omitted variable bias is as well as how multivariate regression can help us to mitigate it.
  • Build and run multivariate regression models in R.
    • Explain the logic behind your IV, DV, and controls.
    • Interpret regression results.
    • Present regression results in publishable tables.
    • Use a regression model for prediction.

Assignment

Exercise 4 is due by 11:59pm CST on 3/08. Be sure to download the RMD for Exercise 4 from Canvas and review it early on in the week so that you can plan your time accordingly.

Module: Multivariate regression

Omitted variable bias

First, read OIS Sections 8.1 and 8.2 on “Multiple Regression”. Then watch the video series on “Multivariate Regression.” As with last week’s Module, I have divided what would normally be a single lecture into a series of short videos on different subtopics within multivariate regression.

Begin by watching the video on “Omitted variable bias: Intuition.”

Now watch the video on “Omitted variable bias: The mechanics.” Note that this video uses maths to demonstrate the intuition that I laid out in the first video. The maths are limited to linear algebra, so I encourage you to follow along with the different steps as best as possible. I happen to think that this is one instance in which the maths do help us toward a better grasp of the concept.

Multivariate regression

Now that we have examined omitted variable bias (OVB), how can multivariate regression help to mitigate it? We examine that question in the next video. Note that in this video and those that follow, I will walk you through some examples using the CASchools data that are contained in the AER package in R. You can follow along. Simply use the code below to install and load the data as well as create our dependent and independent variables:

# Load the AER package and the CASchools data
require(AER)
data(CASchools)

# Dependent variable: average of reading and math scores
CASchools$score <- (CASchools$read + CASchools$math)/2

# Independent variable: student-teacher ratio
CASchools$STR <- CASchools$students/CASchools$teachers

# Our simple bivariate model
model <- lm(score ~ STR, data = CASchools)

Now watch the videos on “Multivariate regression: The logic” and “Multivariate regression: Prediction”.
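If you would like to experiment before the practice questions, here is one possible extension of the simple model: adding a single control. The choice of control here (`english`, the percent of English learners in the CASchools data) is my own illustration, not necessarily the specification used in the video:

```r
# Extend the simple CASchools model with one control variable
require(AER)
data(CASchools)

CASchools$score <- (CASchools$read + CASchools$math)/2
CASchools$STR   <- CASchools$students/CASchools$teachers

# Bivariate model vs. a multivariate model with one control
model_simple <- lm(score ~ STR, data = CASchools)
model_mv     <- lm(score ~ STR + english, data = CASchools)

# Compare the STR coefficient across the two specifications
c(simple = coef(model_simple)["STR"], controlled = coef(model_mv)["STR"])
```

Watching how the STR coefficient changes when a plausible confounder enters the model is a useful way to build intuition for OVB.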

One of the last points I make in the video on “Prediction” is that we should avoid so-called “kitchen sink” or “garbage can” regression models that contain lots of control variables not rooted in good theory. So what criteria should guide our control strategy? Let’s see what other scholars suggest.

First read Chapter 4 from John Martin’s book on “Thinking through statistics.” What strategies does the author suggest? What are some of the pitfalls that can crop up as we embark on a control strategy, and how do we avoid them?

How do Martin’s arguments play out in actual scholarship? Earlier in the quarter, we read the first portion of Koch and Nicholson’s article on “Death and Turnout.” You should now read the remainder of the article. Here are some questions to guide your reading and prepare you for the practice questions and exercise:

  • What is the authors’ dependent variable?
  • What is their main independent variable?
  • What is the expected direction of the relationship between IV and DV?
  • What are some variables that may confound this relationship?
  • What variables do the authors control for and why? (Focus on their discussion of Table 4 on p. 942).
  • What do you think about these controls? Do you buy the authors’ reasoning? Why or why not?
  • Finally, pay close attention to the authors’ interpretation of their results. Note the specific language they use to discuss statistical and substantive significance. (As above, focus on their discussion of Table 4).

Practice questions and exercise

Now complete the Module 8 practice questions. These are short and straightforward; I have done much of the coding for you as a means of walking you through running and presenting multivariate regression in R. But note that Exercise 4 builds directly from the practice questions, so be sure to work through them.

Once you have completed the practice questions, be sure to download and complete Exercise 4.

Module 7: Bivariate regression

Overview

In Module 7, we begin conducting bivariate analysis. Toward this end, the Module introduces various tools for examining linear relationships between variables and testing them for statistical significance. In particular, the Module moves from measures of joint fluctuation such as covariance and correlation to bivariate linear regression.

Objectives

  • Calculate covariance and correlation as well as explain what these measures capture conceptually.
  • Explain how the OLS regression line helps us model the relationship between two variables.
  • Perform bivariate OLS regression in R and interpret the regression output.

Assignment

Assignment 3 is due at 11:59pm CST on Monday, 3/01. The Assignment 3 RMD file can be downloaded at the “Files/Exercises and Assignments” section at Canvas. Be sure to download and review the file early in the week so that you can plan your time accordingly.

Module

The video lectures for this week kick off with a very brief review of some of the assumptions that power the Central Limit Theorem. Watch the video on “CLT assumptions revisited” (I note that this title differs from the text shown on the title slide of my presentation.)

Let’s now turn to the central topic for this week’s Module: bivariate analysis. Note that so far, we have been performing what’s known as “univariate” analysis. This means that we have been examining single variables rather than relationships between variables. For instance, in Module 6, we focused primarily on the mean of a single variable such as annual income. (Recall our running Somersville example.) We then used hypothesis tests to assess whether some observed value of our sample mean provided evidence to refute some hypothesized value of the underlying population mean.

For the remainder of the quarter, we turn our attention to “bivariate” and “multivariate” analysis. That is, we will examine relationships between two or more variables. Ultimately, the basic procedures for such analysis are very similar to those we have used up until now: We will use sample statistics to estimate population parameters; we will calculate the typical error associated with our estimates; we will construct CIs; we will calculate test statistics; and we will perform hypothesis tests.

Measures of joint fluctuation

Start by reading Imai, K. (2017). Quantitative Social Science: An Introduction. Read Sections 3.6, 4.2.1, and 4.2.2. The PDF is available at the “Files” section at Canvas.

Now watch the two videos on “Correlation” (Introduction and Significance Tests, respectively).
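As a quick sketch of these measures with simulated, hypothetical data, the base-R functions below compute covariance and correlation and then test the correlation for statistical significance:

```r
# Simulated, hypothetical data for illustrating joint fluctuation
set.seed(7)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

cov(x, y)   # joint fluctuation, expressed in the variables' own units
cor(x, y)   # standardized to lie between -1 and 1

# Significance test of H0: the population correlation equals zero
cor.test(x, y)
```

Note the relationship between the two measures: correlation is just covariance rescaled by the two standard deviations, which is why it is unit-free.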

Bivariate regression: Introduction

Now read OIS Sections 7.1–7.4. Then, watch the video series on “Bivariate regression.” Note that I have divided what would normally be a single lecture into a series of short videos organized by sub-topic.

Bivariate regression: The regression line

Bivariate regression: Calculating regression coefficients

Bivariate regression: Prediction, tests of significance, and interpretation

Once you have completed your reading and watched the video lectures above, download and complete the Module 7 practice questions. They can be downloaded at the “Files/Practice Questions” section at Canvas.

Finally, be sure to download and complete Assignment 3, which is located at the “Files/Exercises and Assignments” section at Canvas.

Module 6: Hypothesis testing

Overview

Module 6 builds on Modules 4 and 5 on “Foundations for inference” by examining (univariate) hypothesis testing. We examine the t-distribution as a means of accounting for uncertainty in our estimates when our sample size is small or the population standard deviation is unknown. We then examine hypothesis tests for one and two populations using the t-distribution, and we begin to conduct such tests in R. Finally, we read and analyze one especially prominent and recent use of hypothesis testing in US politics.

Objectives

  • Explain what a t-distribution is and why we use it.
  • Define null hypothesis, alternative hypothesis, and type 1 and 2 errors.
  • Explain and put to practice the steps in conducting hypothesis testing.
    • State a null and alternative hypothesis
    • Calculate appropriate tests statistics
    • Calculate p-values
    • Compare p-values against different significance levels and interpret the results
  • Critically analyze real-life usages of hypothesis testing.

Assignment

Exercise 3 is due at 5pm CST on Monday, 2/22. The Exercise 3 RMD file is located at Canvas. Be sure to download and skim through the file early in the week so that you can plan your time accordingly.

Mid-course feedback

Please look for a Canvas announcement from me later in the week with a link to a Google Form that will allow you to provide me with anonymous, mid-course feedback. I am always looking for ways to improve my instruction and better meet your needs and aims, so please do take a moment to complete the Form once it is published.

Module

t-distribution

We’re going to begin by refining our grasp of standard errors and confidence intervals. In particular, we’re going to examine the t-distribution as a conservative alternative to the normal distribution that will allow us to account for additional uncertainty whenever our sample size is small or we estimate the population standard deviation using the sample standard deviation.

Now watch the video on the “t-distribution.”

(Univariate) Hypothesis testing

Now read OIS Sections 4.3 through 4.5. Note that the Syllabus lists some additional sections of OIS that are suggested but optional.

After you have completed the reading, watch the video on “Hypothesis testing 1.”

Now watch the video on “Hypothesis testing 2.”

I realize that the videos for this module are longer than usual, but I wanted to be sure to walk through some examples step-by-step so that you can begin conducting your own hypothesis tests in the practice questions and exercise.
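For a preview of what these tests look like in R, here is a minimal sketch using simulated, hypothetical income data (the variable names and the null value of 50,000 are invented for illustration):

```r
# Simulated, hypothetical income data for two groups
set.seed(123)
income_a <- rnorm(40, mean = 52000, sd = 8000)
income_b <- rnorm(40, mean = 48000, sd = 8000)

# One-sample test of H0: the population mean equals 50,000
t.test(income_a, mu = 50000)

# Two-sample test of H0: the two population means are equal
t.test(income_a, income_b)
```

In each case, the output reports the t statistic, the degrees of freedom, and the p-value, which you compare against your chosen significance level.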

Hypothesis testing in the 2020 US presidential election

I am now going to ask you to read an especially prominent and recent usage of hypothesis testing in US politics. Specifically, we are going to read excerpts from the State of Texas v. the Commonwealth of Pennsylvania, the State of Georgia, and the State of Michigan. If you are attentive to US politics and followed the aftermath of the most recent presidential election, you are probably already familiar with the case.

I should state at the outset that I selected this reading for its relevance. (What could possibly be more relevant?) I recognize that the reading centers on a potentially sensitive political issue. So, as we read, analyze, and discuss these passages, I ask that we do so critically but cordially, and that we limit our discussion to what is relevant to the course objectives.

Now read State of Texas v. Pennsylvania, Georgia, and Michigan.

Here are some discussion questions to accompany your reading. Please try to answer the questions upon completing the reading, and be prepared to discuss them when we meet in Week 7:

  • What is the author’s main claim (i.e., the takeaway of the statistical test)?
  • Describe the nature of the statistical tests conducted by the author:
    • What are the null and alternative hypotheses for each test?
    • Given what you learned about hypothesis testing above, and assuming that the author’s math is correct, what are the results of the tests? (Be precise here.)
  • How do you interpret the results of the tests? How does the author interpret them? What do you think about this interpretation?
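Statistical comparisons of proportions, like the ones you will encounter in the reading, can be run generically in R. Here is a sketch using purely hypothetical counts (these are NOT the figures from the case):

```r
# Hypothetical: a candidate's vote count in two samples of 10,000
# ballots each. H0: the two underlying proportions are equal.
res <- prop.test(x = c(5200, 5450), n = c(10000, 10000))
res
```

As you read, ask whether the assumptions behind a test like this (for instance, that the two samples are independent draws from comparable populations) actually hold in the author's setting.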

Practice Questions and Exercise

Once you have worked through the videos and reading, complete the Module 6 Practice Questions, which guide you through several hypothesis tests in R. Once you have finished the Practice Questions, be sure to complete Exercise 3.

Module 3: Know your data (and where it came from)!

Overview

Module 3 builds on the concepts surrounding data and measurement that we tackled last week by asking you to think critically about some fundamental questions: Where do your data come from? Who made them and why? What do they look like? And so on.

In addition, Module 3 asks you to grapple with how we can move toward making causal claims in the social sciences as well as analyze how scholars (i.e., Koch and Nicholson) have tried to overcome obstacles to causal inference.

Objectives

  • Explain the distinction between experimental and observational data; explain why random assignment to treatment can facilitate causal inference
  • Lay out obstacles to causal inference when using observational data as well as some strategies for (potentially) addressing them; analyze scholarship in light of these obstacles and strategies.
  • Develop additional tools to assist data exploration in R.

Assignment 1

Reminder: Assignment 1 is due by 5pm CST on Monday 2/1. I suggest that you download and preview Assignment 1 early in the week so that you can plan your time accordingly.

Where do your data come from?

Watch the brief video “Where do your data come from?” Like last week, the video will sometimes ask you to press “pause” and answer some “class questions.”

Now watch the video “Where do your data come from? (Part 2)”.

I mentioned in the videos that our main aim as social scientists is to make valid causal claims. But, as the videos also noted, there are a host of obstacles in our path, including bad data.
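A tiny simulation can make one point from the videos concrete: random assignment tends to balance pre-treatment characteristics across groups. This is my own illustration, not something from the videos:

```r
# With random assignment, a pre-treatment covariate (here, "age") is
# balanced across treatment and control on average -- so it cannot
# systematically confound the treatment effect.
set.seed(42)
n     <- 10000
age   <- rnorm(n, mean = 40, sd = 10)        # pre-treatment covariate
treat <- sample(c(0, 1), n, replace = TRUE)  # coin-flip assignment

# The difference in mean age between the groups is close to zero:
mean(age[treat == 1]) - mean(age[treat == 0])
```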

Read Martin, Thinking through Statistics, Chapter 2 (“Know your data”). As you read, pay close attention to different data problems and, in particular, how those data problems have at times ruined prominent research. What can we do to keep ourselves from ruin?

Toward causal inference

Now watch the final video on “Toward causal inference.”

The video suggests that we can sometimes move toward causal inference by moving from the level of general laws to the level of mechanisms. How does this play out in scholarship?

Read Koch and Nicholson (2016, pp. 932–938). As you read, pay close attention to the authors’ argument and discussion of their empirical setup.

(1) What do the authors argue? (You might even take some time to diagram their argument). (2) How do the authors try to move from establishing mere correlation to identifying causation? (3) Why does the structure of their argument potentially help them toward this end?

Now examine the regression table from the authors’ aggregate analysis. (4) Why do the authors control for the variables that they do? (5) How compelling is the aggregate analysis for helping the authors to make a causal claim? (6) Whether you think it is compelling or not, what work is the aggregate analysis doing for the authors?

Practice questions

In the spirit of really getting to know our data, the practice questions for this week will introduce some techniques that will ease the data cleaning and exploration process. In particular, we will practice using loops with for() as well as the dplyr package to summarize and reshape our data.
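If you'd like a taste before opening the practice questions, here is a minimal sketch with toy data of my own (not the course dataset):

```r
library(dplyr)

# A toy data frame:
df <- data.frame(region = c("N", "N", "S", "S", "S"),
                 gdp    = c(3.1, 2.8, 1.9, 2.2, 2.0))

# A for() loop that walks over the columns:
for (col in names(df)) {
  cat(col, "has", length(unique(df[[col]])), "unique values\n")
}

# dplyr to summarize by group:
by_region <- df %>%
  group_by(region) %>%
  summarize(mean_gdp = mean(gdp), n = n())
by_region
```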

Download the “Module 3 practice questions” 1 and 2 RMD from Canvas. Note that there are two parts to the practice questions. Also note that the practice questions on loops have a couple of especially challenging questions (e.g., the de Montmort Problem). I suggest that you complete the simpler questions and then move on. You can return to the more challenging ones after you have completed Part 2 as well as Assignment 1.

Module 2: Cases, variables, and measurement

Overview

Module 2 asks you to engage some fundamental questions: What are data? Cases? Variables? It additionally asks you to grapple with critical issues surrounding measurement as well as consider how these issues play out in Daniel Treisman’s famous article on “The causes of corruption” (2005). Finally, the Module introduces important statistical and visual tools for exploring and describing variables.

Objectives

  • Define data, cases, and variables
  • Explain the qualities of “good” measurement; begin to analyze scholarship in light of these qualities
  • Grasp the intuitions behind common measures of central tendency and spread; become familiar with their notation and learn to calculate them in R
  • Explore data in R using summary functions, tables, and plots

Exercise 1

Download, complete, and submit Exercise 1 by 5pm CST on Monday, 1/25. The file is available at Canvas. I recommend that you preview the Exercise early in the week so that you can plan your time accordingly.

What is data?

Read OIS sections 1.1–1.5. Some of this material may be review, but don’t worry if it isn’t!

Now watch the brief video on “What is data?”

Variables and measurement

As the video mentions, we will dig into some actual data as we complete the practice questions. But first, watch the video on “Variables and measurement.” Note that the video will occasionally ask you to press “pause” and then spend some time answering some “class questions.” You don’t have to submit your answers to these questions. However, quickly jotting your answers down may be useful, as we will discuss some of the questions during our section meetings.

How do issues surrounding conceptual clarity, validity, and reliability play out in social science scholarship? In a moment, you will read “The causes of corruption,” by Daniel Treisman (2005). This is a famous article that uses linear regression to test common hypotheses about the causes of corruption worldwide. Before you read, take some time to answer some questions. (As above, I suggest that you jot your answers down somewhere): (1) What is corruption? Define it. (2) Imagine that you wanted to measure corruption levels across countries. How would you go about doing that? What kind of data would you look for? (3) What are some advantages of your approach? What are some of its disadvantages?

Now read the article.

Once you have finished reading, answer these questions: (1) Do you think that Transparency International (TI) Index scores are valid and reliable as a measure of corruption? (2) Treisman makes the case that they are. What does he argue to support the validity and reliability of his measure? (3) Are you persuaded? Why or why not?

Describing variables

Now read OIS sections 1.6 and 1.7. Then, watch the video on “Describing Variables: Measures of Central Tendency and Spread.”
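As a quick preview of what the video covers, here are the standard measures in R, applied to a toy vector of my own (note the outlier):

```r
x <- c(2, 4, 4, 5, 7, 9, 30)  # 30 is an outlier

mean(x)    # pulled upward by the outlier
median(x)  # robust to the outlier
sd(x)      # standard deviation
IQR(x)     # interquartile range
```

Comparing mean(x) to median(x) here illustrates why we often report both.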

And now, watch the video on “Describing Variables: Tables and Plots.”
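Here is a minimal preview of these tools in base R, again with toy data of my own:

```r
# A categorical variable:
party <- c("Dem", "Rep", "Dem", "Ind", "Rep", "Dem")

table(party)           # frequency table
barplot(table(party))  # bar plot of those frequencies

# A numeric variable:
hist(rnorm(100))       # histogram
```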

Let’s apply some of the concepts presented above! Download the “module 2 practice questions” RMD from Canvas (“Files/Practice Questions”). Note that there are 2 parts to the practice questions. Complete the practice questions in RMD and then knit your files to PDF.

Remember: If you get stuck at any point… breathe. Coding can be frustrating at first, but we will work through it together. There are lots of ways to seek help:

  1. Use the “Help” tab in RStudio
  2. Internet search
  3. Post your question to Ed Discussion
  4. As a final option, email me directly or visit me during my office hours

As you seek help, try to specify the nature of the problem: Examine any warnings or error messages. What line of code seems to be the issue? Which function, specifically? (During knitting, Markdown will often tell you which line of code is stalling the knitting process.) If you are getting error messages, are you missing parentheses, commas, or quotations? (This happens to me all the time.) Answering these questions will help to ensure that you get the help you need.