## Introduction to R and R Graphics
## Eric Guntermann
## December 13, 2013
## This workshop is an introduction to R and R graphics. It covers opening datasets,
## creating and manipulating data, running models, and making graphs.
## It covers linear models, logistic regression models,
## multinomial logistic regression models, and ordinal logistic regression models.
## Before we begin, it's important to understand some basic concepts in R. The most
## important concept in R is that of objects. Objects are computer creations that store
## information. R also runs commands, which are performed on objects, in turn creating
## other objects.
## Another important concept is that of packages. Commands in R are contained in packages,
## some of which are included by default in R andsome of which must be downloaded. In this
## workshop, we will use the car and effects packages, which aren't installed by default.
## To install the effects package, type:
install.packages("effects")
## Let's also install the car package, which contains some useful statistical commands.
install.packages("car")
## We will also use the foreign package, which is used to open data files from other
## statistical packages, such as STATA and SPSS. the foreign package is included in the
## R base installation. The lattice package, which we will use to make some advanced graphs,
## is also installed by default.
## You'll notice that most of the preceding lines have begun with ##. One or more # tells
## R that the rest of the line is a comment, which it should ignore.
## To begin, let's open our two datasets. First, we have to load the foreign package
## using the library function. Then we run the read.dta command, used for opening STATA
## files, on the name of our data file. We then assign the output of the command to
## the object qc12. The data is the 2012 Makind Electoral Democracy Work (MEDW) survey
## done for the Quebec provincial elections.
library(foreign)
## To read in the data, you have to set the working directory to the directory containing
## the dataset
setwd("/where/file/is") # In RStudio, You can also click on the "Session" menu, then
## "Set Working Directory", then "To Source File Location"
qc12 <- read.dta("qc12.dta", convert.factors=TRUE) # Read in the data
# conver.factors=TRUE preserves STATA value labels
## If this file were in SPSS format, you would type the following:
## qc12 <- read.spss("qc12.sav", to.data.frame=TRUE, use.value.labels=TRUE)
#W We can also open datasets in many other formats. Let's open a dataset in CSV.
prop <- read.csv("prop.csv")
## This second data set contains two measures of electoral disproportionality for 79
## elections. The data are from the Comparative Study of Electoral Systems (CSES).
## A Fundamental aspect of R is the search path, the search path tells us where R searchs
## for objects. It can be obtained using:
search()
## We see the .GlobalEnv, which is the workspace, as well as a list of packages which have
## either been loaded by default or that we have loaded ourselves.
## To find out the objects that exist in the workspace (.GlobalEnv), type:
ls()
## We can see that our two datasets appear in the search path, but note that the variables
## in each do not!
## Any object we create will appear here unless we assign it to a dataset. For example:
a <- "object 1"
ls()
# We can see that object a is now in the workspace. To print it we can simply type:
a
# To remove objects from the workspace type:
rm(a)
ls() #It's gone!
# To remove all objects, type:
rm(list=ls())
ls() # Empty!
# We now have to reload our datasets since we just removed them from our workspace.
qc12 <- read.dta("qc12.dta", convert.factors=TRUE)
prop <- read.csv("prop.csv")
# To find out which objects are in any other location in the search path, type a
# number in the parentheses.
# For example:
ls(2)
## To attach or not to attach?
# In R, we have two choices. Either we attach datasets so that variables are available
# in the workspace or we don't. In the latter case, we have to refer to the data set
# every time we want to acess a variable. This is done as follows:
# dataset$variable. For example, if we want to view the variable, mdm (mean district
# magnitude) in the prop dataset, we would type:
prop$mdm
# If we don't want to refer to both the dataset and the variable name every time we want
# to get data from a data set, we can simply attach the dataset
attach(prop)
# Now we can simply type use the variable name:
mdm
# We also see that the prop dataset is now in our search path
search()
# Two caveats:
# (1) This is convenient if you know you will only use one dataset in each session, but not
# if you plan on using more. We are using two datasets, so we will not leave any datasets
# attached. Variables in attached datasets can potentially mask other variables in our
# workspace, so we should be sure that attaching datasets is the best thing for us to do.
# (2) Make sure you always detach your datasets when you are done working with them.
detach(prop)
## Exploring data
## There are a number of functions in R for exploring data.
summary(prop$gallagher) # display some summary statistics when used on a numeric variable.
summary(prop$regime) # Gives frequencies for a categorical variable.
class(prop$mdm) # Tells us the type of variable
head(qc12$vote) # gives the first six values of a variable or of all the variables
# in a dataset.
## Creating and managing data
# The most basic structure for storing data in R is the vector. Vectors can be created
# simply by assigning sets of numbers, character strings, or any other type of set to an
# object. This is done using the c (concatenate) command
object <- c(1,2,3,4)
# Our new vector can now be seen by simply typing it
object
# The most important data structure we use when doing statisical analyses is the data frame
# It is a set of rows representing units and columns representing variables. View (Yes,
# that's a capital V!) can be used to see a whole data frame.
# head (shown above) gives the first six rows of a dataset.
View(qc12)
head(prop)
## Types of variables:
# Numeric:
summary(prop$mdm)
# Factors:
summary(qc12$vote)
# Creating a factor: a few different ways
agef <- cut(qc12$age, breaks=5) # Splits age into 5 equal sized intervals, which become
# the levels of a factor
summary(agef)
# Another way
myvar <- c(1,2,3,2,3,1,2,3,2,1) # Creates a numeric variable
summary(myvar)
myfactor <- factor(myvar) # Converts myvar to a factor
summary(myfactor)
levels(myfactor) <- c("low", "med", "hi") # Changes the factor level names
summary(myfactor)
# What if you wanted to combine factor levels? Let's say you want to include the
# observations that were coded as "med" in the category "hi". Just assign the level
# name "hi" to cases codes as "med"
myfactor2 <- myfactor # Let's copy our variable
levels(myfactor2) <- c("low", "hi", "hi") # Now we assign "med" cases the value of "hi"
summary(myfactor2)
# If you want to change the names of the levels of the factor:
myfactor3 <- myfactor
levels(myfactor3) <- c("Beginner", "Intermediate", "Advanced")
summary(myfactor3)
# Recoding numeric variables. Let's look at the age variable in qc12:
summary(qc12$age)
# Suppose we have reason to believe that the one person coded as being over 100 (112 years
# in this case) was not actually that old, but that it was simply a data entry error.
# If we believe the person's real age is 22, we could do the following:
age <- qc12$age # Copy the age variable. Notice that we are copying age from the qc12
# dataset to our workspace. Now when we simply type age we get our new age variable which is
# stored in our workspace, but when we type qc12$age we get age as stored in qc12.
age[age==112] <- 22 # This assigns the value 22 to all cases (there is one here) where age
# is coded 112. If we were doing this directly to a variable in a dataset we would type:
# qc12$age[qc12$age==112] <- 22
# The last example pointed to one very powerful tool in R: indexing. In R, data can be
# manipulated in an easy and very flexible way. Simply use the following form to produce
# an ouput object that is a subset of the input object:
# dataset[1:50, 1:5] # indexing using integers. Here's an example:
qc12[1:10, 1:5] # Picks out rows 1 to 10 and columns 1 to 5.
# We could assign this to an object if we wanted to use this subset of the dataset.
# or
# dataset$var1[dataset$var1>5] # indexing using logical conditions
qc12$vote[qc12$sex=="Female"] # We have to index using the correct number of dimensions
# A variable is represented as a vector so it has one dimension. A dataset has two.
## An important note about missing data
# Missing data in R are stored as NA. If we want to declare the previous age observation to
# be missing, we could type:
age <- qc12$age # Copy age again
age[age==112] <- NA
summary(age) # We now see one missing observation
# Let's set that obervation back to its original value. The is.na() function asks R for the
# indices of all cases of age that are coded as NA. We then access those observations and
# assign them the value of 112.
age[is.na(age)] <- 112
summary(age) # No NAs, 112 is the max
## Changing variable names. We can change individual variable names by assigining
# their contents to new objects and then possibly deleting the original variable:
qc2 <- qc12 # Create copy of dataset
qc2$ses <- qc2$income # Create copy of variable
qc2$income <- NULL # Delete original variable
# Another way is to change all variable names at the same time.
prop2 <- prop # Create copy of dataset
colnames(prop2) <- c("id", "disprop", "mdistrict", "system") # Set names of all variables in dataset
## Now that we've seen how to import our data and how to manage it, let's run some models.
# Before we do that, let's clear our workspace and reload our datasets
rm(list=ls())
qc12 <- read.dta("qc12.dta", convert.factors=TRUE)
prop <- read.csv("prop.csv")
## Let's run some models!
## 1. Continuous dependent variable
## Here our dependent variable is an ideology scale. We run a linear model.
mod1 <- lm(ideology ~ age + sex + french + education + log(income), data=qc12)
## We can get a summary of the results using:
summary(mod1)
## Anova performans an F test on our regression output. An F test allows us to appropriately
## test hypotheses on all our coefficients, including education, which involves 9 dummy
## variables. Anova comes with the car package. Let's load it.
library(car)
Anova(mod1)
## We see that education and log(income) have significant effects
## To visualize the effects of our variables, we first have to load the effects package.
library(effects)
## Let's look at the effect of log(income) on ideology. We saw that it has a significant
## effect, In one short line, we can create a quite informative graph.
plot(effect("log(income)", mod1))
## By default, effect includes a rug plot, which is not very informative for this variable.
## Let's remove it, by using the rug=FALSE option.
plot(effect("log(income)", mod1), rug=FALSE)
## The effects package, by default, holds other variables constant at interesting values.
## We should probably specify these just to be sure what we're looking at.
plot(effect("log(income)", mod1), rug=FALSE, given.values=c("age"="mean(age)",
"sex"="Female", "french"="TRUE", "education"="Bachelor's degree"))
## If our explanatory variable of interest is a dummy variable, effects can make a
## similarly interesting graph
# university is a dummy variable coded TRUE if people have at least some university
mod2 <- lm(ideology ~ age + sex + french + university + income, data=qc12)
Anova(mod2)
plot(effect("university", mod2), rug=FALSE, given.values=c("age"="mean(age)",
"sex"="Female", "french"="TRUE", "income"="mean(log(income))"))
## Interactions are not much harder to present. Consider the interaction between
## french and log(income)
mod3 <- lm(ideology ~ sex + french*log(income) + education, data=qc12)
Anova(mod3)
plot(effect("french*log(income)", mod3), rug=F, given.values=c("sex"="Female",
"education"="Bachelor's degree"))
## We see that income has a weaker effect on Francophones than on Anglophones.
## However, we might want to look at the other side of the interaction,
## i.e. how income conditions the effect of language. To do that we have to
## create the graph in two lines.
eff_freinc <- effect("french*log(income)", mod3)
## Remember how R is about applying commands to objects and producing other objects.
## Previously we applied two commands in one line. Here we just split it up in two, which
## is necessary to pick the variable we want to display.
plot(eff_freinc, x.var="french", given.values=c("sex"="Female", "education"="Bachelor's degree"))
## 2. Binary dependent variable
## Here our dependent variable is a binary variable representing support for Quebec sovereignty.
mod4 <- glm(sov ~ income + french, data=qc12, family=binomial)
Anova(mod4)
## Let's look at the effect of income
plot(effect("income", mod4), rug=FALSE, given.values=c("french"="French"))
## Now, how about the effect of language?
plot(effect("french", mod4), rug=FALSE, given.values=c("income"="mean(income)"))
## We can also interact these two variables
mod5 <- glm(sov ~ income*french, data=qc12, family=binomial)
Anova(mod5)
plot(effect("income*french", mod5), rug=FALSE)
## The effect of income is a lot weaker for Francophones
## 3. Nominal dependent variable
## Now we use multinomial logistic regression to look at the effect of
## language and ideology on vote choice, including the three largest parties in Quebec in
## 2012.
## The command multinom is in the nnet package, which we have to load, since it isn't
## loaded by default.
## To begin, let's look at the effect of a dichotomous variable, language, on vote choice.
library(nnet)
mod6 <- multinom(vote ~ sex + french, data=qc12)
Anova(mod6)
plot(effect("french", mod6), rug=FALSE, given.values=c("sex"="Female"))
## Now for the effect of a continuous independent variable, ideology.
mod7 <- multinom(vote ~ ideology + sex + french, data=qc12)
Anova(mod7)
plot(effect("ideology", mod7), rug=FALSE, given.values=c("sex"="Female", "french"="TRUE"))
## Now let's see if university conditions the effect of ideology on vote choice,
## by interacting our university dummy with the ideology scale.
mod8 <- multinom(vote ~ university*ideology + french + sex, data=qc12)
Anova(mod8)
plot(effect("university*ideology", mod8), rug=FALSE, given.values=c("french"="TRUE", "sex"="Female"))
## 4. Ordinal depednent variable
## Here we look at the effect of income on satisfaction with democracy in Quebec,
## an ordinal variable with three levels
library(MASS)
mod9 <- polr(satdem ~ age + french + income, data=qc12)
Anova(mod9)
plot(effect("income", mod9), rug=FALSE, given.values=c("french"="TRUE", "age"="mean(age)"))
#Another way to see this is using a stacked graph, although it doesn't give
## confidence bounds
plot(effect("income", mod9), style="stacked", rug=F, given.values=c("french"="TRUE", "age"="mean(age)"))
## We can also look at the effect of a dichotomous variable. Here, let's look at income in
## two categories
mod10 <- polr(satdem ~ age + french + income_2cat, data=qc12)
Anova(mod10)
plot(effect("income_2cat", mod10), rug=FALSE, given.values=c("french"="TRUE", "age"="mean(age)"))
# The effects package uses lattice to produce nice graphs that effectively present
# regression results. As you've seen, it's very easy to use. However, sometimes you might
# want to create something more specific taking advantage of the full power of lattice.
# Let's try using lattice manually. First, let's load the package.
library(lattice) # This step isn't necessary since the effects package already loaded lattice.
# It would be necessary though if we hadn't already used effects.
# The most common command in lattice is xyplot. It creates a scatterplot, with various types of
# add-ons, including text and linear regression lines.
# Lattice graphs are created in two steps. First, the main function (xyplot, histogram,
#etc.) sets up the general elements of the graph. Second, the panel function sets up things
# in the plotting region. Let's look at the relationship between Gallagher's index of
# disproportionality and mean district magnitude in the prop data.
xyplot(log(gallagher) ~ log(mdm), data=prop, # Sets up the graph
panel=function(x,y){ # Create panel function
panel.points(x,y, col="black") # Display points, set colour to black
panel.lmline(x,y) # display linear regression line
})
# You can split the graphs to condition on a third variable.
xyplot(log(gallagher) ~ log(mdm) | regime, data=prop, # condition on regime
panel=function(x,y){
panel.points(x,y, col="black")
panel.lmline(x,y, col="blue")
})
# We can add country-year labels by adding the line panel.text(x,y, labels=labelvar). To
# export a graph to a pdf. Add pdf("filename.pdf") before the code creating the graph
# and dev.off() after it. the commmand pdf() opens a pdf file and all subsequent
# graphs are sent to it. dev.off() closes the file.
# Example:
pdf("mygraph.pdf")
xyplot(log(gallagher) ~ log(mdm) | regime, data=prop,
panel=function(x,y,subscripts){
panel.text(x,y, labels=prop$countryyear[subscripts], cex=0.5, col="black")# Reduce text size
panel.lmline(x,y, col="blue")
})
dev.off() # The file mygraph.pdf is now in the R working directory
## What if you need help?
## Whenever you need help with R (and we all do from time to time), type help(functionname).
# For example:
help(effect)
## Before ending today's workshop, let's clean up our workspace and quit R.
rm(list=ls())
q()
## That's all for today. If you have any questions or need any help, don't hesitate to ask
# me. You can contact me at eric.guntermann@umontreal.ca