Using scatterplots to explore correlations

Scatterplots are very useful for seeing if there’s a relation between two sets of data.

With this tutorial, we’re going to start from scratch with our data and use qplot to explore the data we have and then create a publication-ready graphic to present what we found.

Working the data

Let’s start by loading in the libraries and data we want to use.

# load the libraries 
library(readr)
library(ggplot2)
library(ggthemes)

# load in the data we want to explore
df <- read_csv("ElginUOF.csv")

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Year_Quarter = col_character()
## )

## See spec(...) for full column specifications.

Now, let’s use summary() to get an idea of the data. But let’s go a step further and store the result of summary() in a dataframe so we can refer to it anytime we want.

dfsum <- data.frame(unclass(summary(df)), check.names = FALSE, stringsAsFactors = FALSE)

dfsum

##         Year_Quarter       Total_CFS    Total_arrests Total_RTR_incidents
## 1 Length:12          Min.   :18178   Min.   : 889.0       Min.   :25.00  
## 2 Class :character   1st Qu.:19663   1st Qu.: 947.8       1st Qu.:32.00  
## 3 Mode  :character   Median :21544   Median : 994.5       Median :35.50  
## 4               <NA> Mean   :21341   Mean   :1013.2       Mean   :39.67  
## 5               <NA> 3rd Qu.:22753   3rd Qu.:1046.2       3rd Qu.:50.50  
## 6               <NA> Max.   :24715   Max.   :1246.0       Max.   :56.00  
##          SOF_only        UOF_only Total_transitions Handgun_drawn_NP
## 1 Min.   : 6.00   Min.   :15.00    Min.   : 2.000    Min.   :0.000  
## 2 1st Qu.: 9.75   1st Qu.:16.00    1st Qu.: 3.000    1st Qu.:1.000  
## 3 Median :12.00   Median :19.50    Median : 6.500    Median :3.000  
## 4 Mean   :11.67   Mean   :21.92    Mean   : 6.083    Mean   :3.417  
## 5 3rd Qu.:13.25   3rd Qu.:25.75    3rd Qu.: 8.000    3rd Qu.:5.250  
## 6 Max.   :19.00   Max.   :35.00    Max.   :12.000    Max.   :8.000  
##    Handgun_pointed Handgun_discharged   Total_handgun Traser_drawn_NP
## 1 Min.   : 2.000          Min.   :0   Min.   : 4.00   Min.   :0.000  
## 2 1st Qu.: 3.750          1st Qu.:0   1st Qu.: 7.00   1st Qu.:1.000  
## 3 Median : 6.500          Median :0   Median : 9.50   Median :2.000  
## 4 Mean   : 6.667          Mean   :0   Mean   :10.08   Mean   :2.083  
## 5 3rd Qu.: 9.250          3rd Qu.:0   3rd Qu.:13.50   3rd Qu.:3.000  
## 6 Max.   :12.000          Max.   :0   Max.   :18.00   Max.   :6.000  
##      Taser_pointed     Taser_fired     Total_taser Officer_noInjuries
## 1 Min.   : 2.000   Min.   :1.000   Min.   : 5.00      Min.   :19.00  
## 2 1st Qu.: 3.750   1st Qu.:2.750   1st Qu.: 8.00      1st Qu.:26.50  
## 3 Median : 6.000   Median :4.500   Median :13.00      Median :32.00  
## 4 Mean   : 7.083   Mean   :4.667   Mean   :13.83      Mean   :34.08  
## 5 3rd Qu.: 9.250   3rd Qu.:6.500   3rd Qu.:19.50      3rd Qu.:43.75  
## 6 Max.   :16.000   Max.   :9.000   Max.   :27.00      Max.   :51.00  
##    Officer_minor    Officer_mayor   Total_officer Suspect_noInjuries
## 1 Min.   :1.00   Min.   :0.0000   Min.   :1.000      Min.   :18.00  
## 2 1st Qu.:3.75   1st Qu.:0.0000   1st Qu.:4.000      1st Qu.:25.00  
## 3 Median :5.50   Median :0.0000   Median :6.000      Median :28.50  
## 4 Mean   :5.25   Mean   :0.3333   Mean   :5.583      Mean   :32.42  
## 5 3rd Qu.:7.00   3rd Qu.:1.0000   3rd Qu.:7.000      3rd Qu.:42.00  
## 6 Max.   :8.00   Max.   :1.0000   Max.   :9.000      Max.   :47.00  
##      Suspect_minor    Suspect_major   Total_suspect
## 1 Min.   : 3.000   Min.   :0.0000   Min.   : 3.00  
## 2 1st Qu.: 4.750   1st Qu.:0.0000   1st Qu.: 4.75  
## 3 Median : 7.500   Median :0.0000   Median : 7.50  
## 4 Mean   : 7.083   Mean   :0.1667   Mean   : 7.25  
## 5 3rd Qu.: 8.000   3rd Qu.:0.0000   3rd Qu.: 8.25  
## 6 Max.   :14.000   Max.   :1.0000   Max.   :14.00

Now we can click on dfsum in the environment window of R studio and view the results as a dataframe. All the code that made dfsum is the kind of thing that we know works but that we don’t have to spend a lot of time understanding why it works. That can come later.

What we’re looking at is three years of data, by quarter, showing how many times officers had to use physical force to restrain suspects, how many times they used a taser or gun and how many times an officer or suspect was injured

This isn’t a big dataset - there are very few rows and columns But that means it’s also easy to understand what we’re seeing. When you start working with data that has 100s of rows and columns these same techniques are going to be vital to understanding what you have.

Right off the bat, I can see I need to create a column that only has years and not years + quarters. As long as I’m at it, I’ll store a sort order in another column.

I don’t know if I’m going to use these, but from experience I know this is the step where I would create them.

df <-df[order(df$Year_Quarter),]
df$sort <- seq.int(nrow(df))

The first command sorts the dataframe by the Year_Quarter column, while the second command creates a column called “sort” and stores the row number in it.

Now we’re going to get a little tricky. We have a column - Year_Quarter - that we want to grab information from. Luckily, the information in that column is formatted the same way: YYYY QQ, where YYYY = the year and QQ equals the quarter. With a space, that’s seven characters in the column.

So we can use the “substr” command to select what we want.

# grab the year
df$year <- as.numeric(as.character( substr(df$Year_Quarter, start=1, stop=4) ))
# grab the quarter
df$quarter <- substr(df$Year_Quarter, start=6, stop=7)

The first command selects the first four characters and stores them as a number (as.numeric(as.character) in a column called “year.” The second line selects the last two characters and stores them in “quarter.” We don’t need to specify a data type since it’s already considered a character.

We’ve done some work with the dataframe - let’s update dfsum. We’ll simply run the command we did before and overwrite what we did previously.

dfsum <- data.frame(unclass(summary(df)), check.names = FALSE, stringsAsFactors = FALSE)
dfsum

##         Year_Quarter       Total_CFS    Total_arrests Total_RTR_incidents
## 1 Length:12          Min.   :18178   Min.   : 889.0       Min.   :25.00  
## 2 Class :character   1st Qu.:19663   1st Qu.: 947.8       1st Qu.:32.00  
## 3 Mode  :character   Median :21544   Median : 994.5       Median :35.50  
## 4               <NA> Mean   :21341   Mean   :1013.2       Mean   :39.67  
## 5               <NA> 3rd Qu.:22753   3rd Qu.:1046.2       3rd Qu.:50.50  
## 6               <NA> Max.   :24715   Max.   :1246.0       Max.   :56.00  
##          SOF_only        UOF_only Total_transitions Handgun_drawn_NP
## 1 Min.   : 6.00   Min.   :15.00    Min.   : 2.000    Min.   :0.000  
## 2 1st Qu.: 9.75   1st Qu.:16.00    1st Qu.: 3.000    1st Qu.:1.000  
## 3 Median :12.00   Median :19.50    Median : 6.500    Median :3.000  
## 4 Mean   :11.67   Mean   :21.92    Mean   : 6.083    Mean   :3.417  
## 5 3rd Qu.:13.25   3rd Qu.:25.75    3rd Qu.: 8.000    3rd Qu.:5.250  
## 6 Max.   :19.00   Max.   :35.00    Max.   :12.000    Max.   :8.000  
##    Handgun_pointed Handgun_discharged   Total_handgun Traser_drawn_NP
## 1 Min.   : 2.000          Min.   :0   Min.   : 4.00   Min.   :0.000  
## 2 1st Qu.: 3.750          1st Qu.:0   1st Qu.: 7.00   1st Qu.:1.000  
## 3 Median : 6.500          Median :0   Median : 9.50   Median :2.000  
## 4 Mean   : 6.667          Mean   :0   Mean   :10.08   Mean   :2.083  
## 5 3rd Qu.: 9.250          3rd Qu.:0   3rd Qu.:13.50   3rd Qu.:3.000  
## 6 Max.   :12.000          Max.   :0   Max.   :18.00   Max.   :6.000  
##      Taser_pointed     Taser_fired     Total_taser Officer_noInjuries
## 1 Min.   : 2.000   Min.   :1.000   Min.   : 5.00      Min.   :19.00  
## 2 1st Qu.: 3.750   1st Qu.:2.750   1st Qu.: 8.00      1st Qu.:26.50  
## 3 Median : 6.000   Median :4.500   Median :13.00      Median :32.00  
## 4 Mean   : 7.083   Mean   :4.667   Mean   :13.83      Mean   :34.08  
## 5 3rd Qu.: 9.250   3rd Qu.:6.500   3rd Qu.:19.50      3rd Qu.:43.75  
## 6 Max.   :16.000   Max.   :9.000   Max.   :27.00      Max.   :51.00  
##    Officer_minor    Officer_mayor   Total_officer Suspect_noInjuries
## 1 Min.   :1.00   Min.   :0.0000   Min.   :1.000      Min.   :18.00  
## 2 1st Qu.:3.75   1st Qu.:0.0000   1st Qu.:4.000      1st Qu.:25.00  
## 3 Median :5.50   Median :0.0000   Median :6.000      Median :28.50  
## 4 Mean   :5.25   Mean   :0.3333   Mean   :5.583      Mean   :32.42  
## 5 3rd Qu.:7.00   3rd Qu.:1.0000   3rd Qu.:7.000      3rd Qu.:42.00  
## 6 Max.   :8.00   Max.   :1.0000   Max.   :9.000      Max.   :47.00  
##      Suspect_minor    Suspect_major   Total_suspect            sort
## 1 Min.   : 3.000   Min.   :0.0000   Min.   : 3.00   Min.   : 1.00  
## 2 1st Qu.: 4.750   1st Qu.:0.0000   1st Qu.: 4.75   1st Qu.: 3.75  
## 3 Median : 7.500   Median :0.0000   Median : 7.50   Median : 6.50  
## 4 Mean   : 7.083   Mean   :0.1667   Mean   : 7.25   Mean   : 6.50  
## 5 3rd Qu.: 8.000   3rd Qu.:0.0000   3rd Qu.: 8.25   3rd Qu.: 9.25  
## 6 Max.   :14.000   Max.   :1.0000   Max.   :14.00   Max.   :12.00  
##             year            quarter
## 1 Min.   :2014   Length:12         
## 2 1st Qu.:2014   Class :character  
## 3 Median :2015   Mode  :character  
## 4 Mean   :2015                 <NA>
## 5 3rd Qu.:2016                 <NA>
## 6 Max.   :2016                 <NA>

Focusing on a portion

There are some interesting columns in there. Let’s look at one question: As use of physical force increased, do injuries for officers or suspects also increase?

To answer that, we need to create a column that sums the RTR instances that involve use of force. For Response to resistance, we have

show of force only, which is some kind of warning that physical force will be used
use of force only, which is an officer using physical restraining technique like a hold, in response to resistance
and transition, which includes both showing and using force

To account for all instances where force was used, we need to add up transition and use of force and put that into a new column we’ll call “force.”

df$force <- df$UOF_only + df$Total_transitions

That’s easy. What’s great about R is how it assumes you want the value in UOF_only row 2 to be added to the value of Total_transitions row 2 and so on.

Now let’s select the columns we want to work with and store them in a new dataframe called “dfLook.”

dfLook <- subset(df, select = c("year","Year_Quarter","force","Total_officer","Total_suspect"))

Good, but there’s something we didn’t consider: officer and suspect injuries consist of major and minor injuries. Let’s add major injuries to our new dataframe.

Then let’s save it so we don’t have to recreate the wheel each time.

dfLook$Officer_major <- df$Officer_mayor
dfLook$Suspect_major <- df$Suspect_major
write_csv(dfLook,"dfLook.csv")

We can do this because we didn’t change the order of the rows at an point. If we had sorted dfLook in a different way then the columns we added wouldn’t be matched to the right row.

Let’s take a look at the new dataframe.

summary(dfLook)

##       year      Year_Quarter           force       Total_officer  
##  Min.   :2014   Length:12          Min.   :18.00   Min.   :1.000  
##  1st Qu.:2014   Class :character   1st Qu.:22.75   1st Qu.:4.000  
##  Median :2015   Mode  :character   Median :24.50   Median :6.000  
##  Mean   :2015                      Mean   :28.00   Mean   :5.583  
##  3rd Qu.:2016                      3rd Qu.:36.50   3rd Qu.:7.000  
##  Max.   :2016                      Max.   :42.00   Max.   :9.000  
##  Total_suspect   Officer_major    Suspect_major   
##  Min.   : 3.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 4.75   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 7.50   Median :0.0000   Median :0.0000  
##  Mean   : 7.25   Mean   :0.3333   Mean   :0.1667  
##  3rd Qu.: 8.25   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :14.00   Max.   :1.0000   Max.   :1.0000

Right away we can see there’s more injuries to suspects than officers, but for officers there are slightly more major injuries.

Experimenting with qplot

Let’s focus on officers.

qplot(Total_officer,
      data=dfLook,
      fill=factor(year),
      bins = 12
)

There were more quarters with higher numbers of officer injuries in 2014 than in subsequent years.

Here’s how that looks with a density plot

qplot(Total_officer, data=dfLook, 
      geom="density",
      fill=factor(year),
      color=factor(year),
      linetype = factor(year),
      alpha=I(.5)
)

How about all incidents where force was used

qplot(force,
      data=dfLook,
      fill=factor(year),
      bins = 12
)

Clearly more quarters with higher numbers of use-of-force in 2016 than previous years. With density:

qplot(force, data=dfLook, 
      geom="density",
      fill=factor(year),
      color=factor(year),
      linetype = factor(year),
      alpha=I(.5)
)

It doesn’t appear that there will be much of a correlation between incidents where force was used and injuries to officers.

Testing relationships in the data

Let’s compare force and officer injuries with a scatter plot.

qplot(force,Total_officer,
      data=dfLook, 
      xlab="UOF incidents", 
      ylab="officer injuried") + 
  stat_smooth(method="lm")

There does seem to be a slight rise to the trend line.

The dark area associated with the trend line is the 95% confidence interval. That’s a prediction that if more data was added, 95% of the time it would fall within that area. As you can see, the area is pretty wide so there’s not a lot of use in that prediction.

Let’s take a closer look at the number behind that trend line.

fit_officer <- lm(force ~ Total_officer, data = dfLook)
summary(fit_officer)

## 
## Call:
## lm(formula = force ~ Total_officer, data = dfLook)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.344  -5.950  -1.957   6.713  14.481 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    23.3944     7.1031   3.294   0.0081 **
## Total_officer   0.8249     1.1936   0.691   0.5052   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.517 on 10 degrees of freedom
## Multiple R-squared:  0.04559,    Adjusted R-squared:  -0.04986 
## F-statistic: 0.4776 on 1 and 10 DF,  p-value: 0.5052

Lots to uppack here. Honestly, I am not qualified to explain in detail these more advanced statistic measures.

What follows is a very over-simplified explanation. Look at the last set of three lines, that starts with “Residual standard error.”

Generally, you want your Adjusted R-square to be as close to 1.00 as possible. Here it’s -0.04986. Negative numbers are fine. But it’s not very close to 1.00. Only about 4.9 percent of the observations can be explained by the relationship between the two data sets. So not much of a relationship.
The p-value should be less that 0.05. Here it’s well over: 0.5052.

HOWEVER: Don’t expect too much out of these stats - we’re using a very small dataset so it’s not going to be very informative and may even cause confusing results. Look at the portion labeled Coefficients:

Std. Error and t value. You want the t values to be as large as possible versus the std errors. Neither are. This is kind of like margin of error. If, for instance, a poll says two candidates are at 51% vs. 49% and the margin of error is five percentage points, the candidate with 49% could actually be at 54%.
The Pr(>|t|) or p-value uses those T values and Error to compute the P-value. The top one is well under 0.05 (0.0081) which is good. It has ** which denotes ( under signif.codes ) that it’s close to zero in whether the result is random versus signifcant. If it had *** it would be considered significant. But the bottom one is 0.5052 with is really bad!

Statistics is a science. Our purpose in graphing these values out is to see how the data relates to the regression line and maybe spot outliers. We will seldom use T values or R2 in our stories.

But there’s nothing wrong in knowing about these things. And in fact understanding when it’s telling you that something’s wrong can be very useful.

Here’s a link to a resource explaining what you see with the summary https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

Using facet grids

We can take a deeper look at the data by pulling out aspects of it into their own charts. Let’s start with major injuries.

qplot(force,Total_officer,
      data=dfLook) + 
  stat_smooth(method="lm") +
  facet_grid(. ~ factor(Officer_major))

In terms of major injuries, there are either none in a quarter or there is one. Therefore, when we add facet_grid(. ~ factor(Officer_major) to the chart, it groups the quarters accordingly.

And with that, we can see major injuries occured when there were relatively few incidents of of force.

What can we see when we facet by year:

qplot(force,Total_officer,
      data=dfLook) + 
  stat_smooth(method="lm") +
  facet_grid(. ~ factor(year))

Three years results in three grids. In 2014 and 2015, injuries did tend to be higher when force was used most. But in 2016, the quarters where there were the most use of force incidents did not result in the highest number of injuries.

In fact, the one quarter in 2016 with the highest number of force incidents out of all three years had fewer officer injuries than any quarter in 2014.

Let’s do one more facet grid, with both years and major injuries.

qplot(force,Total_officer,
      data=dfLook, 
      xlab="UOF incidents", 
      ylab="officer injuried") + 
  stat_smooth(method="lm") +
  facet_grid(factor(Officer_major) ~ factor(year))

## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in qt((1 - level)/2, df): NaNs produced

The more you facet a dataset, the more you can confuse yourself. Here it’s pretty easy.

First, across the three columns, we’ve divided the data by years. Then, top and bottom rows are quarters where there were or were not major injuries. Again, there’s either zero major injuries or one. If there was a quarter with seven major injuries, we’d have three rows labeled 0, 1 and 7.

So this is pretty useful. First off, 2014 had two quarters with a major injury vs. 2015 and 2016. And these happened when there were generally fewer force incidents than in subsequent years.

But look at 2016: without the one quarter that had a major injury, there’s actually a negative, downward trend in terms of use of force vs. officer injuries. In other words, even though use of force in response to resistance was increasing, injuries to officer were going down.

Plot for publication

Let’s create a final scatter plot to show what we found.

#---------------------
# This function set styles for the chart
# Be sure to run it before you plot

theme_gfx <- function(...) {
  theme_set(theme_get() + theme(text = element_text(family = 'Verdana', size= 12, lineheight=0.9))) + 
    theme(
      # edit background colors
      plot.background = element_blank(),
      legend.background = element_blank(),
      panel.background=element_rect(fill="#E5E5E5"),
      strip.background=element_rect(fill="#E5E5E5"),
      # modify grid and tick lines
      panel.grid.major = element_line(size = .6, color="#D2D2D2"),
      panel.grid.minor = element_line(size = .6, color="#D2D2D2", linetype = "dashed"),
      axis.ticks = element_blank(),
      # edit font sizes
      plot.title = element_text(size = 27,face="bold"),
      plot.subtitle = element_text(size = 18),
      #legend.title=element_text(size = 13,face="bold"),
      legend.text=element_text(size=14.7),
      axis.title=element_text(size=15, face="bold"),
      axis.text=element_text(size=13.5),
      plot.caption=element_text(size=13.5, hjust=0),
      strip.text = element_text(face="bold", size=13.5, hjust=0),
      # This puts the legend across the top
      legend.position="top", 
      legend.direction="horizontal",
      # removes label for legend
      legend.title = element_blank(),
      ...
    )
}

#-----------------
# create the plot

officer <- ggplot(dfLook) +
  aes(x=force, 
      y=Total_officer
  ) + 
  geom_point(stat="identity", 
             size = 3,
             shape=21,
             #alpha=I(.5),
             aes(
               color=factor(year),
               fill=factor(year) )
  ) +
  geom_smooth(method=lm,se=FALSE) +
  theme_gfx() +
  scale_y_continuous(limits = c(0, 10), breaks=c(seq(0,10,2)) ) #+
  #scale_x_continuous(limits = c(0, 44), breaks=c(seq(0,44,2)) )

# add all the titles.
officer <- officer + 
  labs(
    title="Headline",
    subtitle="Intro",
    x="USE OF FORCE INCIDENTS",
    y="INJURIES TO OFFICERS",
    caption="\nSource: Elgin police")

# color scheme - comment out for B/W PDF
officer <- officer + scale_colour_manual( values = c("#0063A5", "#DE731D", "#009964", "#DA2128", "#6F2C91") ) + scale_fill_manual( values = c("#0063A5", "#DE731D", "#009964", "#DA2128", "#6F2C91") )
# make B/W PDF - remember to change name of file!
#officer <- officer + scale_colour_grey(start = 0, end = 0.75) + scale_fill_grey(start = 0, end = 0.75)

officer

By this time, you should be able to understand most of what we’ve done. A few notes:

We used factor(year) on both color and fill. Color makes the outline for the shape (21) the same color as the fill. If we had so many points that we needed to make them transparent ( with alpha=I(.5) ), the lines would still be solid while the fill would be transparent. Just a design choice.

The other thing we did is hide the confidence interval around the trendline with geom_smooth(method=lm,se=FALSE). With so few data points,the interval didn’t add much. Plus we’d have to explain it in an extensive note.

Finally, although we adjusted the y axis we did not do the same for the x axis. Does it need the zero point or is the labeling on there clear enough?

As far as what the chart shows, the intro could be something like this: In spite of officers increasing their use of hand-to-hand techniques in response to resistance, the number of officers injured has not increased at the same rate.

Try it yourself

Using the code and data we already have, see how injuries to suspects compare to use of force incidents.

15_Alltogether