Preparing your data for ggplot2

This guide probably has the least code, but it may contain the hardest concept to grasp.

The reason is that ggplot2 wants the values you plan to plot to be the focal point of your dataframe, with the remaining columns describing those values.

To make that clear, let’s load in the data and libraries we’ll need.

library(readr)
library(tidyr)

df <- read_csv("dfCrime.csv")
## Parsed with column specification:
## cols(
##   Year_Quarter = col_character(),
##   year = col_integer(),
##   quarter = col_character(),
##   Total_CFS = col_integer(),
##   Total_arrests = col_integer(),
##   Total_RTR = col_integer(),
##   SOF_only = col_integer(),
##   UOF_only = col_integer(),
##   Transitions = col_integer()
## )
head(df)
## # A tibble: 6 x 9
##   Year_Quarter  year quarter Total_CFS Total_arrests Total_RTR SOF_only
##          <chr> <int>   <chr>     <int>         <int>     <int>    <int>
## 1      2014 1Q  2014      1Q     19217           989        32       12
## 2      2014 2Q  2014      2Q     21265          1178        25        7
## 3      2014 3Q  2014      3Q     21994          1246        36       11
## 4      2014 4Q  2014      4Q     18182          1047        28        6
## 5      2015 1Q  2015      1Q     18178          1014        34       10
## 6      2015 2Q  2015      2Q     19812           929        32        9
## # ... with 2 more variables: UOF_only <int>, Transitions <int>

The head command shows the first six rows of the dataframe.

df is set up like a typical spreadsheet: Rows contain all the information pertaining to the item in the first column, Year_Quarter. Each subsequent column contains only the information described by the column header.

So, Total_RTR is the total number of response-to-resistance (RTR) incidents. The next three columns (SOF_only, UOF_only and Transitions) are three categories of RTR incidents. If we add up these three for each row, we get the number in Total_RTR.
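The claim that the three categories add up to the total is easy to check in code. Here is a minimal sketch using a small made-up data frame called toy; the numbers are illustrative and are not taken from dfCrime.csv, only the column names match.

```r
# Illustrative values only; these are NOT the real dfCrime.csv figures.
toy <- data.frame(
  Year_Quarter = c("2014 1Q", "2014 2Q"),
  Total_RTR    = c(10L, 8L),
  SOF_only     = c(4L, 3L),
  UOF_only     = c(5L, 4L),
  Transitions  = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Add the three category columns together, row by row.
parts_sum <- toy$SOF_only + toy$UOF_only + toy$Transitions

# TRUE when the categories add up to Total_RTR in every row.
all(parts_sum == toy$Total_RTR)
```

Swapping toy for df should run the same comparison on the real data.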

Those three categories are parts of a whole, and ggplot2 wants the data arranged to reflect that: the category counts in a single column, with every other column describing those counts.

We can do that with the tidyr library. But first, let’s add a sort column to the dataset, as it’s something we’ll need in the future.

dfsort <- df[order(df$Year_Quarter), ]
dfsort$sort <- seq.int(nrow(dfsort))
names(dfsort)
##  [1] "Year_Quarter"  "year"          "quarter"       "Total_CFS"    
##  [5] "Total_arrests" "Total_RTR"     "SOF_only"      "UOF_only"     
##  [9] "Transitions"   "sort"
head(dfsort)
## # A tibble: 6 x 10
##   Year_Quarter  year quarter Total_CFS Total_arrests Total_RTR SOF_only
##          <chr> <int>   <chr>     <int>         <int>     <int>    <int>
## 1      2014 1Q  2014      1Q     19217           989        32       12
## 2      2014 2Q  2014      2Q     21265          1178        25        7
## 3      2014 3Q  2014      3Q     21994          1246        36       11
## 4      2014 4Q  2014      4Q     18182          1047        28        6
## 5      2015 1Q  2015      1Q     18178          1014        34       10
## 6      2015 2Q  2015      2Q     19812           929        32        9
## # ... with 3 more variables: UOF_only <int>, Transitions <int>, sort <int>

First we sort the dataframe based on the column Year_Quarter, then store the sorted information in a new dataframe called dfsort.

Then we number the rows from 1 up to the number of rows in the dataframe and store those numbers in a new column, dfsort$sort.
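To see the two steps in isolation, here is a small sketch on made-up rows that start out of order. The data frame toy and its values are illustrative assumptions, not part of the crime dataset.

```r
# Illustrative rows, deliberately out of order.
toy <- data.frame(
  Year_Quarter = c("2015 1Q", "2014 2Q", "2014 1Q"),
  Total_RTR    = c(9L, 8L, 10L),
  stringsAsFactors = FALSE
)

# order() returns the row positions that put Year_Quarter in ascending order;
# using them as a row index rearranges the whole data frame.
toy_sorted <- toy[order(toy$Year_Quarter), ]

# seq.int(n) produces 1, 2, ..., n, giving each row a running number.
toy_sorted$sort <- seq.int(nrow(toy_sorted))

toy_sorted
```

Sorting Year_Quarter as text works here because the year comes first and the quarters 1Q to 4Q fall in the right order alphabetically.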

Now let’s use tidyr to reshape our dataframe.

dfsub <- gather(dfsort, set, value, 7:9, factor_key=TRUE)
names(dfsub)
## [1] "Year_Quarter"  "year"          "quarter"       "Total_CFS"    
## [5] "Total_arrests" "Total_RTR"     "sort"          "set"          
## [9] "value"
head(dfsub)
## # A tibble: 6 x 9
##   Year_Quarter  year quarter Total_CFS Total_arrests Total_RTR  sort
##          <chr> <int>   <chr>     <int>         <int>     <int> <int>
## 1      2014 1Q  2014      1Q     19217           989        32     1
## 2      2014 2Q  2014      2Q     21265          1178        25     2
## 3      2014 3Q  2014      3Q     21994          1246        36     3
## 4      2014 4Q  2014      4Q     18182          1047        28     4
## 5      2015 1Q  2015      1Q     18178          1014        34     5
## 6      2015 2Q  2015      2Q     19812           929        32     6
## # ... with 2 more variables: set <fctr>, value <int>

It’s difficult to see in this guide, but run the script and then compare the new dataframe dfsub with the original df.

Here’s what we’ve done:

  • dfsub <- gather(dfsort, calls the tidyr function gather on dfsort and stores the result in dfsub.

  • 7:9 lists the columns to rework, in this case the three categories of RTR incidents.

  • set, value, creates two new columns from those three. set stores the names of the three category columns. value holds the numbers from those columns.

  • factor_key=TRUE stores set as a factor rather than as plain text, with its levels in the original column order.

Every other column from dfsort is kept, and its values are repeated once for each category.

As you can see, each row now holds a single number in the value column, while each of the three category names appears many times in the set column, once per quarter.
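The reshaping is easier to follow on a dataframe small enough to print in full. The sketch below applies the same gather call to a made-up two-row data frame; toy and its numbers are illustrative assumptions, and the category columns sit in positions 2 to 4 here rather than 7 to 9.

```r
library(tidyr)

# Illustrative values only; these are NOT the real dfCrime.csv figures.
toy <- data.frame(
  Year_Quarter = c("2014 1Q", "2014 2Q"),
  SOF_only     = c(4L, 3L),
  UOF_only     = c(5L, 4L),
  Transitions  = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Stack columns 2 to 4 into a name column (set) and a number column (value).
toy_long <- gather(toy, set, value, 2:4, factor_key = TRUE)

toy_long

# 2 quarters x 3 categories = 6 rows, one count per row.
nrow(toy_long)

# Each category name appears once per quarter.
table(toy_long$set)
```

Newer versions of tidyr recommend pivot_longer() for this kind of reshaping; gather() still works, so the guide's call is left as it is.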

The reason for this will become much clearer once we start plotting graphics more complicated than a simple bar chart. For now, especially if you’re used to working in Excel, this is a new way of thinking about your data, and it will likely take a little time to get used to.

We’ll finish this off by making the levels in set a bit more reader-friendly, because they’ll show up in our graphics as the text of legends.

dfsub$set <- factor(dfsub$set, levels = c("Transitions","SOF_only","UOF_only"),
                 labels = c("Transitions","Show of force","Use of force" ))

write_csv(dfsub,"dfsubset.csv")

dfsub$set <- factor(dfsub$set, takes the column set and saves it as a factor.

levels = c("Transitions","SOF_only","UOF_only"), lists the levels in set. If you recall, these were our column names. Note the order in which we list them: it is also the order in which they will be used in plots.

labels = c("Transitions","Show of force","Use of force" )) replaces those levels, in the same order, with new, more reader-friendly labels.
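Here is a short sketch of how levels and labels pair up, using a stand-in character vector rather than the real dfsub$set. The names set and set_f are illustrative.

```r
# A short character vector standing in for dfsub$set (illustrative).
set <- c("SOF_only", "UOF_only", "Transitions", "SOF_only")

# levels fixes the order; labels renames each level, position by position.
set_f <- factor(set,
                levels = c("Transitions", "SOF_only", "UOF_only"),
                labels = c("Transitions", "Show of force", "Use of force"))

levels(set_f)        # the new level order
as.character(set_f)  # the relabelled values, in their original row order
```

Any value that is not listed in levels becomes NA, so it is worth checking the spelling of each level against the original column names.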

Finally, we save the result as a CSV file, dfsubset.csv.
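One caveat, offered as a sketch rather than as part of the workflow above: a CSV file stores plain text, so a factor's level order is not saved with it. The example below uses base R's write.csv and read.csv with a temporary file so that it is self-contained; the data frame toy and the file path are illustrative assumptions. read_csv likewise reads text columns back as character by default.

```r
# Illustrative factor with a custom level order.
toy <- data.frame(
  set   = factor(c("Show of force", "Transitions"),
                 levels = c("Transitions", "Show of force")),
  value = c(4L, 1L)
)

# Write to a temporary file and read it straight back.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

class(toy$set)   # "factor"
class(back$set)  # "character": the level order was not stored in the file

# Re-create the factor after loading to restore the intended order.
back$set <- factor(back$set, levels = c("Transitions", "Show of force"))
levels(back$set)
```

If a later script depends on the level order, repeating the factor() call after loading dfsubset.csv is one way to get it back.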

Next, we’ll use the data we reconfigured to create more complex plots.