Building a Pivot Table in R (with a Bit of SQL)

Data, Dogs, Ducati (and Baseball)


Building a Pivot Table in R (with a Bit of SQL)

Tagged as: [ loops  style  R  vectorizing  ]

9 April 2018 – Building an Excel-style pivot table in R is not a problem. We are going to cover the process in this post. Like most data problems, the most involved part is cleaning and structuring the data in a manner suitable for the task. Those steps are decribed below, as well.

Building an Excel-style pivot table in R is not a problem. We are going to cover the process in this post.

Like most data problems, the most involved part is cleaning and structuring the data in a manner suitable for the task. Those steps are decribed below, as well.

The data we are working with in this example are utility readings for a collection of residential, commercial, and multi-family accounts for 2014 through 2017. We will create a pivot table giving the median value on a monthly basis for each of the three account types.

A key part of the exercise will be using a bit of SQL functionality built into R to join data tables.

Begin by loading the packages we will work with.

library(tidyverse)
library(magrittr)
library(lubridate)
library(reshape)
library(data.table)
library(huxtable)

Our data are found in four CSV files which we need to load. In a bit, we will combine the three main files and use the Bill file to classify the different accounts.

FY15 <- read.csv('https://raw.githubusercontent.com/seslezak/R-Code/master/data/Usage%20FYE15.csv', header = TRUE, na.strings=c("","NA"))
FY16 <- read.csv('https://raw.githubusercontent.com/seslezak/R-Code/master/data/Usage%20FYE16.csv', header = TRUE, na.strings=c("","NA"))
FY17 <- read.csv('https://raw.githubusercontent.com/seslezak/R-Code/master/data/Usage%20FYE17.csv', header = TRUE, na.strings=c("","NA"))
Bill <- read.csv('https://raw.githubusercontent.com/seslezak/R-Code/master/data/Billing.csv', header = TRUE, na.strings=c("","NA"))

We want to understand the data we are dealing with, and make sure the three main data files have the same basic structure, so we can look at feature names and the structure of the data. We see WATER records are just one of four factor levels in the service column. We can also see the Bill file contains the account numbers and service types. The other files contain only account numbers, so we need to use Bill to classify accounts by service types in the pivot table.

colnames(FY15)
## [1] "amount"     "bill_usage" "service"    "tran_date"  "account"
colnames(FY16)
## [1] "amount"     "bill_usage" "service"    "tran_date"  "account"
colnames(FY17)
## [1] "amount"     "bill_usage" "service"    "tran_date"  "account"
colnames(Bill)
## [1] "company"   "tran_date" "type_str"  "service"   "amount"    "rate_code"
## [7] "account"
str(FY15)
## 'data.frame':    54165 obs. of  5 variables:
##  $ amount    : num  22.8 30.9 24.9 22.8 24.9 ...
##  $ bill_usage: int  4 8 5 4 5 5 4 6 4 8 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 3 3 ...
##  $ tran_date : Factor w/ 22 levels "01-06-15","01-12-14",..: 19 20 14 21 15 12 18 13 19 20 ...
##  $ account   : Factor w/ 2324 levels "1-1051A","1-1085A",..: 1236 1236 1236 1236 1236 1236 1236 1236 1236 1236 ...
str(FY16)
## 'data.frame':    50997 obs. of  5 variables:
##  $ amount    : num  31 31 27.7 31 24.5 ...
##  $ bill_usage: int  500 500 400 500 300 400 500 300 300 400 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ tran_date : Factor w/ 34 levels "01-02-16","01-04-16",..: 31 32 24 33 26 34 28 20 29 21 ...
##  $ account   : Factor w/ 2264 levels "1-1051A","1-1085A",..: 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 ...
str(FY17)
## 'data.frame':    53560 obs. of  5 variables:
##  $ amount    : num  37.5 31 31 26 26 ...
##  $ bill_usage: int  700 500 500 400 400 500 500 400 700 500 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 3 3 ...
##  $ tran_date : Factor w/ 33 levels "01-01-17","01-03-17",..: 30 31 24 32 25 33 27 18 30 31 ...
##  $ account   : Factor w/ 2333 levels "1-1051A","1-1085A",..: 1242 1242 1242 1242 1242 1242 1242 1242 1242 1242 ...
str(Bill)
## 'data.frame':    2171 obs. of  7 variables:
##  $ company  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ tran_date: Factor w/ 3 levels "01-03-17","03-03-17",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ type_str : Factor w/ 1 level "Charge": 1 1 1 1 1 1 1 1 1 1 ...
##  $ service  : Factor w/ 1 level "WATER": 1 1 1 1 1 1 1 1 1 1 ...
##  $ amount   : num  29.9 10 29.9 18 10 ...
##  $ rate_code: Factor w/ 8 levels "W1C","W1M","W4C",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ account  : Factor w/ 2171 levels "1-1051A","1-1085A",..: 1158 1442 1600 823 718 399 1855 963 989 2029 ...

We can isolate the water usage data in the files. We will also remove the dollar amount columns since we don’t need that information. This will leave us with the features we need: usage, service, date, and account. We check the structure of the files to make sure they are what we expect.

FY15W <- FY15[FY15$service == 'WATER', ]
FY15W <- FY15W[, -1]
str(FY15W)
## 'data.frame':    26189 obs. of  4 variables:
##  $ bill_usage: int  4 8 5 4 5 5 4 6 5 4 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ tran_date : Factor w/ 22 levels "01-06-15","01-12-14",..: 19 20 14 21 15 12 18 13 22 16 ...
##  $ account   : Factor w/ 2324 levels "1-1051A","1-1085A",..: 1236 1236 1236 1236 1236 1236 1236 1236 1236 1236 ...
FY16W <- FY16[FY16$service == 'WATER', ]
FY16W <- FY16W[, -1]
str(FY16W)
## 'data.frame':    25116 obs. of  4 variables:
##  $ bill_usage: int  500 500 400 500 300 400 500 300 300 400 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ tran_date : Factor w/ 34 levels "01-02-16","01-04-16",..: 31 32 24 33 26 34 28 20 29 21 ...
##  $ account   : Factor w/ 2264 levels "1-1051A","1-1085A",..: 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 ...
FY17W <- FY17[FY17$service == 'WATER', ]
FY17W <- FY17W[, -1]
str(FY17W)
## 'data.frame':    26161 obs. of  4 variables:
##  $ bill_usage: int  700 500 500 400 400 500 500 400 500 600 ...
##  $ service   : Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ tran_date : Factor w/ 33 levels "01-01-17","01-03-17",..: 30 31 24 32 25 33 27 18 28 20 ...
##  $ account   : Factor w/ 2333 levels "1-1051A","1-1085A",..: 1242 1242 1242 1242 1242 1242 1242 1242 1242 1242 ...

In each of the files, we want to identify any missing data and remove the rows containing them. We check dimensions to make sure the changes took.

sum(is.na(FY15W))
## [1] 2
FY15W[rowSums(is.na(FY15W)) > 0,]
##       bill_usage service tran_date account
## 54163   44555800   WATER      <NA>    <NA>
which(FY15W == 44555800)
## [1] 26189
FY15W <- FY15W[-26189, ]
dim(FY15W)
## [1] 26188     4

We repeat the process for the other two data files.

sum(is.na(FY16W))
## [1] 0
sum(is.na(FY17W))
## [1] 2
FY17W[rowSums(is.na(FY17W)) > 0,]
##       bill_usage service tran_date account
## 53557   34111504   WATER      <NA>    <NA>
which(FY17W == 34111504)
## [1] 26161
FY17W <- FY17W[-26161, ]
dim(FY17W)
## [1] 26160     4

Many rows appear to contain zero values, so we should remove them, as well.

FY15W <- FY15W[apply(FY15W[1], 1, function(z) !any(z <= 0)), ]
FY16W <- FY16W[apply(FY16W[1], 1, function(z) !any(z <= 0)), ]
FY17W <- FY17W[apply(FY17W[1], 1, function(z) !any(z <= 0)), ]

One problem with the usage information for FY15 is the data are represented as hundreds of units. In the other files data are represented as actual units. So we need to adjust the FY15 data. Let’s create a new column called x100 in the FY15W data frame for the adjustments. Then we remove the original usage column, move our new column to the first position, and change its header back to the original bill_usage.

FY15W$x100 <- FY15W$bill_usage * 100
FY15W <- FY15W[, -1]
FY15W <- FY15W %>% select(x100, everything())
colnames(FY15W)[1] <- "bill_usage"

The format for the tran_date information isn’t useful so we will pull the month and year information out and create new columns for them.

FY15W$Date <- dmy(FY15W$tran_date)
FY16W$Date <- dmy(FY16W$tran_date)
FY17W$Date <- dmy(FY17W$tran_date)
FY15W <- FY15W %>% mutate(Month = lubridate::month(Date, label = TRUE))
FY16W <- FY16W %>% mutate(Month = lubridate::month(Date, label = TRUE))
FY17W <- FY17W %>% mutate(Month = lubridate::month(Date, label = TRUE))
FY15W <- FY15W %>% mutate(Year = year(Date))
FY16W <- FY16W %>% mutate(Year = year(Date))
FY17W <- FY17W %>% mutate(Year = year(Date))

With that done, we can remove the original tran_date columns.

FY15W <- FY15W[, -3]
FY16W <- FY16W[, -3]
FY17W <- FY17W[, -3]

Let’s check the column names to make sure everything is in order.

colnames(FY15W)
## [1] "bill_usage" "service"    "account"    "Date"       "Month"     
## [6] "Year"
colnames(FY16W)
## [1] "bill_usage" "service"    "account"    "Date"       "Month"     
## [6] "Year"
colnames(FY17W)
## [1] "bill_usage" "service"    "account"    "Date"       "Month"     
## [6] "Year"

The data we need from the Bill file are the rate codes and the account numbers. The other CSV files contain only account information. We need to tie those files to the rate codes in the Bill file so we can combine usage by group in our pivot table: residential, commercial, and multi-family users.

To do this, we will create a new data frame – RateAcct – that will contain this data. We can run the object to make sure we got it right.

RateAcct <- select(Bill, rate_code, account)
RateAcct
##      rate_code  account
## 1          WR1  3-1094N
## 2          WR1  4-9738L
## 3          WR1  6-1854C
## 4          WR1 14-6711Y
## 5          WR1 13-9642C
## 6          WR1 11-8362J
## 7          WR1  7-7712Q
## 8          WR1 15-6146W
## 9          WR1 15-7782L
## 10         WR1  8-9707E
##  [ reached getOption("max.print") -- omitted 2161 rows ]

We can get the information we need from the RateAcct data frame into the three main data files by using a left join, just as we would in SQL. A strength of R is its SQL functionality, which is contained in several packages.

Here, we use the left_join command found in the dplyr library. This will associate each account number with the appropriate service class for each of the three data CSVs.

FY15W <- FY15W %>% left_join(RateAcct, by = 'account')
sum(is.na(FY15W))
## [1] 1924
FY16W <- FY16W %>% left_join(RateAcct, by = 'account')
sum(is.na(FY16W))
## [1] 2024
FY17W <- FY17W %>% left_join(RateAcct, by = 'account')

We are approaching the point at which we need to combine the files into a single data frame, so let’s get a sense of the missing data present in the three CSVs.

sum(is.na(FY17W))
## [1] 844
sum(sum(is.na(FY15W)), sum(is.na(FY16W)), sum(is.na(FY17W)))
## [1] 4792

The rate_code information isn’t in a useful format. We need to create strings for “Residential”, “Commercial”, and “MultiFamily” accounts. These will be the headers for our pivot table so it is time to create them in the data frames.

The only information in the code useful for us are the letters R, C, and M. In each data frame, we need to remove all other characters and convert those letters into “Residential”, “Commercial”, and “MultiFamily” strings.

This could be accomplished with a loop, but R is a vector language and loops are prone to all kinds of problems in R. In this paricular case, we would have to use a loop and ifelse() statements, which is probably a bit too complex a solution for this problem. Though one could work, we will avoid using a loop here.

FY15W$rate_code <- str_replace_all(FY15W$rate_code, '[W1234]', '')
FY15W$rate_code <- str_replace_all(FY15W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'MultiFamily'))
FY16W$rate_code <- str_replace_all(FY16W$rate_code, '[W1234]', '')
FY16W$rate_code <- str_replace_all(FY16W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'MultiFamily'))
FY17W$rate_code <- str_replace_all(FY17W$rate_code, '[W1234]', '')
FY17W$rate_code <- str_replace_all(FY17W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'MultiFamily'))

We check NA sums again to make sure we haven’t added any blanks.

sum(sum(is.na(FY15W)), sum(is.na(FY16W)), sum(is.na(FY17W)))
## [1] 4792

Now, we can combine the three files into a single data frame, WaterData. We will also rename the columns in the new data table.

WaterData <- rbind(FY15W, FY16W)
WaterData <- rbind(WaterData, FY17W)
WaterData <- dplyr::rename(WaterData, Usage = bill_usage, Service = service, 
                       Account = account, Date = Date, Month = Month, Year = Year, 
                       Class = rate_code)

We need to convert the Class information to factors. Then we summarize the data frame to make sure all is in good shape.

WaterData$Class <- as.factor(WaterData$Class)
str(WaterData)
## 'data.frame':    74214 obs. of  7 variables:
##  $ Usage  : num  400 800 500 400 500 500 400 600 500 400 ...
##  $ Service: Factor w/ 4 levels "DWELLING C","FIRE STAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Account: chr  "3-1094N" "3-1094N" "3-1094N" "3-1094N" ...
##  $ Date   : Date, format: "2014-07-31" "2014-08-31" ...
##  $ Month  : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 7 8 9 10 11 4 5 6 12 1 ...
##  $ Year   : int  2014 2014 2014 2014 2014 2015 2015 2015 2014 2015 ...
##  $ Class  : Factor w/ 3 levels "Commercial","MultiFamily",..: 3 3 3 3 3 3 3 3 3 3 ...

There are some odd outlier values in the data so let’s identify them and clean them up. Remember, the data are expressed in actual usage, so some of these numbers don’t make sense. We can identify them and their location in the data, and remove the rows.

min(WaterData$Usage)
## [1] 1
which(WaterData == 1)
## [1] 67895
WaterData <- WaterData[-67895, ]
min(WaterData$Usage)
## [1] 3
which(WaterData == 3)
## [1] 67895
WaterData <- WaterData[-67895, ]
min(WaterData$Usage)
## [1] 100
max(WaterData$Usage)
## [1] 999900
which(WaterData == 999900)
## [1] 15937
WaterData <- WaterData[-15937, ]

Run a check to make sure we’ve removed the outliers.

max(WaterData$Usage)
## [1] 123800
summary(WaterData$Usage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     500     900    1544    1700  123800

There are still nearly 5000 NA values in our data. These are accounts without a class associated with them, so we do not know if they are Residential, Commericial, or Multi-Family. We can find their relative proportions by creating a proportion table. We see the vast majority of accounts are Residential. Commercial and Multi-Family accounts combined make up about 2.5% of all accounts.

sum(is.na(WaterData))
## [1] 4790
table(WaterData$Class) / length(WaterData$Class)
## 
##  Commercial MultiFamily Residential 
##    0.019633    0.005323    0.910498
prop.table(table(WaterData$Class))
## 
##  Commercial MultiFamily Residential 
##     0.02099     0.00569     0.97332

We could come up with a solution by assigning to each missing account type a value in proportion to that type’s numbers in the rest of the data set, but that is more than we need to do at this point. We will simply omit the NA values in the next step.

We are ready to create the pivot table from the data we have cleaned and structured.

First, we create a special data frame containing only data and features we need for the pivot table. We arrange all the data by Month then we group data by Class. We remove the Service, Account, Date, and Year features. We will also omit all the NA accounts we saw earlier.

WaterTable <- WaterData %>% arrange(Month) %>% group_by(Class) %>% select(-c(Service, Account, Date, Year)) %>% na.omit()

This leaves us with a table consisting only of Usage, Month, and Class. We convert that into a data table object we will use to create the final table. We build the table to provide the median usage by Month and Class.

The final product will be a 12 x 4 pivot table. One row is for each month, and the four columns will be Month, Commercial, MultiFamily, and Residential.

The WaterTable in the previous chunk has dimensions of 69,421 x 3. So we need to transform it one last time into the final shape. It might be helpful to visualize each step here so we can see the transformation take place.

We are creating a table, so we need to convert the data frame object “WaterTable”" into a data table object, which we will call “DataTable.”

DataTable<- data.table(WaterTable)
DataTable
##        Usage Month       Class
##     1:   400   Jan Residential
##     2:  1400   Jan Residential
##     3:   400   Jan Residential
##     4:   400   Jan Residential
##     5:   700   Jan Residential
##    ---                        
## 69417:  2700   Dec Residential
## 69418:   800   Dec Residential
## 69419:   200   Dec Residential
## 69420:  1500   Dec  Commercial
## 69421:  1200   Dec Residential

Next, we need to summarize Usage by median for each month and then group the medians by Month and Class. We will call this object “WaterResult.”

WaterResult <- DataTable[, median(Usage), by = .(Month,Class)]
WaterResult
##     Month       Class    V1
##  1:   Jan Residential   700
##  2:   Jan MultiFamily 10100
##  3:   Jan  Commercial  1200
##  4:   Feb Residential   600
##  5:   Feb MultiFamily  9200
##  6:   Feb  Commercial  1150
##  7:   Mar Residential   600
##  8:   Mar MultiFamily 11350
##  9:   Mar  Commercial  1100
## 10:   Apr Residential   900
## 11:   Apr MultiFamily 10800
## 12:   Apr  Commercial  1500
## 13:   May Residential  1000
## 14:   May MultiFamily 12350
## 15:   May  Commercial  1500
## 16:   Jun Residential  1100
## 17:   Jun MultiFamily 12150
## 18:   Jun  Commercial  1450
## 19:   Jul Residential  1400
## 20:   Jul MultiFamily 16100
## 21:   Jul  Commercial  2000
## 22:   Aug Residential  1400
## 23:   Aug MultiFamily 13400
## 24:   Aug  Commercial  1500
## 25:   Sep Residential  1300
## 26:   Sep MultiFamily 13100
## 27:   Sep  Commercial  1200
## 28:   Oct Residential  1200
## 29:   Oct MultiFamily 14800
## 30:   Oct  Commercial  1600
## 31:   Nov Residential  1000
## 32:   Nov MultiFamily 12900
## 33:   Nov  Commercial  1100
## 34:   Dec Residential   800
## 35:   Dec MultiFamily 10600
## 36:   Dec  Commercial  1300
##     Month       Class    V1

The final hurdle is spreading out the second column. We can see the Class column contains our three account types. We need to spread those out over three columns. In other words, we have a situation in which the information we want in columns is contained in rows. We can ue the spread() function to do exactly that.

Notice, what was the Usage column was renamed V1 by the system when we summarized the data. We need to refer to the new column name when creating the final object, “WaterOut.”

WaterOut <- spread(WaterResult, Class, V1)
WaterOut
##     Month Commercial MultiFamily Residential
##  1:   Jan       1200       10100         700
##  2:   Feb       1150        9200         600
##  3:   Mar       1100       11350         600
##  4:   Apr       1500       10800         900
##  5:   May       1500       12350        1000
##  6:   Jun       1450       12150        1100
##  7:   Jul       2000       16100        1400
##  8:   Aug       1500       13400        1400
##  9:   Sep       1200       13100        1300
## 10:   Oct       1600       14800        1200
## 11:   Nov       1100       12900        1000
## 12:   Dec       1300       10600         800

That’s the pivot table format we want. We have 12 rows and four columns. To finish the project, we use the kable() function from the knitr package to create a pretty table we can post in an HTML online.

Month Commercial MultiFamily Residential
Jan 1200 10100 700
Feb 1150 9200 600
Mar 1100 11350 600
Apr 1500 10800 900
May 1500 12350 1000
Jun 1450 12150 1100
Jul 2000 16100 1400
Aug 1500 13400 1400
Sep 1200 13100 1300
Oct 1600 14800 1200
Nov 1100 12900 1000
Dec 1300 10600 800
plot(WaterOut)

ggplot(WaterOut, aes(Month, Residential)) + geom_col()

ggplot(WaterOut, aes(Month, Commercial)) + geom_col()

ggplot(WaterOut, aes(Month, MultiFamily)) + geom_col()

Written on April 9, 2018