12 Predictive modeling III: Time series

In this chapter we will discuss how we can model and predict time series problems. For this analysis, we will use the new packages modeltime and timetk:

install.packages("timetk")
install.packages("modeltime")
library(timetk)
library(modeltime)
library(tidyverse)
library(tidymodels)
library(lubridate)

12.1 Data profiling

Let’s assume that we want to predict the total volume of traffic at an intersection for any given day in the future. This is a time series problem, because the total volume of traffic exhibits fixed-frequency patterns. For instance, each day of the week follows a similar baseline pattern. Months, quarters, and holidays also exhibit seasonal patterns. The dataset that we will use for this problem is stored in the trafficAggregated.csv file.

d = read_csv("../data/trafficAggregated.csv")
d %>% head
## # A tibble: 6 × 6
##   date_time  totalVolume  temp holiday       weather_description weather_main
##   <date>           <dbl> <dbl> <chr>         <chr>               <chr>       
## 1 2016-01-01       34489  266  New Years Day sky is clear        Clear       
## 2 2016-01-02       41018  266. None          sky is clear        Clear       
## 3 2016-01-03       28628  270. None          overcast clouds     Clouds      
## 4 2016-01-04       81891  263. None          mist                Mist        
## 5 2016-01-05       61120  269. None          overcast clouds     Clouds      
## 6 2016-01-06       68060  272. None          overcast clouds     Clear

To get a better view of the dataset, we can call the function plot_time_series, which plots the line chart of our dependent variable:

d %>% plot_time_series(date_time, totalVolume, .interactive = T)
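To examine the seasonal patterns mentioned earlier more directly, timetk also provides plot_seasonal_diagnostics, which draws box plots of the dependent variable across calendar units such as day of week and month. A sketch of the call (the exact panels shown depend on the frequencies timetk detects in the data):

```r
# Box plots of totalVolume by weekday, month, etc., to inspect seasonality
d %>% plot_seasonal_diagnostics(date_time, totalVolume, .interactive = T)
```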

12.2 Feature engineering

Similar to what we did in the regression and classification chapters, we need to split our data into a training and a testing set. However, because our data has a time component, the test set must come chronologically after the training set; otherwise, we would be using information from the future to predict the past. To make sure that our training and test sets are correctly defined, the package timetk provides the function time_series_split. Its parameter assess specifies how many days, weeks, months, or years the test set should span.

splits = d %>% time_series_split(date_var = date_time, assess = "6 months", cumulative = T)
trs = training(splits)
ts = testing(splits)
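As a quick sanity check, we can confirm that the two sets do not overlap in time by comparing their date ranges (the latest training date should fall before the earliest test date):

```r
# Compare the date ranges of the training and test sets
range(trs$date_time)
range(ts$date_time)
```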

Now we call the function tk_time_series_cv_plan together with plot_time_series_cv_plan to visualize the training and test sets that we have created:

splits %>% tk_time_series_cv_plan() %>% plot_time_series_cv_plan(date_time, totalVolume, .interactive = T)

Next, we can define the recipe for our time series models. On top of what we have learned so far, in the time series setting we also add the time component (date_time) to our set of independent variables as follows:

br = recipe(totalVolume ~ date_time + temp + holiday + weather_main, trs) %>% 
  step_dummy(all_nominal())
br %>% prep %>% juice %>% head
## # A tibble: 6 × 22
##   date_time   temp totalVolume holiday_Columbus.Day holiday_Independence.Day
##   <date>     <dbl>       <dbl>                <dbl>                    <dbl>
## 1 2016-01-01  266        34489                    0                        0
## 2 2016-01-02  266.       41018                    0                        0
## 3 2016-01-03  270.       28628                    0                        0
## 4 2016-01-04  263.       81891                    0                        0
## 5 2016-01-05  269.       61120                    0                        0
## 6 2016-01-06  272.       68060                    0                        0
## # … with 17 more variables: holiday_Labor.Day <dbl>,
## #   holiday_Martin.Luther.King.Jr.Day <dbl>, holiday_Memorial.Day <dbl>,
## #   holiday_New.Years.Day <dbl>, holiday_None <dbl>, holiday_State.Fair <dbl>,
## #   holiday_Thanksgiving.Day <dbl>, holiday_Veterans.Day <dbl>,
## #   holiday_Washingtons.Birthday <dbl>, weather_main_Clouds <dbl>,
## #   weather_main_Drizzle <dbl>, weather_main_Fog <dbl>,
## #   weather_main_Haze <dbl>, weather_main_Mist <dbl>, …
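If we wanted to expose calendar information (day of week, month, and so on) as explicit features, timetk offers step_timeseries_signature, which expands a date column into many such features. A hypothetical extension of our recipe could look as follows; the generated feature names (such as date_time_wday) are produced by timetk, not defined by us:

```r
# Expand date_time into calendar features (year, month, week, day of week, ...)
brCalendar = recipe(totalVolume ~ date_time + temp + holiday + weather_main, trs) %>%
  step_timeseries_signature(date_time) %>%  # adds calendar columns derived from date_time
  step_dummy(all_nominal())                 # dummy-encode the resulting factor columns
```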

12.3 Model definition and training

Similar to the regression and classification chapters, we can define time series models by declaring the relevant engine. In the following, we define an ARIMA and a PROPHET model, as well as boosted versions of both, in which an XGBoost model captures the errors of the time series model. We also define a plain XGBoost regression model, which we will use later for comparison:

arima_reg = arima_reg() %>% set_engine("auto_arima")
prophet_reg = prophet_reg() %>% set_engine("prophet")
arima_reg_b = arima_boost() %>% set_engine("arima_xgboost")
prophet_reg_b = prophet_boost() %>% set_engine("prophet_xgboost")
xgbost_reg = boost_tree() %>% set_engine("xgboost") %>% set_mode("regression")
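These specifications rely on the engines' default settings. If needed, seasonality and other parameters can be set explicitly. For example, a PROPHET specification that forces yearly and weekly seasonality on could look like this (a sketch using modeltime's documented arguments, not part of the original analysis):

```r
# PROPHET with explicit seasonality settings instead of engine defaults
prophet_reg_seasonal = prophet_reg(
  seasonality_yearly = TRUE,  # model the yearly cycle explicitly
  seasonality_weekly = TRUE   # model the day-of-week cycle explicitly
) %>%
  set_engine("prophet")
```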

The corresponding workflow is as follows:

bw = workflow() %>%  add_model(arima_reg) %>% add_recipe(br)
bw
## ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: arima_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ARIMA Regression Model Specification (regression)
## 
## Computational engine: auto_arima

We can train the four different models as follows:

m1 = bw %>% fit(trs)
m2 = bw %>% update_model(prophet_reg) %>% fit(trs)
m3 = bw %>% update_model(arima_reg_b) %>% fit(trs)
m4 = bw %>% update_model(prophet_reg_b) %>% fit(trs)

Finally, we can build a non-time-series regression model for comparison. Below, we train an XGBoost model. We adjust the recipe in two ways. First, because XGBoost is not a time series model, it does not understand date-time variables, so we remove the variable date_time from the recipe. Second, since we expect consecutive observations to be correlated, we create a lagged version of our dependent variable with the step function step_lag. Because the first observation has no preceding value, the lagged column contains a missing value, which we impute with the median:

xgRecipe = recipe(totalVolume ~ temp + holiday + weather_main, trs) %>% 
  step_dummy(all_nominal()) %>% step_lag(totalVolume) %>% 
  step_impute_median(lag_1_totalVolume)
xgRecipe %>%  prep %>% juice %>% head
## # A tibble: 6 × 22
##    temp totalVolume holiday_Columbus.Day holiday_Independence.… holiday_Labor.D…
##   <dbl>       <dbl>                <dbl>                  <dbl>            <dbl>
## 1  266        34489                    0                      0                0
## 2  266.       41018                    0                      0                0
## 3  270.       28628                    0                      0                0
## 4  263.       81891                    0                      0                0
## 5  269.       61120                    0                      0                0
## 6  272.       68060                    0                      0                0
## # … with 17 more variables: holiday_Martin.Luther.King.Jr.Day <dbl>,
## #   holiday_Memorial.Day <dbl>, holiday_New.Years.Day <dbl>,
## #   holiday_None <dbl>, holiday_State.Fair <dbl>,
## #   holiday_Thanksgiving.Day <dbl>, holiday_Veterans.Day <dbl>,
## #   holiday_Washingtons.Birthday <dbl>, weather_main_Clouds <dbl>,
## #   weather_main_Drizzle <dbl>, weather_main_Fog <dbl>,
## #   weather_main_Haze <dbl>, weather_main_Mist <dbl>, …
m5 = bw %>% update_model(xgbost_reg) %>% update_recipe(xgRecipe) %>% fit(trs)
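To see why the imputation step is needed, note that lagging shifts the series forward by one position, so the first value has no predecessor. This can be illustrated with dplyr's lag on a toy vector:

```r
library(dplyr)

# Lagging a series leaves an NA in the first position
lag(c(10, 20, 30), n = 1)
## [1] NA 10 20
```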

12.4 Evaluation

The modeltime package provides the functions modeltime_table and modeltime_calibrate, which allow us to compare several models at once. Specifically:

mt = modeltime_table(m1, m2, m3, m4, m5)
ct = mt %>% modeltime_calibrate(ts)
ct
## # Modeltime Table
## # A tibble: 5 × 5
##   .model_id .model     .model_desc                       .type .calibration_data
##       <int> <list>     <chr>                             <chr> <list>           
## 1         1 <workflow> REGRESSION WITH ARIMA(0,1,2)(0,0… Test  <tibble [182 × 4…
## 2         2 <workflow> PROPHET W/ REGRESSORS             Test  <tibble [182 × 4…
## 3         3 <workflow> ARIMA(0,0,0) WITH NON-ZERO MEAN … Test  <tibble [182 × 4…
## 4         4 <workflow> PROPHET W/ XGBOOST ERRORS         Test  <tibble [182 × 4…
## 5         5 <workflow> XGBOOST                           Test  <tibble [182 × 4…

Now that we have the calibration table, we can call the function modeltime_accuracy, which returns accuracy metrics for each model. Since we are dealing with a regression problem, we are mainly interested in the MAE and the RMSE:

ct %>% modeltime_accuracy() %>% select(.model_id, .model_desc , mae, rmse) %>% 
  table_modeltime_accuracy(.interactive = T)

Finally, note that we can plot the forecast by calling the function plot_modeltime_forecast in combination with modeltime_forecast:

ct %>% modeltime_forecast(new_data = ts, actual_data = ts) %>% 
  plot_modeltime_forecast(.interactive = T)
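So far we have only forecast over the test period. To forecast genuinely future dates, modeltime suggests refitting the calibrated models on the full dataset and then forecasting over a chosen horizon. A sketch of that workflow, using the documented modeltime_refit function (the 3-month horizon is an arbitrary choice for illustration):

```r
# Refit all calibrated models on the full dataset, then forecast 3 months ahead
ct %>%
  modeltime_refit(d) %>%
  modeltime_forecast(h = "3 months", actual_data = d) %>%
  plot_modeltime_forecast(.interactive = T)
```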

For comments, suggestions, errors, and typos, please email me at: