12 Predictive modeling III: Time series
In this chapter we will discuss how we can model and predict time series problems.
For this analysis, we will use the new packages modeltime
and timetk
:
install.packages("timetk")
install.packages("modeltime")
library(timetk)
library(modeltime)
library(tidyverse)
library(tidymodels)
library(lubridate)
12.1 Data profiling
Let’s assume that we want to predict the total volume of traffic in an intersection for any given day in the future. This is a time series problem, because the total volume of traffic experiences fixed-frequency patterns. For instance, each day of the week experiences a baseline similar pattern. Months, quarters, and holidays also experience seasonality patterns. The dataset that we will use for this problem is stored in the trafficAggregated.csv
file.
= read_csv("../data/trafficAggregated.csv")
d %>% head d
## # A tibble: 6 × 6
## date_time totalVolume temp holiday weather_description weather_main
## <date> <dbl> <dbl> <chr> <chr> <chr>
## 1 2016-01-01 34489 266 New Years Day sky is clear Clear
## 2 2016-01-02 41018 266. None sky is clear Clear
## 3 2016-01-03 28628 270. None overcast clouds Clouds
## 4 2016-01-04 81891 263. None mist Mist
## 5 2016-01-05 61120 269. None overcast clouds Clouds
## 6 2016-01-06 68060 272. None overcast clouds Clear
To get a better view of the dataset, we can call the function plot_time_series
that will plot the linechart of our dependent variable:
%>% plot_time_series(date_time, totalVolume, .interactive = T) d
12.2 Feature engineering
Similar to what we did in the regression and classification chapters, we will need to split our data into a training and a testing set. However, because there is a time component in our data, our test set needs to be chronologically after the training set. Otherwise, we will be using information from the future to predict the past. To make sure that our training and test sets are correctly defined, the package modeltime
comes with the function time_series_split
. This function has a parameter assess
, which we can use to identify how many days, weeks, months, or years we want our test set to be.
= d %>% time_series_split(date_var = date_time, assess = "6 months", cumulative = T)
splits = training(splits)
trs = testing(splits) ts
Now we call the function tk_time_series_cv_plan
along with the plot_time_series_cv_plan
to visualize the training and the test sets that we have created:
%>% tk_time_series_cv_plan() %>% plot_time_series_cv_plan(date_time, totalVolume, .interactive = T) splits
Next, we can define the recipe of our time series models. On top of what we have learned so far, in this setting (time series) we also add the time component (date_time
) in our set of independent variables as follows:
= recipe(totalVolume ~ date_time + temp + holiday + weather_main, trs) %>%
br step_dummy(all_nominal())
%>% prep %>% juice %>% head br
## # A tibble: 6 × 22
## date_time temp totalVolume holiday_Columbus.Day holiday_Independence.Day
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2016-01-01 266 34489 0 0
## 2 2016-01-02 266. 41018 0 0
## 3 2016-01-03 270. 28628 0 0
## 4 2016-01-04 263. 81891 0 0
## 5 2016-01-05 269. 61120 0 0
## 6 2016-01-06 272. 68060 0 0
## # … with 17 more variables: holiday_Labor.Day <dbl>,
## # holiday_Martin.Luther.King.Jr.Day <dbl>, holiday_Memorial.Day <dbl>,
## # holiday_New.Years.Day <dbl>, holiday_None <dbl>, holiday_State.Fair <dbl>,
## # holiday_Thanksgiving.Day <dbl>, holiday_Veterans.Day <dbl>,
## # holiday_Washingtons.Birthday <dbl>, weather_main_Clouds <dbl>,
## # weather_main_Drizzle <dbl>, weather_main_Fog <dbl>,
## # weather_main_Haze <dbl>, weather_main_Mist <dbl>, …
12.3 Model definition and training
Similar to the regression and the classification chapters, we can define time series models by declaring the relevant engine. In the following, we define an ARIMA and a PROPHET model. In addition, we define an ARIMA with boosted trees and a PROPHET with boosted trees approaches.
modeltime
):
= arima_reg() %>% set_engine("auto_arima")
arima_reg = prophet_reg() %>%set_engine("prophet")
prophet_reg = arima_boost() %>% set_engine("arima_xgboost")
arima_reg_b = prophet_boost() %>% set_engine("prophet_xgboost")
prophet_reg_b = boost_tree() %>% set_engine("xgboost") %>% set_mode("regression") xgbost_reg
A respective workflow is as follows:
= workflow() %>% add_model(arima_reg) %>% add_recipe(br)
bw bw
## ══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: arima_reg()
##
## ── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ARIMA Regression Model Specification (regression)
##
## Computational engine: auto_arima
We can train the four different models as follows:
= bw %>% fit(trs)
m1 = bw %>% update_model(prophet_reg) %>% fit(trs)
m2 = bw %>% update_model(arima_reg_b) %>% fit(trs)
m3 = bw %>% update_model(prophet_reg_b) %>% fit(trs) m4
Finally, we can build a non-time-series regression model for comparison. Below, we train an XGBoost model. We adjust the recipe in two ways: First, because XGBoost is not a time series model, it does not understand date-time variables. Hence, we will have to remove variable date_time
from the recipe. Second, given that we expect consecutive observations to be correlated, we need to create a lagged version of our dependent variable. We can do this by calling the step function step_lag:
= recipe(totalVolume ~ temp + holiday + weather_main, trs) %>%
xgRecipe step_dummy(all_nominal()) %>% step_lag(totalVolume) %>%
step_medianimpute(lag_1_totalVolume)
## Warning: `step_medianimpute()` was deprecated in recipes 0.1.16.
## Please use `step_impute_median()` instead.
%>% prep %>% juice %>% head xgRecipe
## # A tibble: 6 × 22
## temp totalVolume holiday_Columbus.Day holiday_Independence.… holiday_Labor.D…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 266 34489 0 0 0
## 2 266. 41018 0 0 0
## 3 270. 28628 0 0 0
## 4 263. 81891 0 0 0
## 5 269. 61120 0 0 0
## 6 272. 68060 0 0 0
## # … with 17 more variables: holiday_Martin.Luther.King.Jr.Day <dbl>,
## # holiday_Memorial.Day <dbl>, holiday_New.Years.Day <dbl>,
## # holiday_None <dbl>, holiday_State.Fair <dbl>,
## # holiday_Thanksgiving.Day <dbl>, holiday_Veterans.Day <dbl>,
## # holiday_Washingtons.Birthday <dbl>, weather_main_Clouds <dbl>,
## # weather_main_Drizzle <dbl>, weather_main_Fog <dbl>,
## # weather_main_Haze <dbl>, weather_main_Mist <dbl>, …
= bw %>% update_model(xgbost_reg) %>% update_recipe(xgRecipe) %>% fit(trs) m5
12.4 Evaluation
The modeltime
package has two functions called modeltime_table
and modeltime_calibrate
that allow us to compare various models at once. Specifically:
= modeltime_table(m1, m2, m3, m4,m5)
mt = mt %>% modeltime_calibrate(ts)
ct ct
## # Modeltime Table
## # A tibble: 5 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <workflow> REGRESSION WITH ARIMA(0,1,2)(0,0… Test <tibble [182 × 4…
## 2 2 <workflow> PROPHET W/ REGRESSORS Test <tibble [182 × 4…
## 3 3 <workflow> ARIMA(0,0,0) WITH NON-ZERO MEAN … Test <tibble [182 × 4…
## 4 4 <workflow> PROPHET W/ XGBOOST ERRORS Test <tibble [182 × 4…
## 5 5 <workflow> XGBOOST Test <tibble [182 × 4…
Now that we have the table, we can call the function modeltime_accuracy
that will return the accuracy of each model. Since we are dealing with a regression problem, we will be interested in the MAE and the RMSE:
%>% modeltime_accuracy() %>% select(.model_id, .model_desc , mae, rmse) %>%
ct table_modeltime_accuracy(.interactive = T)
Finally, note that we can plot the forecast by calling the plot_modeltime_forecast
function in combination with the modeltime_forecast
:
%>% modeltime_forecast(new_data = ts, actual_data = ts) %>%
ct plot_modeltime_forecast(.interactive = T)
For comments, suggestions, errors, and typos, please email me at: kokkodis@bc.edu