Empirical Cumulative Distribution Functions

The use of quantile regression facilitates quantization of the prediction uncertainty by constructing empirical cumulative distribution functions (ECDF) at any year of a forecast. The empirical cumulative distribution function (ECDF) is determined by modeling the collection of quantile forecast results.

The data

The data I’ll be using for this blog post Total quarterly beer production in Australia (in megalitres) from 1956:Q1 to 2010:Q2. Data is only used to show case how to apply the functions I’ll be building and the limitations of current methodology to explain uncertainty.

Code part 1 - Packages and data

suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(fpp2))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(splines))
suppressPackageStartupMessages(library(quantreg))
suppressPackageStartupMessages(library(RColorBrewer))

We will create a split data for training and testing the data

We are leaving out two years for testing

beer <- window(ausbeer,start=1992,end=c(2007,4))

Forecast Parameter setting

step_h = 8

Fitting a simple Holt-Winters Model

See `?hw’ for more help

fit <- hw(beer,seasonal="multiplicative")

Looking at the data and its forecast

autoplot(window(ausbeer, start=1992)) +
    autolayer(fit, series="HW multiplicative forecasts", PI=FALSE) +
    xlab("Year") + ylab("Megalitres") +
    ggtitle("Forecasts for quarterly beer production") +
    guides(colour=guide_legend(title="Forecast"))

Forecasting the model

fit_frcst <- fit %>% forecast(h=8)

autoplot(fit_frcst)

Measuring the accuracy in the traditional sense

beer_future <- window(ausbeer, start=2008)
accuracy(fit, beer_future)
##                      ME     RMSE       MAE        MPE     MAPE      MASE
## Training set -0.5228136 12.24711  9.294418 -0.1409847 2.149997 0.6499593
## Test set      9.1624415 12.71675 11.308159  2.1263484 2.641392 0.7907803
##                     ACF1 Theil's U
## Training set -0.20924378        NA
## Test set      0.05629771  0.281231

Before we create uncertainty model, we need some pre-processing

some color managment for custom colors

rhg_cols <- c("#771C19","#AA3929","#E25033","#F27314","#F8A31B",
              "#E2C59F","#B6C5CC","#8E9CA3","#556670","#000000")

fixing the data structure

beer_df <- data.frame(Y=as.matrix(ausbeer), date=time(ausbeer))

Now, lets look at the uncertainty of the model and data

# First, we create a qunatile regression model 
#  the knot degree and degrees of freedom have to found for each data
qst <-rq(beer_df$Y ~ bs(beer_df$date, degree = 3, df=120), 
         tau = c(0.1, 0.25,0.5,0.75,0.9))

# Plotting the distribution the quantiles and the original data
suppressMessages(ggplot() + 
        geom_line(aes(y = beer_df$Y, x = beer_df$date)) +
        geom_line(aes(y = predict(qst)[,1], x = beer_df$date, color = "a")) +
        geom_line(aes(y = predict(qst)[,2], x = beer_df$date, color = "b")) +
        geom_line(aes(y = predict(qst)[,3], x = beer_df$date, color = "c")) +
        geom_line(aes(y = predict(qst)[,4], x = beer_df$date, color = "d")) +
        geom_line(aes(y = predict(qst)[,5], x = beer_df$date, color = "e")) +
        scale_color_manual(name = "Probability Distribution",
                            values = c("a"=rhg_cols[1],
                                       "b"=rhg_cols[2],
                                       "c"=rhg_cols[3],
                                       "d"=rhg_cols[4],
                                       "e"=rhg_cols[5]), 
                            labels = c("10% Probability",
                                       "25% Probability",
                                       "50% Probability",
                                       "75% Probability",
                                       "90% Probability")) +
        xlab("Timeline") +
        ylab("Beer Productions") +
        ggtitle(paste("Probability distribution of quaterly beer production of for all years")))

Now lets fully model future data

# Creating a function that forecasts each quantile of our regression line
qoh <- function(qst) {
        qohn <- c()
        for(i in 1:5){
                qohn[[i]] <- predict(qst)[,(i)] %>%
                        nnetar() %>%
                        forecast(h=step_h)
        }
        
        return(qohn)
}

# Running the function on our regression fit
qoh_out <- qoh(qst) 

# Combining the data into a single data frame
results_together <- do.call(rbind,
                            lapply(names(qoh_out),
                                   function(x){
                                     transform(
                                       as.data.frame(qoh_out[[x]]), 
                                       Name = x)
}))

# Creating our Y axis based on the quantiles selected in the model
y = c(.1,.25,.5,.75,.9)

# Selecting the time point we like to visualize
x = t(as.data.frame(qoh_out)[1,])

# putting the two points together in a single data frame
df = data.frame(y, x)
names(df) <- c("scale", "unc_window")

# plotting the uncertainty window of the selected time point by putting the data inside an Empirical Cumulative Distribution Function (ecdf)
ggplot(df, aes(unc_window)) + 
  stat_ecdf(geom = "step") +
        xlab("Beer Production") +
        ylab("Probability Quantiles") +
        ggtitle(paste("Probability distribution of beer production at 3rd quater 2010 forecast"))


# Storing the data for actual reporting
ecdf_fun <- ecdf(df$unc_window)
ecdf_dat <- data.frame("prob" = ecdf_fun(df$unc_window),
                          "data_dist" = df$unc_window)

What is it good for

You may ask yourself how this is helpful? Most of you encouter time series analysis where you are asked to predict something you think is simple, say the price of real estate. But is a simple output good enough? Outside of Kaggle competitions, the answer is no. In real life, you need a lot more and until now, that answer had been a simple confidence interval which comes with all forecast methods for 80% and 95% intervals. But now, we can actually give the probability bound of each time point along the forecast, which allows for much more accurate prediction interval.