How to forecast product sales in the R Tidyverse with Modeltime and Prophet

Anita Owens
9 min readNov 22, 2021

--

Tidyverse ecosystem + Facebook’s Prophet. Image by author.

Tidyverse has changed the game when it comes to data analysis in R. The Tidyverse ecosystem also includes tidy machine learning that makes exploring, analyzing and manipulating complex machine learning objects easy thanks in part to the List Column Workflow. I wanted to give forecasting in the Tidyverse a try using the modeltime package, developed by Matt Dancho, which was built for tidy time series forecasting which follows a streamlined workflow.

Modeltime Workflow: https://business-science.github.io/modeltime/ Image by author.

The Tidyverse machine learning workflow taps into many different machine learning models including the Prophet package created by Facebook’s Core Data Science team. Prophet is robust to trends in time series data including daily, weekly and yearly seasonality. Trend in time series data is a specific pattern which in which observations are either increasing or decreasing over time. Also, trends can constantly change. The Prophet package was designed to be robust to changing trends, missing data, and outliers so we will use this package to model and forecast our sales data.

This dataset is from Chapter 31 of Wayne L. Winston’s Marketing Analytics: Data-Driven Techniques with Microsoft Excel. We have three (3) years of daily sales of Post-its. Those lovable note snippets that hang around your desk.

According to 3M, Post-its have been around 40+ years, but took a while for consumers to catch on. Source: https://www.post-it.com/3M/en_US/post-it/contact-us/about-us/

We will do the following:

  1. Determine which factors influence sales
  2. Build a model to forecast daily sales

Explore dataset

Let’s import and check our data. We have 6 variables:

  1. Month (The month as a number)
  2. Day number
  3. Day (This is our date variable but since we imported from Excel we get the Excel formatted dates)
  4. Pricing for that particular day (7 different prices)
  5. Display (this is a binary variable. 1 indicates that our product was on a display for that day. 0 indicates that it was not on display).
  6. Actual sales

We have to do a bit of transformation to get our data in the right format. First, let’s create a new formatted date column as our data column inherited the Excel date formatting.

#Create New Formatted Date Column supplying origin argument
postit$new_date <- as.Date(postit$Day, origin = "1899-12-30")
#Check results
str(postit$new_date)
Reformatted date column

Let’s factor some variables.

#Create list of variables that need transformation
fac_vars <- c("Month", "display")
#Factor Month, Display and Price Variables
postit[,fac_vars] <- lapply(postit[,fac_vars], factor)
#Check results
str(postit)
Results of factoring

Let’s visualize our sales over time with plot_time_series() function from the timetk package. If you set .interactive = TRUE argument, then your plot is an interactive Plotly plot.

#From timetk package - visualize sales
postit %>%
plot_time_series(new_date, actualsales, .interactive = TRUE)
Time series plot of sales. Image by author.

Our plot of sales indicates that there is a positive trend of sales for our Post-its.

Task 1: What factors influence sales

First, let’s create some data visualizations of actual sales.

Plots: Histogram of sales, histogram of log(sales), boxplot of sales by month, and boxplot of sales by price. Image by author.

Sales are a bit skewed to the right, but looks normal after log transformation. Sales are lowest in March. Sales are generally higher in Q4 and the highest in December. December also appears to have the largest spread of sales. Some outliers during months August and September. Plotting sales by price reveals quite interesting insights. Sales are highest when the price is 5.95. Sales decrease when price increases to 6.1 and so and so forth where the lowest sales happen when the price is at its highest 7.52. There appears to be a relationship between price and sales.

Plot of sales by display and summary stats output of sales (mean, median and standard deviation).

More sales when there is a display. Average sales are 639 vs 587.

Now we have some idea of the factors that appear to influence sales, we can quantify the relationships with a linear regression model.

The first step of building a linear regression model is to build the model specification. We will build the model to fit actual sales as explained by only Month, Price and Display.

# Model Spec
model_spec_lm <- linear_reg() %>%
set_engine(“lm”)
# Build Linear Regression Model
model_fit_ln <- model_spec_lm %>%
fit(actualsales ~ Month + Price + display, data = postit)
#Print summary of model in a tidy object
lm_summary <- tidy(model_fit_lm) %>%
mutate(significant = p.value <= 0.05)
lm_summary
#If you prefer the classic R-style model output, you can use summary function to print fit to console
summary(model_fit_lm$fit)
Tidy output of linear regression model. Image by author.

A generic interpretation of our linear regression model. For every one unit increase in coefficient (term column), increases (if positive) or decreases (if negative) sales by the value of our estimate, while all other variables remain at the same level.

Most months are significant but December is positive and the most statistically significant. Time of year does appear to have an impact sales.

All price coefficients are negative and significant which means that all price levels (not on our intercept) reduces sales.

For display advertising, we can expect an increase of approximately 63 when our product is on display.

Evaluate linear regression model fit

#Use the glance function from the broom package to get additional information from the model (e.g. model statistics like r-squared)
glance(model_fit_lm)
Model statistics in a data frame. Image by author.

Assessing model fit gives us information that the model we built is reasonable enough for us to derive insights. As for the model statistics, we have a very low p-value and an r-squared of .795, which tells us that that 80% of the variation in sales is explained by our independent/explanatory variables.

We built this linear model for explanatory purposes and gaining insights into what factors drives sales. Now on to model building for the forecasting task.

Task 2: Build a forecast model for sales

Step 1: Split dataset into test and training data

We will split our dataset into test and training data. We will use the last 12 months of the dataset as the training set.

#Step 1: Split data into test and training set
splits <- postit %>%
time_series_split(date_var = new_date, assess = "12 months", cumulative = TRUE)
#Visualize test train split
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(new_date, actualsales, .interactive = TRUE)
Time Series Cross Validation Plot

Step 2: We want to define our model specification and fit the model. We will be using prophet.

# Step 2: Model Specification and Model fit
model_spec_prophet <- prophet_reg() %>%
set_engine("prophet")
Prophet model output

Step 3: Put model into modeltime table

#Step 3: Put model into a modeltime table
models_tbl <- modeltime_table(model_fit_prophet)
models_tbl

Step 4: Calibrate model

Calibration step computes the forecast predictions and residuals from the test set.

#Step 4: Calibrate model
calibration_tbl <- models_tbl %>%
modeltime_calibrate(new_data = testing(splits))
#Can also print calibration model to console
calibration_tbl

Step 5: Get accuracy metrics

#Step 5: Get Accuracy Metrics
calibration_tbl %>%
modeltime_accuracy()
Accuracy measures in a tidy data frame. Image by author.

When determining error and accuracy, there are several different measures to evaluate forecasts, but I will focus on the following two (2):

Mean Absolute Error (MAE): how far away our predictions from the actual value in absolute terms and is on the same scale as the data.

Mean Absolute Percentage Error (MAPE): how far away our predictions from the actual value in absolute percentage terms so the prediction values are not dependent on scale.

I pay attention to the MAPE as it is often used in business to measure forecasts. Percentage error = (actual — forecast) / actual.

The MAE tells us how off we are by unit sales and the MAPE tells us we are off our forecasts are by percentage amount. This helps to put our forecasts in better context for business stakeholders.

#Plot the residuals
# Out-of-Sample data (test set)
#Change new_data argument if you want to plot in-sample residuals (training set). Timeplot is the default but can change to acf or seasonality plot.
calibration_tbl %>%
modeltime_calibrate(new_data = testing(splits)) %>%
modeltime_residuals() %>%
plot_modeltime_residuals(.interactive = TRUE, .type = "timeplot")

We want our time series residuals to hover around zero. Everything seems okay until November and December as we start to see more variability in the residuals.

Step 6: Create future forecast on test set

#Step 6: Create future forecast on test data
(forecast_tbl <- calibration_tbl %>%
modeltime_forecast(
new_data = testing(splits),
actual_data = postit,
keep_data = TRUE #Includes the new data (and actual data) as extra columns with the results of the model forecasts
))
future forecast data frame. Image by author.

When you open the tibble (data frame) for the forecast model, the .key column indicates whether the .value column is the actual value or the prediction value. This model’s predictions starts on row 1097.

Step 7: Plot modeltime forecast on the test data

#Step 7: Plot modeltime forecast - this is the test data
plot_modeltime_forecast(forecast_tbl)
Modeltime forecast plot. Image by author.

Forecast sales into the future

If you have a time series data that is univariate, you can simply use the modeltime_forecast() function and specify the horizon (h) argument to get a forecast into the future, but since our postit data has xregs (independent regressors), we must create a new dataf rame with the xregs needed to fit the future values.

Here, I created a new data frame to hold our future forecasts. The code to create this new data frame is pretty lengthy so I will omit it here, but can be found on Github. Notice the date starts on the first of January 2014.

#Check results
str(explanatory_data)
Structure of new data frame

CAUTION: It is super important that the data in your new data frame matches the exact same formatting, same column names, and be of the same object types of the of the data used in building the forecast. If errors occur during either the model fit or forecasting phase, differently formatted data or column names could be a possible culprit.

Create future forecast.

#First, refit to the full dataset
refit_tbl <- calibration_tbl %>%
modeltime_refit(data = postit)
#Specify the H or horizon argument to get a forecast into the future, but if using xregs (independent regressors), you must create a new tibble/data frame with the xregs to be used.#Forecast on the new tibble data frame
forecast_tbl_future_data <- refit_tbl %>%
modeltime_forecast(
new_data = explanatory_data
)
Tibble of future forecasts. Image by author.

Tibble of the forecasts gives us the predicted values with the confidence intervals.

#plot forecast
plot_modeltime_forecast(forecast_tbl_future_data, .interactive = TRUE)
Forecast plot. Image by author.

How do we know if our forecasts are any good?

As the forecaster, it’s your job to reduce error, but most importantly you need a benchmark to determine how much error is acceptable and gradually reduce the error of your forecasts over time.

One easy way is to use several different methods and average your forecasts. I have found that by adding more historical data to the testing phase reduced my forecast error by a percentage reduction of 15%. I am now starting to add value to the business.

Final Summary

1: What factors influence sales

Price (More sales at the lowest price point. Higher prices reduces sales.)

Display (More sales when product is on display)

Month of year (December the highest. March the lowest).

2: Build a forecast model for sales

Our forecasts in a nice table and bar chart

Alternatively, for business stakeholders, you could plot your forecasts against actual sales for comparison in a dual line chart.

Forecasts plotted against actual sales. Image by author.

Alternatively, you could also create forecasts of different scenarios and plot against actual sales data especially if you need to build a business case to implement your recommendations.

Full code can be found on Github and Rpubs.

Final Thoughts

I really enjoyed performing time series analysis using Modeltime package. As a Tidyverse user, this way of working with machine learning models is streamlined, functional and flexible plus the added benefit of being able to compare many models at once.

References:

[1] Winston, W. L. (2014). Marketing analytics: Data-driven techniques with Microsoft Excel. Wiley.

[2] Forecasting: Principles and Practice (3rd ed) Rob J Hyndman and George Athanasopoulos, Monash University, Australia. https://otexts.com/fpp3/

[3] https://business-science.github.io/modeltime/

[4] https://facebook.github.io/prophet/

--

--

Anita Owens

Analytics engineer, mentor and lecturer in analytics. The glue person bridging the gap between data and the business. https://www.linkedin.com/in/anitaowens/