How to forecast with two-way ANOVA in R
A powerful stats technique to have in your forecasting toolbox
In a previous article, I walked through the steps of forecasting with one-way ANOVA (analysis of variance). In that scenario, we forecast book sales based on shelf location (one factor). The forecast depended on whether there was a significant relationship between the factor (shelf location) and the outcome (sales).
However, what if you have two factors? You can use two-way ANOVA for forecasting (even three-way ANOVA!), but things get a little more complicated due to 1) replication and 2) interaction.
Replication
To analyze the effect of two factors (x) on one outcome variable (y), the outcome needs to be observed the same number of times (referred to as k) for each combination of levels of the two factors.
If k = 1, it is called two-way ANOVA without replication. For instance, if we are trying to determine whether two factors, such as sales territory and sales rep, influence sales, and we have only one observation for each pairing of a sales rep with a sales territory, then k = 1. This is also called a randomized block design.
If k > 1, then it’s two-way ANOVA with replication. If we want to know whether advertising and price influence product sales, we can observe product sales several times at each combination of advertising and price levels.
ANOVA replication scenarios (k = number of observations per combination of factor levels)
k = 1: without replication
k > 1: with replication
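As a rough illustration of how k can be read off a dataset, the sketch below counts the observations in each combination of factor levels. The data frame `toy` and its values are made up for this example and are not the article's data.

```r
# Made-up toy data (not the article's dataset): 2 x 2 design, 3 observations per cell
toy <- data.frame(
  advertising = rep(c("ad", "no_ad"), each = 6),
  coupon      = rep(rep(c("coupon", "no_coupon"), each = 3), times = 2),
  sales       = c(60, 62, 64, 50, 53, 56, 40, 41, 45, 30, 34, 35)
)

# Count the observations for every combination of factor levels
cell_counts <- table(toy$advertising, toy$coupon)
print(cell_counts)

# The design is balanced when every cell has the same count; that count is k
balanced <- length(unique(as.vector(cell_counts))) == 1
k <- if (balanced) cell_counts[1, 1] else NA
cat("Balanced design:", balanced, "| k =", k, "\n")
```

Here every cell holds three observations, so this toy design would be two-way ANOVA with replication.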
Interaction
For our two factors, we also have to consider whether they interact. In other words, does the effect of one factor on the outcome depend on the level of the other factor? For example, advertising might lift sales when the price is low but have little effect when the price is high.
ANOVA Assumptions
Like all statistical tests, ANOVA relies on some assumptions that need to hold before we can use it. The assumptions are:
1. Data comes from random samples
2. The observations are independent
3. Normally distributed underlying data
4. Homogeneity of variances
However, what we really care about for ANOVA to be valid for forecasting is that the residuals, or forecast errors (actual sales - predicted sales), are normally distributed and that the variance of the residuals is the same for each group.
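These two residual conditions can also be checked with formal tests. The sketch below is one possible approach, not necessarily the one used in this analysis: it applies Bartlett's test for equal variances and the Shapiro-Wilk test for normality (both from base R's stats package) to the residuals of a model fitted on made-up data.

```r
# Made-up toy data (not the article's dataset): 2 x 2 design, 3 observations per cell
toy <- data.frame(
  advertising = factor(rep(c("ad", "no_ad"), each = 6)),
  coupon      = factor(rep(rep(c("coupon", "no_coupon"), each = 3), times = 2)),
  sales       = c(60, 62, 64, 50, 53, 56, 40, 41, 45, 30, 34, 35)
)

fit <- aov(sales ~ advertising * coupon, data = toy)
res <- residuals(fit)

# Equal variances: Bartlett's test across the four factor-level combinations
groups <- interaction(toy$advertising, toy$coupon)
homogeneity <- bartlett.test(res, groups)

# Normality: Shapiro-Wilk test on the residuals
normality <- shapiro.test(res)

cat("Bartlett p-value:    ", round(homogeneity$p.value, 3), "\n")
cat("Shapiro-Wilk p-value:", round(normality$p.value, 3), "\n")
```

In both tests a large p-value means there is no evidence against the assumption. With very small samples these tests have little power, so it is worth looking at the plots as well.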
Also, let’s define our null and alternative hypotheses for two-way ANOVA:
H₀: the means of factor A are equal
Ha: the means of factor A are different
H₀: the means of factor B are equal
Ha: the means of factor B are different
H₀: there is no interaction between factors A&B
Ha: there is an interaction between factors A&B
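For readers who prefer notation, the three null hypotheses can be written against the standard two-way effects model. The symbols below are my own notation, not taken from the original analysis: \mu is the overall mean, \alpha_i and \beta_j are the effects of level i of factor A and level j of factor B, (\alpha\beta)_{ij} is the interaction, and k indexes the replicates.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
  \qquad \varepsilon_{ijk} \sim N(0, \sigma^2) \\
H_0^{A} &: \alpha_1 = \alpha_2 = \dots = \alpha_a = 0 \\
H_0^{B} &: \beta_1 = \beta_2 = \dots = \beta_b = 0 \\
H_0^{AB} &: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\end{align*}
```

Each alternative hypothesis states that at least one of the corresponding terms is nonzero. With k = 1 there is only one observation per cell, so the interaction term cannot be separated from the error term.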
Analysis: Two-way ANOVA with replication scenario
Do advertising and coupons impact sales? We will forecast peanut butter sales (y) with advertising and coupon as our independent variables or factors (x’s). This analysis assumes that only advertising and coupons impact sales.
Let’s take a look at the dataset.
We have 3 columns: 1) Advertising, 2) Coupon, and 3) No coupon. The two coupon columns hold our sales data in units. However, we need to turn our wide dataset into a long dataset for further analysis using the pivot_longer() function from the tidyr package.
#Load packages: dplyr for the %>% pipe, tidyr for pivot_longer()
library(dplyr)
library(tidyr)
#Pivot data from wide to long dataset
(peanutbutter_long <- peanutbutter %>%
  pivot_longer(!advertising, names_to = "coupon", values_to = "sales"))
Let’s visualize our sales data using ggline() from the ggpubr package.
Now we complete our two-way ANOVA using the aov() function. Here, we are testing sales by main effect of advertising and coupon plus the interaction between advertising and coupon.
two_way_aov <- aov(sales ~ advertising*coupon, data = peanutbutter_long)
#The anova table
summary(two_way_aov)
Advertising is highly significant (p-value less than 0.001), while coupon is significant (p-value of 0.0467). The advertising:coupon term does not show a significant interaction, as its p-value is large at 0.8006. This means that advertising and coupons influence sales independently of each other. Since there is no interaction between our two factors, our forecast would be:
Predicted sales = overall average + Factor A effect (if significant) + Factor B effect (if significant).
NOTE: If a factor is not significant then the factor effect is assumed to be 0.
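A minimal sketch of this formula on made-up data, reading each factor effect as the mean at that factor level minus the overall average (the usual definition in an additive model). The data frame `toy` is hypothetical, and the cross-check with predict() assumes a balanced design.

```r
# Made-up toy data (not the article's dataset): balanced 2 x 2 design, 3 observations per cell
toy <- data.frame(
  advertising = factor(rep(c("ad", "no_ad"), each = 6)),
  coupon      = factor(rep(rep(c("coupon", "no_coupon"), each = 3), times = 2)),
  sales       = c(60, 62, 64, 50, 53, 56, 40, 41, 45, 30, 34, 35)
)

overall_avg <- mean(toy$sales)

# Factor effect = mean at that factor level minus the overall average
adv_effects    <- tapply(toy$sales, toy$advertising, mean) - overall_avg
coupon_effects <- tapply(toy$sales, toy$coupon, mean) - overall_avg

# Forecast for advertising plus coupon, treating both factors as significant
manual_forecast <- overall_avg + adv_effects[["ad"]] + coupon_effects[["coupon"]]

# Cross-check: fitted value from the additive (no-interaction) model
additive_fit <- aov(sales ~ advertising + coupon, data = toy)
model_forecast <- predict(
  additive_fit,
  newdata = data.frame(advertising = "ad", coupon = "coupon")
)

cat("Manual forecast:", manual_forecast, "\n")
cat("Model forecast: ", unname(model_forecast), "\n")
```

In a balanced design the two numbers agree, which is a quick way to confirm that the effects were computed from the right baseline.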
Model Diagnostics
The residuals can be plotted using a kernel density plot, but if you have a small dataset, it may be more useful to use a quantile-quantile plot (QQ plot). If the residuals are normally distributed, the observations should fall along the diagonal line.
This QQ plot makes it clear that a few observations in the tails fall off the line, which means we have a bit of skew in the residuals. Since most of the observations fall along the diagonal, I think we are good here, and the point predictions should be fine to use, although the prediction intervals may be quite wide.
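Both plots can be drawn with base R graphics. The sketch below uses a model fitted on made-up data, so the shapes will differ from the plot discussed above.

```r
# Made-up toy data (not the article's dataset): 2 x 2 design, 3 observations per cell
toy <- data.frame(
  advertising = factor(rep(c("ad", "no_ad"), each = 6)),
  coupon      = factor(rep(rep(c("coupon", "no_coupon"), each = 3), times = 2)),
  sales       = c(60, 62, 64, 50, 53, 56, 40, 41, 45, 30, 34, 35)
)

fit <- aov(sales ~ advertising * coupon, data = toy)
res <- residuals(fit)

# Kernel density plot of the residuals
plot(density(res), main = "Residual density")

# QQ plot: points close to the reference line suggest roughly normal residuals
qqnorm(res, main = "QQ plot of residuals")
qqline(res)
```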
What can we expect in sales? (Here, I have created a table of mean sales for each factor level.)
Advertising tends to increase sales by about 78 (158.33 - 80.67) over no advertising. A coupon tends to increase sales by 21 (130 - 109) over no coupon.
What can we expect in sales when there is both advertising and a coupon?
First, calculate the overall average.
#Calculate the overall average
overall_avg <- mean(peanutbutter_long$sales)
paste("The overall sales average is", overall_avg, sep=" ")
Output: The overall sales average is 119.5
Second, calculate the advertising effect: mean sales with advertising minus the overall average. Each effect is measured from the overall average, not from the other factor level, because the overall average already sits halfway between the two levels; adding the full level-to-level difference would overshoot.
#Calculate advertising effect
adv_effect <- (158.33 - overall_avg)
Next, calculate the coupon effect (mean sales with a coupon minus the overall average).
#Calculate coupon effect
coupon_effect <- (130 - overall_avg)
Finally, calculate the predicted value.
#Calculate Forecast:
predicted_sales <- overall_avg + adv_effect + coupon_effect
paste("Forecast when there is both advertising and a coupon is:", round(predicted_sales, 2), sep = " ")
Output: Forecast when there is both advertising and a coupon is: 168.83
What if there is an interaction effect?
Here is an example of a dataset that has an interaction effect. We have several weeks of video game sales (y) with advertising and price as our independent variables or factors (x’s).
Let’s visualize this dataset:
Advertising is on the x-axis, sales on the y-axis, price levels are represented via the lines.
To interpret what’s going on, let’s first take a look at the grey line. The grey line represents the high price level, and it stays fairly flat across all advertising levels (the x-axis). Both the yellow line (medium price level) and the blue line (low price level) show an increase in sales when we move from medium advertising to high advertising. This chart makes it clear that sales increase when ad spending increases (blue and yellow lines), but not when prices are high (the grey line).
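A chart of this kind can also be sketched with base R's interaction.plot(), which I use here in place of ggline() to avoid extra packages. The data frame `games_toy` is made up and has only two levels per factor, so it is a simplified stand-in for the dataset described above.

```r
# Made-up toy data (not the article's dataset): 2 advertising levels x 2 price levels
games_toy <- data.frame(
  advertising = factor(rep(c("low", "high"), each = 6), levels = c("low", "high")),
  price       = factor(rep(rep(c("low", "high"), each = 3), times = 2),
                       levels = c("low", "high")),
  sales       = c(28, 30, 32, 19, 20, 21, 49, 51, 53, 20, 21, 22)
)

# Mean sales per combination: rows are advertising levels, columns are price levels
cell_means <- with(games_toy, tapply(sales, list(advertising, price), mean))
print(cell_means)

# One line per price level; clearly non-parallel lines hint at an interaction
with(games_toy, interaction.plot(
  x.factor = advertising, trace.factor = price, response = sales,
  xlab = "Advertising", ylab = "Mean sales", trace.label = "Price"
))

# Change in mean sales from low to high advertising, for each price level
slopes <- cell_means["high", ] - cell_means["low", ]
print(slopes)
```

In this toy data, advertising lifts mean sales by 21 units at the low price but by only 1 unit at the high price, which is the non-parallel pattern an interaction produces.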
Now we complete our two-way ANOVA using the aov() function. Here, we are testing sales by main effect of advertising and price, plus the interaction between advertising and price.
aov(sales ~ advertising*price, data = games_long)
The summary output:
Since the p-value for the interaction (advertising:price) is very small (a significant interaction effect), we forecast sales for any combination of price and advertising as the mean of the observations having that combination of factor levels.
What can we expect in sales?
If we wanted to forecast sales when advertising is low and price is low, the forecast would be 29.67 according to our table. If advertising is high and price is low, the forecast would be 51.
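A forecast table of this kind can be built with aggregate(). The sketch below uses made-up data (`games_toy` is hypothetical, with two levels per factor) and also shows that the fitted values of the model with the interaction term are the same cell means.

```r
# Made-up toy data (not the article's dataset): 2 advertising levels x 2 price levels
games_toy <- data.frame(
  advertising = factor(rep(c("low", "high"), each = 6), levels = c("low", "high")),
  price       = factor(rep(rep(c("low", "high"), each = 3), times = 2),
                       levels = c("low", "high")),
  sales       = c(28, 30, 32, 19, 20, 21, 49, 51, 53, 20, 21, 22)
)

fit <- aov(sales ~ advertising * price, data = games_toy)
print(summary(fit))

# Forecast table: mean sales for each advertising/price combination
forecast_table <- aggregate(sales ~ advertising + price, data = games_toy, FUN = mean)
print(forecast_table)

# Look up the forecast for high advertising and low price
lookup <- forecast_table$sales[
  forecast_table$advertising == "high" & forecast_table$price == "low"
]

# The model with the interaction term returns the same cell mean
model_forecast <- predict(fit, newdata = data.frame(advertising = "high", price = "low"))
cat("Table forecast:", lookup, "| Model forecast:", unname(model_forecast), "\n")
```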
All that is left to do is check the residuals to confirm that our forecasts are valid.
Key Takeaways
The forecast equation we can use to predict with two-way ANOVA (with or without replication) is as follows:
predicted value = overall average + factor A effect (if significant) + factor B effect (if significant)
If a factor is not significant then the factor effect is assumed to be 0.
For two-way ANOVA with replication, if the interaction effect is significant, then the predicted value of the response variable (y) is equal to the mean of all observations having that combination of factor levels. If the interaction effect is not significant, you can proceed with your analysis as if it were a two-way ANOVA without replication scenario.
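The decision rule above can be sketched as a small function. `forecast_two_way()` is a hypothetical helper written for this illustration, not part of any package. It assumes a balanced design with replication, columns named `y`, `A` and `B`, and a 0.05 significance level, and it measures each factor effect as the level mean minus the overall average.

```r
# Hypothetical helper (not from any package): choose the forecast rule from the ANOVA table.
# Assumes a balanced design with replication and columns named y, A and B.
forecast_two_way <- function(data, a_level, b_level, alpha = 0.05) {
  fit <- aov(y ~ A * B, data = data)
  p <- summary(fit)[[1]][["Pr(>F)"]]  # rows in order: A, B, A:B, Residuals
  grand <- mean(data$y)

  if (p[3] < alpha) {
    # Significant interaction: forecast is the mean of the matching cell
    return(mean(data$y[data$A == a_level & data$B == b_level]))
  }

  # No significant interaction: add only the significant main effects
  a_effect <- if (p[1] < alpha) mean(data$y[data$A == a_level]) - grand else 0
  b_effect <- if (p[2] < alpha) mean(data$y[data$B == b_level]) - grand else 0
  grand + a_effect + b_effect
}

# Made-up data with additive effects (no interaction)
additive <- data.frame(
  A = factor(rep(c("ad", "no_ad"), each = 6)),
  B = factor(rep(rep(c("coupon", "no_coupon"), each = 3), times = 2)),
  y = c(60, 62, 64, 50, 53, 56, 40, 41, 45, 30, 34, 35)
)

# Made-up data with a strong interaction
interacting <- data.frame(
  A = factor(rep(c("low_ad", "high_ad"), each = 6)),
  B = factor(rep(rep(c("low_price", "high_price"), each = 3), times = 2)),
  y = c(28, 30, 32, 19, 20, 21, 49, 51, 53, 20, 21, 22)
)

f_additive    <- forecast_two_way(additive, "ad", "coupon")
f_interacting <- forecast_two_way(interacting, "high_ad", "low_price")
cat("Additive case:", f_additive, "| Interaction case:", f_interacting, "\n")
```

For an unbalanced design, aov() reports sequential sums of squares, so the p-values depend on the order of the terms and this sketch would need more care.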
If there is seasonality in your data, you can incorporate seasonality and proceed with two-way ANOVA for forecasting.
The code and all datasets can be found on Github and RPubs below:
- Forecasting with Two-Way ANOVA in R — With Replication and No Interaction Github RPubs
- Forecasting with Two-Way ANOVA in R — When interaction is absent Github RPubs
- Forecasting with Two-Way ANOVA in R — When interaction is present Github RPubs
References
[1] Winston, W. L. (2014). Marketing Analytics: Data-Driven Techniques with Microsoft Excel (pp. 607–617). Wiley.
[2] Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.