How to use one-way ANOVA for forecasting in R
The statistical test that can also forecast
Imagine my surprise when I learned that Analysis of Variance (ANOVA), best known for hypothesis testing, can also be used to forecast sales. It really shouldn’t be a surprise, though, as ANOVA and regression are closely related.
ANOVA helps answer the question: Are our groups really different? It compares the variation between the group means with the variation within the groups, which lets us make inferences about those means.
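To make the idea of comparing variation concrete, here is a minimal sketch that computes the ANOVA F statistic by hand and checks it against R's built-in `oneway.test()`. The numbers and the group labels A, B, and C are made up for illustration and are not from the book's dataset.

```r
# Hypothetical toy data: three groups of four observations each
sales <- c(10, 12, 11, 13,  14, 15, 13, 16,  20, 19, 21, 22)
group <- factor(rep(c("A", "B", "C"), each = 4))

k <- nlevels(group)   # number of groups
n <- length(sales)    # total number of observations

grand_mean  <- mean(sales)
group_means <- tapply(sales, group, mean)
group_sizes <- tapply(sales, group, length)

# Between-group sum of squares: how far the group means sit from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group sum of squares: spread of observations around their own group mean
ss_within <- sum((sales - group_means[as.character(group)])^2)

# F = between-group mean square / within-group mean square
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Built-in test for comparison
f_builtin <- unname(oneway.test(sales ~ group, var.equal = TRUE)$statistic)

all.equal(f_by_hand, f_builtin)
```

A large F means the group means are spread out more than the within-group noise would suggest, which is what leads to a small p-value.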
I will forecast product sales using one-way ANOVA. It is “one-way” because we use only a single independent variable to forecast; analyzing by two independent variables would be a two-way ANOVA. This dataset is from Chapter 40 of Wayne L. Winston’s Marketing Analytics: Data-Driven Techniques with Microsoft Excel. It contains 6 weeks of sales data for a computer book that was placed at different shelf positions in the store.
This dataset has 3 columns (Front, Back, and Middle shelf location) with total sales on the rows.
ANOVA will help answer the question: Does shelf location impact book sales and what can we expect in total sales?
Explore Dataset
First, we need to reshape the data before we can run any statistical analysis. Right now our dataset is wide, and we want it to be long so that each row is a single observation: one week at one location. We will first add a week number based on the row number and then use the pivot_longer() function from tidyr.
library(dplyr)
library(tidyr)
# Add a week number, then reshape from wide to long
books <- bookstore %>%
  mutate(week_num = row_number()) %>%
  pivot_longer(!week_num, names_to = "location",
               values_to = "sales")
books
Note that the data contains some missing values. We will leave them in place and let each function skip them, rather than deleting the rows or imputing (replacing) the values.
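Since the raw data is not shown here, the reshaping step can be sketched on a small stand-in. The data frame `bookstore_demo` and its values are hypothetical; only the column names follow the description above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the wide `bookstore` data (values are made up)
bookstore_demo <- data.frame(
  Front  = c(7, 10, 8, NA),
  Middle = c(12, 13, 15, 11),
  Back   = c(9, 11, NA, 10)
)

# Same reshaping steps as above: one row per week per location
books_demo <- bookstore_demo %>%
  mutate(week_num = row_number()) %>%
  pivot_longer(!week_num, names_to = "location", values_to = "sales")

dim(books_demo)               # 4 weeks x 3 locations gives 12 rows and 3 columns
sum(is.na(books_demo$sales))  # missing values survive the reshape as NA
```

The missing values stay in the long table as `NA`, so later steps have to handle them, for example with `na.rm = TRUE`.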
Next, let’s do some exploratory data visualization. We’ll plot sales by shelf location using a box plot to examine the distributions.
library(ggplot2)
library(ggthemes)
# Set the Wall Street Journal theme for all plots
theme_set(theme_wsj())
ggplot(data = books, aes(x = location, y = sales)) +
  geom_boxplot(na.rm = TRUE) +
  ggtitle("Book sales by shelf location")
Sales look higher when the book is located at the back. We also have a few outliers, shown as black dots outside the boxes.
ANOVA
Like all statistical tests, ANOVA relies on some assumptions that must hold for its results to be valid. For ANOVA, the assumptions are:
1. Data comes from random samples
2. The observations are independent
3. Normally distributed underlying data
4. Homogeneity of variances
Let’s define our null and alternative hypotheses:
H₀: the means across groups are equal
Ha: at least one group mean is different
ANOVA assumptions
To use ANOVA, we have to check that our underlying data meets the four ANOVA assumptions outlined above. The assumptions we are most concerned with are normality and homogeneity of variances. We will check normality with the Shapiro-Wilk test and homogeneity of variance with the Bartlett test.
# Test normality across groups (Shapiro)
tapply(books$sales, books$location, FUN = shapiro.test)
All p-values are well above 0.05 across all shelf positions, so we find no evidence against normality and can proceed as if the underlying data is normally distributed.
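The `tapply()` call prints a full test report per group. If only the p-values are of interest, they can be pulled out as in the sketch below. The data here is simulated (hypothetical), so the actual p-values will differ from those of the book's dataset.

```r
set.seed(1)
# Hypothetical long-format data: 3 locations x 6 weeks of simulated sales
demo <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = rnorm(18, mean = 1000, sd = 100)
)

# One Shapiro-Wilk p-value per location
shapiro_p <- sapply(split(demo$sales, demo$location),
                    function(x) shapiro.test(x)$p.value)
shapiro_p

# TRUE where the test gives no evidence against normality at the 5% level
shapiro_p > 0.05
```

With only six observations per group the Shapiro-Wilk test has little power, so a large p-value is weak reassurance rather than proof of normality.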
Let’s check homogeneity of variance.
# Check the homogeneity of variance (Bartlett)
bartlett.test(sales ~ location, data = books)
The Bartlett test p-value is also well above 0.05, so we can assume homogeneity of variance. We can now continue with a one-way ANOVA.
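Had the Bartlett test rejected equal variances, one option would be Welch's version of the test, which `oneway.test()` runs when `var.equal = FALSE` (its default). The sketch below lets the Bartlett result choose between the two. The data is simulated with deliberately unequal spread, and the 0.05 cut-off is an assumption carried over from the rest of this article.

```r
set.seed(42)
# Hypothetical data with deliberately unequal spread across locations
demo <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c(rnorm(6, 1000, 20), rnorm(6, 1000, 100), rnorm(6, 1000, 300))
)

bartlett_p <- bartlett.test(sales ~ location, data = demo)$p.value

# Classic one-way ANOVA when variances look equal, Welch's version otherwise
equal_var <- bartlett_p > 0.05
result <- oneway.test(sales ~ location, data = demo, var.equal = equal_var)

result$method   # states which variant was run
result$p.value
```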
Perform one-way ANOVA
# Perform one-way ANOVA
(anova_results <- oneway.test(sales ~ location, data = books, var.equal = TRUE))
# Extract p-value
anova_results$p.value < 0.05 # If TRUE, at least one mean differs. If FALSE, there is no evidence that mean sales differ across shelf positions.
INTERPRETATION OF ONE-WAY ANOVA RESULT:
The p-value of the test is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis that the means across groups are equal. In other words, we cannot conclude that sales differ significantly across shelf positions.
Forecast
Because we found no significant difference between the groups, the predicted mean for each group is the overall mean. The forecast will be weekly sales irrespective of shelf location.
# Forecast: overall mean of sales, ignoring missing values
mean(books$sales, na.rm = TRUE)
We can expect book sales of 1,120 per week.
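The link between ANOVA and regression mentioned at the start can be sketched with `lm()`. An intercept-only regression predicts the overall mean for everyone, a regression on location predicts the group means, and the regression F-test matches the one-way ANOVA F-test. The toy data below is hypothetical.

```r
# Hypothetical toy data (not the book's dataset)
demo <- data.frame(
  location = factor(rep(c("Front", "Middle", "Back"), each = 4)),
  sales    = c(9, 10, 11, 10,  12, 11, 13, 12,  15, 14, 16, 15)
)

# Intercept-only regression: every prediction is the overall mean
fit_null <- lm(sales ~ 1, data = demo)
coef(fit_null)[["(Intercept)"]]
mean(demo$sales)

# Regression on location: predictions are the group means
fit_loc <- lm(sales ~ location, data = demo)
lv <- levels(demo$location)
pred <- predict(fit_loc, newdata = data.frame(location = factor(lv, levels = lv)))
names(pred) <- lv
pred

# The regression F-test matches the one-way ANOVA F-test
f_lm  <- anova(fit_loc)[["F value"]][1]
f_aov <- unname(oneway.test(sales ~ location, data = demo,
                            var.equal = TRUE)$statistic)
all.equal(f_lm, f_aov)
```

Seen this way, the ANOVA result simply tells us which of the two regression models to forecast with.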
Scenario 2: What if the groups have significantly different means?
Let’s try again with a different scenario. Let’s take a look at a new dataset with 6 weeks’ worth of sales data for our computer book. Just like our first dataset, this data is originally wide and we transform it to a long format.
Let’s plot this dataset on a box plot.
Within this dataset, we see noticeably more sales for books located at the back compared with the front and middle locations.
ANOVA Assumptions (Scenario 2)
# Test normality across groups
tapply(books2$sales, books2$location, FUN = shapiro.test)
# Check the homogeneity of variance
bartlett.test(sales ~ location, data = books2)
The Bartlett test p-value is also very large, so we can again assume homogeneity of variance.
Perform One-way ANOVA again
# Perform one-way ANOVA
(anova_results2 <- oneway.test(sales ~ location, data = books2, var.equal = TRUE))
# Extract p-value
anova_results2$p.value < 0.05 # If TRUE, reject the null hypothesis: at least one mean differs. If FALSE, there is no evidence that mean sales differ across shelf positions.
INTERPRETATION OF ONE-WAY ANOVA RESULT:
The p-value is very small (0.003426), so we reject the null hypothesis and conclude that sales are significantly different across shelf positions.
Forecast (when we have significantly different means)
The predicted mean for each group equals the group mean.
We can expect weekly sales of 900 books when the book is located at the front, 1,100 when located at the middle, and 1,400 when located at the back of the book section.
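The group means can be computed with `tapply()`, as in this minimal sketch. The data frame `books2_demo` is a hypothetical stand-in for `books2`, with made-up values, so its output is not the forecast quoted above.

```r
# Hypothetical long-format data standing in for `books2` (values are made up)
books2_demo <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 3),
  sales    = c(880, 900, NA,  1090, 1100, 1110,  1390, 1400, 1410)
)

# Mean sales per location, ignoring missing values
group_forecast <- tapply(books2_demo$sales, books2_demo$location,
                         mean, na.rm = TRUE)
group_forecast
```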
Key Takeaways
When using one-way ANOVA for forecasting:
If mean sales are not significantly different across groups:
The predicted mean for each group equals the overall mean.
If mean sales are significantly different across groups:
The predicted mean for each group equals the group mean.
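Both rules can be wrapped in one small function. This is a sketch only: `anova_forecast()` is a hypothetical helper, not something from the book or from a package, and it assumes a data frame with `sales` and `location` columns and a 0.05 significance level.

```r
# Hypothetical helper: choose the forecast based on the one-way ANOVA result
anova_forecast <- function(data, alpha = 0.05) {
  # Drop rows with missing sales so the test and the means use the same rows
  data <- data[!is.na(data$sales), ]
  p <- oneway.test(sales ~ location, data = data, var.equal = TRUE)$p.value
  if (p < alpha) {
    tapply(data$sales, data$location, mean)  # significant: group means
  } else {
    c(overall = mean(data$sales))            # not significant: overall mean
  }
}

# Toy data with clearly separated group means
demo_diff <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 4),
  sales    = c(9, 10, 11, 10,  12, 11, 13, 12,  15, 14, 16, 15)
)
forecast_diff <- anova_forecast(demo_diff)
forecast_diff

# Toy data where every location has the same sales pattern
demo_same <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 4),
  sales    = rep(c(9, 10, 11, 10), times = 3)
)
forecast_same <- anova_forecast(demo_same)
forecast_same
```

The first call returns one forecast per location, while the second returns a single overall mean.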
References:
Winston, W. L. (2014). Marketing analytics: Data-driven techniques with Microsoft Excel. Wiley.