How to use one-way ANOVA for forecasting in R

Anita Owens
6 min read · Jan 26, 2022

The statistical test that can also forecast

The English book section at my local bookstore. Image by author.

Imagine my surprise when I learned that Analysis of Variance (ANOVA), a well-regarded hypothesis test, can also be used to forecast sales. It really shouldn’t be a surprise, though, as ANOVA and regression are closely related.

ANOVA helps answer the question: Are our groups really different? It compares the variation between the group means with the variation within each group, which lets us make inferences about those means.

I will forecast product sales using one-way ANOVA. It is called one-way because we use only a single independent variable (here, shelf location) to forecast. If we wanted to analyze by two independent variables, that would be a two-way ANOVA. This dataset is from Chapter 40 of Wayne L. Winston’s Marketing Analytics: Data-Driven Techniques with Microsoft Excel. It contains 6 weeks of sales data for a computer book that was placed at different shelf positions in the store.

This dataset has 3 columns (Front, Back, and Middle shelf location) with total sales on the rows.

ANOVA will help answer the question: Does shelf location impact book sales and what can we expect in total sales?

Explore Dataset

First, we need to reshape the data before we can run any statistical analysis. Right now our dataset is wide, and we want it to be long so that each row is one observation per week per location. We will add a week number based on the row number and then use the pivot_longer() function from tidyr.

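# Assumes `bookstore` (the wide data frame) is already loaded
library(dplyr)  # %>%, mutate(), row_number()
library(tidyr)  # pivot_longer()

books <- bookstore %>%
  mutate(week_num = row_number()) %>%
  pivot_longer(!week_num, names_to = "location",
               values_to = "sales")
books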
Transformed data from wide-to-long. Image by author.
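If you want to try the reshape without the tidyverse, here is a minimal base R sketch of the same wide-to-long idea. The data frame `bookstore_toy` and its numbers are made up for illustration and are not the dataset from the book; the row order also differs from what pivot_longer() returns.

```r
# Made-up wide data for illustration only (NOT the article's dataset)
bookstore_toy <- data.frame(
  Front  = c(10, 12, NA),
  Middle = c(11,  9, 13),
  Back   = c(14, 15, 12)
)

# stack() puts all values in one column and the source column name in another
books_toy <- stack(bookstore_toy)
names(books_toy) <- c("sales", "location")

# Columns are stacked one after another, so week numbers repeat once per column
books_toy$week_num <- rep(seq_len(nrow(bookstore_toy)),
                          times = ncol(bookstore_toy))

books_toy
```

Each row of `books_toy` is now one week at one location, which is the shape the tests later in the article expect.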

Note that the data contain some missing values. Rather than removing those rows or imputing (replacing) the values, we will leave them in place and let each function skip them.
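As a quick illustration of what skipping missing values looks like in practice, here is a small sketch on made-up numbers (the `books_toy` data frame is hypothetical, not the real dataset). It counts the missing values per location and shows why `na.rm = TRUE` matters.

```r
# Made-up long-format data for illustration only
books_toy <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 3),
  sales    = c(10, NA, 12,
               11,  9, 13,
               14, 15, NA)
)

# How many missing values does each shelf location have?
na_counts <- tapply(is.na(books_toy$sales), books_toy$location, sum)
na_counts

# Without na.rm the result is NA; with na.rm the missing rows are skipped
mean_with_na    <- mean(books_toy$sales)
mean_without_na <- mean(books_toy$sales, na.rm = TRUE)
mean_with_na
mean_without_na
```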

Next, let’s do some exploratory data visualization. We’ll plot sales by shelf location using a box plot to examine the distributions.

library(ggplot2); library(ggthemes)  # ggthemes provides theme_wsj()
# Set Wall Street Journal theme for all plots
theme_set(theme_wsj())
ggplot(data = books, aes(x = location, y = sales)) +
  geom_boxplot(na.rm = TRUE) +
  ggtitle("Book sales by shelf location")
Book sales ggplot boxplot using wsj theme from ggthemes package. Image by author.

Median sales (the line inside each box) look higher when the book is located at the back. We also have a few outliers, shown as black dots outside the boxes.

ANOVA

Like all statistical tests, ANOVA relies on some assumptions that must hold before we can trust its results. For ANOVA, the assumptions are:

1. Data comes from random samples
2. The observations are independent
3. Normally distributed underlying data
4. Homogeneity of variances

Let’s define our null and alternative hypothesis:

H₀: the means across groups are equal
Hₐ: at least one group mean is different
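For reference, this is the classical F statistic that a one-way ANOVA with equal variances uses to test these hypotheses. The symbols are my own notation, not the book's: k is the number of groups (shelf locations), n_j is the number of observations in group j, N is the total number of observations, ȳ_j is the mean of group j, and ȳ is the overall mean.

```latex
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\sum_{j=1}^{k} n_j \,(\bar{y}_j - \bar{y})^2 \,/\, (k - 1)}
         {\sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 \,/\, (N - k)}
```

Under H₀, F follows an F distribution with k − 1 and N − k degrees of freedom, so a large F (group means far apart relative to the spread within groups) gives a small p-value.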

ANOVA assumptions

In order to use ANOVA, we have to ensure our underlying data meets the four (4) ANOVA assumptions outlined above. The assumptions we are most concerned with are normality and homogeneity of variances. We will check normality with the Shapiro-Wilk test and homogeneity of variance with the Bartlett test.

# Test normality across groups (Shapiro)
tapply(books$sales, books$location, FUN = shapiro.test)
Shapiro-Wilk normality test output. Image by author.

All p-values are well above 0.05 across all shelf positions, so we fail to reject normality and can treat the underlying data as normally distributed.
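If you would rather compare the p-values programmatically than read them off the printed output, here is a small sketch. The `books_toy` data frame is made up for illustration, so its p-values say nothing about the real dataset.

```r
# Made-up long-format data for illustration only (6 weeks x 3 locations)
books_toy <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c( 8, 10,  9, 11, NA, 12,
               11, 12, 10, 13, 12, 14,
               15, 17, 16, 18, 16, 20)
)

# Run Shapiro-Wilk per location and keep only the p-value
# (shapiro.test() drops missing values itself)
shapiro_p <- sapply(split(books_toy$sales, books_toy$location),
                    function(x) shapiro.test(x)$p.value)

round(shapiro_p, 3)

# TRUE means we fail to reject normality for that location at alpha = 0.05
shapiro_p > 0.05
```

Keep in mind that with only a handful of observations per group, the test has little power to detect departures from normality.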

Let’s check homogeneity of variance.

# Check the homogeneity of variance (Bartlett)
bartlett.test(sales ~ location, data = books)
Output of Bartlett test. Image by author.

The Bartlett test p-value is also well above 0.05, so we can assume homogeneity of variance. We can now safely continue with a one-way ANOVA.

Perform one-way ANOVA

# Perform one-way ANOVA (var.equal = TRUE gives the classical F test)
(anova_results <- oneway.test(sales ~ location, data = books, var.equal = TRUE))
# Extract p-value and compare it with alpha = 0.05
anova_results$p.value < 0.05 # TRUE: at least one group mean differs. FALSE: no evidence that mean sales differ across shelf positions.
One-way ANOVA output p-value is very large so we fail to reject the null hypothesis. Image by author.

INTERPRETATION OF ONE-WAY ANOVA RESULT:
The p-value of the test is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis that the means across groups are equal. In other words, we cannot conclude that sales are significantly different based on shelf location. Note that this is not the same as proving the means are identical; we simply have no evidence of a difference.
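To see what oneway.test() is doing under the hood, here is a sketch that computes the F statistic and p-value by hand and compares them with the function's output. The `books_toy` data is made up for illustration; only the mechanics carry over to the real dataset.

```r
# Made-up long-format data for illustration only
books_toy <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c(10, 12,  9, 11, NA, 10,
               11, 10, 12,  9, 12, 11,
               12, 11, 10, 13,  9, 11)
)

# Drop missing rows, as oneway.test() does internally
obs <- books_toy[!is.na(books_toy$sales), ]

k          <- length(unique(obs$location))  # number of groups
n_total    <- nrow(obs)                     # number of observations
grand_mean <- mean(obs$sales)
group_mean <- tapply(obs$sales, obs$location, mean)
group_n    <- tapply(obs$sales, obs$location, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_mean - grand_mean)^2)
ss_within  <- sum((obs$sales - group_mean[obs$location])^2)

f_manual <- (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual <- pf(f_manual, k - 1, n_total - k, lower.tail = FALSE)

# Compare with the built-in test
fit <- oneway.test(sales ~ location, data = books_toy, var.equal = TRUE)
all.equal(f_manual, unname(fit$statistic))
all.equal(p_manual, fit$p.value)
```

Both comparisons should return TRUE, which confirms that the test is simply the ratio of between-group to within-group variation.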

Forecast

Because the ANOVA found no significant difference between shelf locations, the predicted mean for each group is the overall mean. The forecast is therefore the same weekly sales figure irrespective of shelf location.

# Forecast: overall mean of sales, ignoring missing values (NA)
mean(books$sales, na.rm = TRUE)
Forecast for weekly sales. Image by author.

We can expect book sales of 1,120 per week.

Scenario 2: What if the groups have significantly different means?

Let’s try again with a different scenario. Let’s take a look at a new dataset with 6 weeks’ worth of sales data for our computer book. Just like our first dataset, this data is originally wide, and we transform it to a long format.

Transformed data from wide-to-long. Image by author.

Let’s plot this dataset on a box plot.

Book sales for scenario 2 with ggplot boxplot using wsj theme from ggthemes package. Image by author.

Within this dataset, we see noticeably more sales for books located at the back compared with the front and middle locations. The ANOVA will tell us whether that difference is statistically significant.

ANOVA Assumptions (Scenario 2)

# Test normality across groups
tapply(books2$sales, books2$location, FUN = shapiro.test)
Shapiro-Wilk test output. Image by author.
# Check the homogeneity of variance
bartlett.test(sales ~ location, data = books2)
Bartlett test output. Image by author.

The Bartlett test p-value is also very large (> 0.05), so we can again assume homogeneity of variance.

Perform One-way ANOVA again

# Perform one-way ANOVA
(anova_results2 <- oneway.test(sales ~ location, data = books2, var.equal = TRUE))
anova_results2$p.value < 0.05 # TRUE: reject the null hypothesis; at least one group mean differs. FALSE: no evidence that mean sales differ across shelf positions.
ANOVA output for scenario 2. Image by author.

INTERPRETATION OF ONE-WAY ANOVA RESULT:
The p-value is very small (0.003426), so we reject the null hypothesis and conclude that mean sales are significantly different across shelf positions.

Forecast (when we have significantly different means)

The predicted mean for each group equals the group mean.

Forecast of weekly sales based on result of ANOVA. Image by author.

We can expect to sell 900 books per week when the book is located at the front, 1,100 per week when located at the middle, and 1,400 per week when located at the back of the book section.
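The group means behind a forecast like this can be computed in one line. Here is a minimal sketch using base R's aggregate(); the `books2_toy` data frame is made up for illustration and uses much smaller numbers than the real dataset, so this is not necessarily how the figures above were produced.

```r
# Made-up long-format data for illustration only
books2_toy <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c( 8, 10,  9, 11, 10, 12,
               11, 12, 10, 13, 12, 14,
               15, 17, 16, 18, 16, 20)
)

# Mean sales per shelf location; with the formula interface,
# rows with missing values are dropped by default
group_forecast <- aggregate(sales ~ location, data = books2_toy, FUN = mean)
group_forecast
```

Each row of `group_forecast` is the weekly forecast for one shelf location.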

Key Takeaways

When using one-way ANOVA for forecasting:

Forecast if mean sales are not significantly different across groups:
The predicted mean for each group equals the overall mean.

Forecast if mean sales are significantly different across groups:
The predicted mean for each group equals the group mean.
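The two rules above can be wrapped into one small helper. This is a sketch under my own assumptions: `anova_forecast()` is a hypothetical function name, it expects a data frame with `sales` and `location` columns, and both toy datasets are made up for illustration.

```r
# Hypothetical helper: forecast per location based on the one-way ANOVA result
anova_forecast <- function(data, alpha = 0.05) {
  fit <- oneway.test(sales ~ location, data = data, var.equal = TRUE)
  group_means <- tapply(data$sales, data$location, mean, na.rm = TRUE)
  if (fit$p.value < alpha) {
    # Significant difference: each group is forecast by its own mean
    forecast <- as.vector(group_means)
  } else {
    # No significant difference: every group is forecast by the overall mean
    forecast <- rep(mean(data$sales, na.rm = TRUE), length(group_means))
  }
  setNames(forecast, names(group_means))
}

# Made-up data where the locations clearly differ
toy_diff <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c( 8, 10,  9, 11, 10, 12,
               11, 12, 10, 13, 12, 14,
               15, 17, 16, 18, 16, 20)
)

# Made-up data where the locations look alike
toy_same <- data.frame(
  location = rep(c("Front", "Middle", "Back"), each = 6),
  sales    = c(10, 12,  9, 11, 13, 10,
               11, 10, 12,  9, 12, 11,
               12, 11, 10, 13,  9, 11)
)

fc_diff <- anova_forecast(toy_diff)  # one forecast per location
fc_same <- anova_forecast(toy_same)  # same forecast for every location
fc_diff
fc_same
```

The 0.05 threshold is the same significance level used throughout this article; pass a different `alpha` if your analysis calls for one.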

Code can be found on GitHub and RPubs.

References:

Winston, W. L. (2014). Marketing analytics: Data-driven techniques with Microsoft Excel. Wiley.


Anita Owens

Analytics engineer, mentor and lecturer in analytics. The glue person bridging the gap between data and the business. https://www.linkedin.com/in/anitaowens/