How to use multiple linear regression in R to determine which marketing actions drives sales

Anita Owens
6 min readSep 1, 2021

Do price cuts increase sales?

My local grocery store canned goods tomato aisle

We will use multiple linear regression to determine whether a promotion, coupon or price reduction drives more sales of our grocery product of canned tomatoes. Continuing my journey through Wayne L. Winston’s Marketing Analytics: Data-Driven Techniques with Microsoft Excel book, I will build a multiple linear regression model to answer the following:

1. Determine if a promotion, coupon, or price reduction influence sales.

2. Predict sales of canned tomatoes during a week in which you use a shopping cart notice, a coupon, and reduce price by 1 cent.

Our dataset contains 12 weeks of sales data for our canned tomatoes at our supermarket.

Preview grocery store sales data

Week: The week number of the observed sales.
Cart.Notice: Whether or not there was a promotion.
Coupon: Whether or not there was a coupon offering.
Price.Reduction: The price reduction offered. Either no price reduction, 1 cent off or 2 cents off.
Sales: The number of cases of canned tomatoes sold for the week.

We need to prepare our dataset for modeling by turning our character variables into factor type variables which can be done in one line of code for Cart.Notice and Coupon variables. However, Price.Reduction is a numeric variable so we will transform this variable separately.

#Convert all character variables into factor in one line: Make sure tidyverse package is loaded.
grocery <- grocery %>% mutate_if(is.character, as.factor)
#Factor Price Reduction variable
grocery$Price.Reduction <- as.factor(grocery$Price.Reduction)
Transformed dataframe

Dummy variable regression

When working with a categorical variable in regression, the variables take on a value that is a label, so it doesn’t have a numerical interpretation. For instance, our coupon variable takes on 2 labels (yes or no). Since it can only take 2 values, to include these in our regression model, we use a technique called a dummy variable regression.

Luckily for us, simply converting our variables to a factor type labels our variables, so when we feed the categorical variables into our function, R will automatically use one of the categories as a reference variable. The number of dummy variables required in a regression is one less than the number of categories in the categorical variable.

Multiple Linear Regression Model

Now we build our multiple regression model using the lm() function. Multiple regression simply means we have two (2) or more independent or explanatory variables in our model. We will model Sales as explained by Coupon, Cart.Notice and Price.Reduction. We do not need the Week column.

# All variables with the exception of Week as an independent variable
grocery_model_lm <- lm(Sales ~ Coupon + Cart.Notice + Price.Reduction, data = grocery)

Model summary output:

Multiple linear regression output

Let’s interpret the output.

Intercept

The intercept is the value of the dependent variable (Sales) when all of the independent variables are zero. We can expect to sell 14.5 cases of tomatoes per week when there is no coupon, no cart notice and no price reduction.

Coefficients

CouponYes

CouponYes coefficient is -1.17 (negative coefficient), but the p-value is not significant; therefore, our estimated coefficient is not statistically different from 0. In other words, we cannot rule out that the true value of the coefficient is zero. This implies that sales of tomatoes are no different when there is or isn’t a coupon promotion.

Cart.NoticeYes

Cart.NoticeYes coefficient is 20.17 (positive coefficient) and the p-value is statistically significant. We can expect to sell 20 more cases of tomatoes when there is a cart notice, while keeping all other variables at the same level.

Price.Reduction1

The Price.Reduction1 coefficient is 5 (positive coefficient), but the p-value is not significant; therefore, our estimated coefficient is not statistically different from 0. This implies that sales are no different when there is a 1 cent price reduction.

Price.Reduction2

Price.Reduction2 coefficient is 12.8 (positive coefficient) and p-value is statistically significant. We can expect to sell 12.8 more cases of tomatoes when there is a price reduction of 2 cents, while keeping all other variables at the same level.

Insignificant variables

We can conclude that coupon and a price reduction of 1 cent is not significant in influencing sales. We will remove the insignificant variables and run our regression model again. First, I simply create a new column to recode all of our 1 cent price reductions to 0 and factor the new variable and then check the results.

#Create new price reduction variable
grocery <- grocery %>%
mutate(Price.Reduction_new =
case_when(
Price.Reduction == 2 ~ 2,
Price.Reduction == 1 ~ 0,
Price.Reduction == 0 ~ 0))
#We need to factor Price.Reduction_new
grocery$Price.Reduction_new <- as.factor(grocery$Price.Reduction_new)
#Check levels
levels(grocery$Price.Reduction_new)
#Inspect results
str(grocery)
Inspect the results of our new Price.Reduction variable

Final Regression Model

Build model with just the significant variables and check model output.

#Model sales as explained by Cart.Notice and Price.Reduction_new
grocery_model_lm_signifvars <- lm(Sales ~ Cart.Notice + Price.Reduction_new, data = grocery)
#Print model output
summary(grocery_model_lm_signifvars)
Final regression summary model output with significant variables only

Removing the insignificant variables definitely has an impact on model fit.

Goodness of fit measures

An R-squared of .855 indicates that this model is able to explain about 86% of the variation in the sales of tomatoes given all the independent variables together. The remaining 14% goes unexplained.

The p-value of the model itself is 0.0001639 which is less than 0.05 which indicates that this model has predictive power.

The residual standard error for the regression model is 5.29 and measures how far off our model is. You have to interpret the standard error in relative to what you are predicting. In effect, the typical difference between observed sales and predicted sales is 5.29 cases. As long as the residuals are normal, then 68% of our predictions should fall within 1 standard error (5.29 cases) and 95% should be accurate within 2 standard errors (10.58 cases). Predictions greater than 2 standard errors would be considered an outlier.

Notably, removing the insignificant variables did not improve R-squared. But if we look at our adjusted R-squared, the second model did improve ever so slightly. More importantly, model 2 did decrease our residual standard errors. Reducing residual standard errors becomes very important in assessing your model if your goal is to build accurate forecasts.

Forecast sales

We will predict sales of canned tomatoes during a week in which you use a shopping cart notice, a coupon and reduce price by 1 cent.

First, create a new data frame to hold the values (out-of-sample data) we want to predict. Since coupon and 1 cent price reduction has no impact on sales as indicated by our first regression model, we will leave out the Coupon variable completely in our new data frame, but we will need to include Price.Reduction, but code it as a zero (0) so that it fits the model.

# Create new data frame and print output
(new_data_02 <- data.frame(Cart.Notice = "Yes", Price.Reduction_new="0"))
Output of new data frame with the values we want to predict

Second, load the broom package because we will use the augment() function to make our predictions. Our regression model is the first argument and the new data frame is the second argument. We will use our regression model that has the significant variables only.

#Using broom package
augment(grocery_model_lm_signifvars, newdata = new_data_02)

The augment function returns a tibble with a new column called .fitted which holds our prediction(s).

Augment function output contains a new column .fitted with our predicted sales

We can expect to sell 36.5 cases of tomatoes.

Final Summary

1. Determine if a promotion, coupon, or price reduction influence sales.

The most significant factors that influence sales of tomatoes is a cart notice followed by a price reduction of 2 cents.

2. Predict sales of canned tomatoes during a week in which you use a shopping cart notice, a coupon, and reduce price by 1 cent.

36.5 cases.

Full code can be found on Github.

--

--

Anita Owens

Analytics engineer, mentor and lecturer in analytics. The glue person bridging the gap between data and the business. https://www.linkedin.com/in/anitaowens/