How to build a sales forecast in Python

7 min readJul 28, 2022

Does shelf location impact sales? A step-by-step forecasting guide with linear regression.

In an earlier article, I built a forecast model to answer the question of whether grocery store shelf location impacts sales using linear regression in R. In this article, I will replicate the forecast exercise in Python using pandas, numpy, and statsmodels. This dataset is from Chapter 10 of Wayne L. Winston’s Marketing Analytics: Data-Driven Techniques with Microsoft Excel.

Our Oreos dataset has 3 columns:

Observation number
Height in feet
Sales

Preview of our dataset. Image by author.

Data Exploration

Before we step into data modeling, it’s a good idea to visualize our data to get a sense for it.

#Boxplot with striplot - each individual data point super-imposed on boxplotsns.boxplot(data=df, x="Heightinfeet", y="Sales")
sns.stripplot(data=df, x="Heightinfeet", y="Sales", color = 'black', size = 8, alpha = 0.3)
plt.xlabel("Shelf Height in Feet", size = 12)
plt.ylabel("Cookie Sales", size = 12)
plt.show()

Sales by shelf height. It’s clear that sales are higher when shelf height is six (6) feet. Image by author.

How about some summary statistics.

#What are average sales overall?round(df['Sales'].mean(),1)

#What is the average & total sales by shelf location?#Group by heightinfeet and take mean, median, and sum of salesdf.groupby('Heightinfeet').agg({'Sales': ['mean', 'median', 'sum']})

Average sales and sum of sales by height. Image by author.

Some important insights we uncovered during the data exploration step include:

Overall sales over the entire dataset is 44 on average.
Higher sales when shelf height is at 6 feet. On average 61.
Lower sales when shelf height as at 5 feet. On average 44.50

Forecast Modeling

We will build a forecast model using linear regression with the Python statsmodels package and the ols() function.

We only have one (1) dependent variable (Sales) and one (1) independent variable (Heightinfeet). We will assume that Sales are in number of units. We will treat Heightinfeet as a categorical variable. This will affect how we interpret the output of our regression model. We ignore the first column for building our model.

#Convert Heightinfeet from numeric to categorical objectdf['Heightinfeet'] = df['Heightinfeet'].astype('category')
# Create the model objectforecast_model = ols("Sales ~ Heightinfeet", data=df)
# Fit the modelforecast_model = forecast_model.fit()
# Print the summary of the fitted modelprint(forecast_model.summary())

Output of linear regression model. Image by author.

Interpreting the OLS (least squares) output

Going from left to right starting with the coefficients

Coefficients (coef)

A positive coefficient indicates that as the value of the independent variable increases (height), the mean of the dependent variable (sales) also tends to increase.

Intercept or Constant(B0)

The intercept is the value of the dependent variable (y) when all of the independent variables (x) are zero. In this case, since we have only one categorical variable as a predictor, the intercept is the mean value of Oreo sales at a shelf height of five feet (the reference group on the intercept).

Heightinfeet6(B1), Heightinfeet7(B2)

At 5 feet sales are 23.75. Now we interpret 6 and 7 feet in relation to the 5 feet shelf location.

At 6 feet sales increases by a whopping 36 units, but at 7 feet, sales increase by only 16 units in relation to 5 feet.

Overall, 6 feet is a better location for our Oreos if we want to optimize for sales.

Standard error (std err)

Standard error of estimated coefficient. A measure of how precise the slope has been estimated. The lower the better.

t statistic (t)

We want to know if the t-statistic falls within the rejection region. One of three (3) approaches of hypothesis testing using regression output. In regression, hypothesis testing is needed to test if the estimates that we get from the regression model (which is based on a sample of data) are true.

P-values (P>|t|)

Low p-values are good. The p-values are <.05 for both independent variables. (Heightinfeet[T.6] & Heightinfeet[T.7]). This implies that we reject the null hypothesis (that the beta coefficients are zero). The last of the three (3) approaches of hypothesis testing using regression output.

Confidence Intervals [0.025 0.975]

This informs us if the beta coefficient falls within the confidence interval range. For example, at 5 feet sales are 23.75, but could be within the range of 16.8 to 30.7. The second of three (3) approaches of hypothesis testing using regression output.

Starting from the top right of the regression output

R-squared/Coefficient of determination

The r-squared value of 0.886 indicates that 89% of the variation in sales is explained by shelf height. The remaining variation of 11% goes unexplained. A goodness of fit measure for regression output.

Adj. R-squared

Controls for the number of independent variables (x) in a model. Quite useful when comparing multiple regressions models against one another.

F-statistic

The f-statistic of 34.99 tells us whether the model is significant or not significant. The model is significant if any of the coefficients are non-zero, which in this case, is true.

How do we write our equation if we want to use this linear model to forecast sales?

Our equation:

Sales = 23.75 + 36.25(6 feet) + 16(7 feet)

Luckily, for us we can use Python to do the math for us at scale to predict future sales.

Build Forecast Model

Let’s go ahead and build our forecast model. We need to create a new DataFrame to hold our future forecast.

# Create explanatory_dataexplanatory_data = dfprint(explanatory_data)

The explanatory DataFrame is a copy of our original dataFrame. Image by author.

We will use the predict function to make predictions on this dataset and save results to a new DataFrame called prediction_data.

#Use forecast_model to predict future sales using predict functionfuture_sales = forecast_model.predict(explanatory_data)
# Create prediction_data. Future_sales has been added as a column to the explanatory data DataFrameprediction_data = explanatory_data.assign(future_sales = future_sales)
print(prediction_data)

Explanatory data with new column that holds the future predictions. Image by author.

Let’s plot the sales with the predictions using our new prediction_data DataFrame.

#Plot predictions, but first convert Heightinfeet back to numeric objectdf['Heightinfeet'] = pd.to_numeric(df['Heightinfeet'])
# Plot first the data for our original datasetfig = plt.figure()sns.scatterplot(x="Heightinfeet",y="Sales",data=df,color = "darkgrey",marker = "o")
# Add a scatter plot layersns.scatterplot(x="Heightinfeet",y="future_sales",data=prediction_data,color = "red",marker = "s")
# Add titleplt.title('Forecasted Sales marked with red square')
#Add x-labelplt.xlabel('Shelf Height')
# Show the plotplt.show()

Actual sales in grey. Forecasted sales in red. Image by author.

Practically speaking, if we wanted to forecast sales manually at each shelf height, we would calculate it like the following:

sales_at_seven_feet = 23.75 + 36.25*(0) + 16*(1)sales_at_six_feet = 23.75 + 36.25*(1) + 16*(0)sales_at_five_feet = 23.75 + 36.25*(0) + 16*(0)

Sales Forecast Model at All Shelf Heights

Let’s create a forecast model for all shelf heights.

#Create a forecast table where we group by heightinfeet and take mean and sum of salesforecast_table = prediction_data.groupby('Heightinfeet', as_index=False).agg(#get average forecasted sales
forecasted_avg_sales = ('future_sales', 'mean'),#get total forecasted sales
forecasted_total_sales = ('future_sales',sum)
).round(2) #Round numbers after the final aggregation#Make a nice looking Plotly table
fig = go.Figure(data=[go.Table(
    header=dict(values=list(forecast_table.columns),
                fill_color='#A5D8DD',
                align='center'),
    cells=dict(values=[forecast_table.Heightinfeet, forecast_table.forecasted_avg_sales, forecast_table.forecasted_total_sales],
               fill_color='#F1F1F1',
               align='center'))
])fig.show()

Now we have a full forecast model. Conceptually, let’s walk through what our analysis and forecast model means for the business.

7 feet — 5 feet

Putting Oreos at 7 feet relative to 5 feet increases sales by 16

6 feet — 5 feet

Putting Oreos at 6 feet relative to 5 feet increases sales by 36

7 feet — 6 feet

Putting Oreos at 7 feet relative to 6 feet decreases sales by 20

NOTE: If you need to evaluate the point forecast accuracy, you should test your forecast model on data that it has not seen before by splitting the data into training and test sets. Train the model on the training data and test the model on the test data (e.g. the most recent data). Then evaluate forecast based on your performance metric of choice e.g. mean squared error (MSE), mean absolute error (MAE), mean absolute percent error (MAPE) etc.

Final Summary

Business Question: Does shelf location impact sales?

Business Answer: Yes, shelf location impacts sales. If we want to maximize sales, then placing our product at 6 feet gives us the best sales. We will sell on average 60 units vs (24 at 5 feet and 40 at 7 feet.) If we place our product at 6 feet, we can expect to sell approximately 240 total units.

Why does this forecasting method (mean method) work?: Taking the average is very effective as a forecasting tool as all future values are equal to the mean of the historical data.

The dataset and full code (including performance evaluation metrics) can be found on Github.

Tip

When you have one categorical variable in a linear regression model, the coefficients of the model are the means of each category. You can get the means by removing the intercept from the model as illustrated below.

Linear regression model with the intercept removed. Image by author.

References

[1]. Winston, W. L. (2014) Marketing Analytics: Data-driven techniques with Microsoft Excel. Wiley. Pg 225-234.

[2]. Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 15–07–2022.

[3]. Chatterjee, S., & Hadi, A. S. (2012) Regression analysis by example. Wiley. Pg 37.

[4]. Borle, S., Linear Regression for Business [MOOC]. Coursera. https://www.coursera.org/learn/linear-regression-business-statistics.