How to use correlation analysis to improve marketing performance in Python & R

Anita Owens
Published in Geek Culture · 8 min read · Jun 21, 2022

What are the key drivers of sales?

Photo by Nguyen Dang Hoang Nhu on Unsplash

Why correlation?

It is important to understand what drives relationships. For example, if you want to determine how marketing performance impacts your sales numbers, you have to account for all the factors that can explain them. These could include:

  1. marketing channels
  2. season
  3. geographic location
  4. etc.

Correlation is a great exploratory tool that can sometimes reveal interesting patterns in your data. Most importantly, once you get the hang of it, it is really easy to add to your analysis toolkit.

What is correlation?

A correlation is a statistic that quantifies the strength of the relationship between two variables. The statistic is called the correlation coefficient, denoted r.

The correlation coefficient r is a number between -1 and +1 that represents the linear relationship between two variables. An r close to +1 or -1 indicates a strong relationship, while an r close to 0 indicates a weak one.

Correlation examples. Image by author.

A positive or negative sign indicates the direction of the relationship. A positive r indicates a positive relationship and a negative r a negative relationship. Also, we can plot the statistic in a correlation plot or matrix (which we will do shortly).

Let’s cover three common correlation methods:

  1. Pearson method — the default. It measures linear relationships and assumes your data is normally distributed. It is sensitive to outliers and skewed data.
  2. Spearman method — for non-normal populations. Checks for rank or ordered relationships.
  3. Kendall method — for when you have a small dataset and many tied or rank relationships.

Your choice of correlation method should be driven by the underlying distribution of your data.
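As a quick sketch of how the three methods compare, here is a toy example (made-up numbers, not the article's dataset) where the relationship is strictly increasing but nonlinear, so the rank-based methods report a perfect monotonic association while Pearson does not:

```python
import pandas as pd

# Toy data: strictly increasing but nonlinear
df = pd.DataFrame({
    "spend": [10, 20, 30, 40, 50, 60],
    "sales": [12, 14, 20, 35, 70, 160],
})

# pandas exposes all three methods via the `method` argument
for method in ("pearson", "spearman", "kendall"):
    r = df["spend"].corr(df["sales"], method=method)
    print(method, round(r, 2))
```

Here Spearman and Kendall report 1.0 (the ranks agree perfectly) while Pearson is noticeably lower, which is exactly why the choice of method matters.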

How to interpret correlation

Correlation thresholds using Jacob Cohen’s rule of thumb, which is often used in the behavioral sciences to interpret effect size (the thresholds apply to the absolute value of r):

|r| >= 0.5 large or strong association
|r| >= 0.3 medium association
|r| >= 0.1 small or weak association

If the underlying data distribution is not normal, then you could transform (e.g. logarithm, Box-Cox, etc.) your variables before attempting to apply these thresholds.
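A quick sketch of such a transform, using a hypothetical right-skewed variable (illustrative numbers only): the natural log pulls in the long tail, which you can verify by comparing skewness before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sales figures; the log compresses the long tail
sales = pd.Series([100, 120, 150, 200, 400, 900, 5000], dtype=float)
log_sales = np.log(sales)

print(round(sales.skew(), 2), round(log_sales.skew(), 2))
```

The skewness of the logged variable is much closer to zero, which makes the Pearson-based thresholds above more trustworthy.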

What’s an acceptable correlation?

Even if the correlation coefficient is at or near zero, that doesn’t mean no relationship exists. It just means there is no linear relationship; the variables could still be related in other ways, which is why it’s important to visualize your variables beforehand.
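A minimal illustration of this point: a perfect quadratic relationship whose Pearson correlation is essentially zero (toy data).

```python
import numpy as np
import pandas as pd

# y is completely determined by x, yet Pearson r is ~0
x = pd.Series(np.arange(-5, 6), dtype=float)
y = x ** 2

# ~0.0: no *linear* association despite a perfect relationship
print(round(x.corr(y), 2))
```

A scatter plot would reveal the parabola instantly, which is exactly why visualizing first matters.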

Exploratory Visualization and Correlation Analysis

We will be using this marketing dataset that is available on Kaggle.

Marketing Mix dataset from Kaggle. Image by author.

The data contains two consecutive years of sales data for a product of a non-specified brand. Each row contains the volume of sales for a week, along with information on the various promotion methods used for that product that week. Let’s inspect the dataset.

Inspect DataFrame in Python. Image by author.
Inspect dataframe in R. Image by author.

Notice that we have some null values in the Radio, NewspaperInserts, and Website_Campaign columns.
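A small sketch of checking and handling those nulls, using a few made-up rows with the article's column names. Treating a null as "no promotion ran that week" is an assumption worth confirming with whoever owns the data.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the null pattern described above (not the real Kaggle data)
df = pd.DataFrame({
    "NewVolSales": [19000, 20500, 21000, 18750],
    "Radio": [np.nan, 300, 250, np.nan],
    "NewspaperInserts": [np.nan, "Insert", np.nan, "Insert"],
    "Website_Campaign": ["Website Campaign", np.nan, np.nan, "Twitter"],
})

print(df.isnull().sum())  # nulls per column

# If a null simply means "no spend / no promotion that week",
# filling with 0 (or an explicit label) is a reasonable choice
df["Radio"] = df["Radio"].fillna(0)
df["NewspaperInserts"] = df["NewspaperInserts"].fillna("None")
```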

Let’s do some exploratory data visualization of our dataset.

EDA in R. Image by author.
EDA in Python. Image by author.

What can we tell from the exploratory data visualization?

Sales appear to be relatively normal with perhaps a bit of right skew, but not enough to be particularly worrying.

Sales are higher when the base price is low and lower when the base price is high.

Stout seems to have a negative impact on sales. Not sure what stout refers to as it was not included in the data dictionary and the data source is anonymous.

InStore appears to have a positive impact on sales.

The impact of both Radio and TV is inconclusive. Sales tend to be higher when radio and TV spending is active, but not always.

Newspaper insert doesn’t appear to have any significant impact on sales.

Sales appear to be higher when a website campaign has Twitter engagement, but they don’t appear significantly different when no website campaign is running.
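The "relatively normal" eyeball check above can also be backed up numerically. A small sketch on synthetic stand-in data (the real check would run on the Kaggle dataframe):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: roughly normal weekly sales with a mild right tail
rng = np.random.default_rng(0)
sales = pd.Series(rng.gamma(shape=50, scale=400, size=104))  # 104 weeks

# Skewness near 0 supports using Pearson; a large value suggests
# transforming first or switching to Spearman
print(round(sales.skew(), 2))
```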

Next, let’s create our correlation plot.

# Python
corr = df.corr()
corr.style.background_gradient(cmap='RdYlGn')
# R
library(dplyr)
library(corrplot)
df %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot(type = "upper", addCoef.col = "black", diag = FALSE)
Correlation matrix with heatmap in Python. Image by author.
Correlation matrix with heatmap in R with corrplot package. Image by author.

We have the correlation coefficients in each box. In the R corrplot, positive correlations are in blue and negative correlations are in red (the Python heatmap’s RdYlGn colormap uses green for positive and red for negative).

Summary of correlations:

Instore and discount both have a medium positive correlation to NewVolSales.

Radio and TV have a weak positive correlation to NewVolSales.

Price has a strong negative correlation to NewVolSales.

Last, but not least, Stout has a medium negative correlation to NewVolSales.

NOTE: The coefficient of determination is our correlation coefficient squared. It is the proportion of the variance in the y (dependent) variable that is predictable from our x (independent) variable.

Correlation and its relationship to regression

Let’s review how correlation and regression are related by reviewing just 2 variables (NewVolSales and Discount).

The correlation coefficient of NewVolSales and Discount is 0.42 (rounded to 2 decimal places).

#Python
round(df['NewVolSales'].corr(df['Discount']),2)
0.42
#R
round(cor(df$NewVolSales, df$Discount),2)
0.42

If we model this in a linear regression model and extract the r-squared, the result is 0.18 (rounded to 2 decimal places).

# Python
from statsmodels.formula.api import ols

sales_vs_discount = ols("NewVolSales ~ Discount", data=df)
sales_vs_discount = sales_vs_discount.fit()
print(sales_vs_discount.summary())
# R
linear_model <- lm(NewVolSales ~ Discount, data = df)
summary(linear_model)
Output of linear model for NewVolSales as explained by Discount. Python and R output. Image by author.

If we square the correlation coefficient of 0.42, we get our r-squared: 0.42² ≈ 0.18. In simple linear regression, the correlation coefficient squared is the r-squared.
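This equivalence is easy to verify directly. The sketch below uses synthetic data with the article's column names (so the numbers differ from those above), and checks that the squared Pearson r matches the model's R² exactly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Kaggle data (column names follow the article)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Discount": rng.normal(10, 3, 104)})
df["NewVolSales"] = 20000 + 150 * df["Discount"] + rng.normal(0, 900, 104)

r = df["NewVolSales"].corr(df["Discount"])              # Pearson r
fit = smf.ols("NewVolSales ~ Discount", data=df).fit()  # simple regression

# With a single predictor, R^2 equals r squared
print(round(r ** 2, 4), round(fit.rsquared, 4))
```

Note this identity only holds for simple (one-predictor) regression; with multiple predictors R² is no longer the square of any single pairwise correlation.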

Testing for Significance

Let’s test significance of all the variables in our dataset by using a linear regression model on the entire dataset.

# Python (Full Model)
mkmix_model = ols("NewVolSales ~ Base_Price + Radio + InStore + NewspaperInserts + Discount + TV + Stout + Website_Campaign", data=df)
# Fit the model
mkmix_model = mkmix_model.fit()
# Print the summary of the fitted model
print(mkmix_model.summary())
# R (Full Model)
library(tidymodels)
model_spec_lm <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression')
mkmix_model <- model_spec_lm %>%
  fit(NewVolSales ~ Base_Price + Radio + InStore + factor(NewspaperInserts) + Discount + TV + Stout + factor(Website_Campaign), data = df)
summary(mkmix_model$fit)
Python full model output. Significant variables outlined in red. Python gives us warnings about this model: the first has no practical impact, so we ignore it; for the second, a collinearity warning, I would dig further and perhaps transform, scale, or drop the variable(s) causing it. Image by author.
R full model output. Significant variables outlined in red. Image by author.

Our baseline revenue is 54,394. The only positive significant (p-value is less than 0.05) variable is InStore. The negative significant variables are Base_Price, Stout & Website Campaign (no campaign).

The generic interpretation for each of our coefficients is: for every one-unit increase in the x variable, the y variable (NewVolSales) changes by beta units, holding the other variables constant.

For example, for every 1 unit increase of InStore, sales increase by 28. If our sales volume is in dollars, then this would be a 28 dollar increase.
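The significant variables can also be pulled out programmatically rather than read off the summary. A sketch on synthetic data (the article's actual model uses all eight predictors):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: InStore has a real effect on sales, Radio does not
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "InStore": rng.normal(20, 5, 104),
    "Radio": rng.normal(300, 100, 104),
})
df["NewVolSales"] = 18000 + 28 * df["InStore"] + rng.normal(0, 60, 104)

fit = smf.ols("NewVolSales ~ InStore + Radio", data=df).fit()

# Keep only coefficients with p < 0.05, the cutoff used above
significant = fit.pvalues[fit.pvalues < 0.05]
print(significant.index.tolist())
```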

What can we derive from our correlation analysis and how can we use this to inform marketing?

We will just focus on the relationship between NewVolSales and each independent (x) variable.

What’s not working in marketing?

  1. Base_Price — significant. Sales fall when we increase the base price.
  2. Stout — significant
  3. Website Campaign — significant. Specifically, weeks with no website campaign running show lower sales.

What is working with marketing?

  1. Instore — significant

What is not as impactful?

  1. Radio — not significant
  2. TV — not significant
  3. Discount — not significant

You should at this point have a conversation with your marketing stakeholders to understand their marketing goals and tactics for each of their marketing initiatives. There may be different goals for different initiatives. For instance, if marketing is using Radio and TV for top-funnel activities, then what we see in the data makes sense. Radio and TV are great for branding (e.g. awareness), but may have less of an impact on bottom-funnel metrics like sales. So marketing efforts that aren’t necessarily significant from a statistical point of view can still have impact from a real-world point of view.

Summary

With our correlation analysis we have derived some key insights into what is working and what isn’t working when it comes to increased sales. Now, it’s up to you to go further in your analysis which could include:

1. Adding in additional factors that were not included initially.
2. Quantifying the impact of each marketing effort (e.g. ROI); in other words, calculating the marketing contribution towards sales.
3. Building a future forecast based on current levels of marketing spend and promotions.

Photo by Kelly Sikkema on Unsplash

What are some of the pitfalls and drawbacks of correlation?

Pitfalls of correlation:

1. Correlation does not equal causation — correlation gives us a way to check if there is an association between two variables, but there could be other explanations.

2. Latent or hidden variables can affect the relationship between two variables.

Advice:

1. Always check for highly correlated variables, i.e. a correlation coefficient r > 0.9. You can do this by either checking the correlation matrix or checking the variance inflation factor (VIF).

2. If collinearity exists, remove or transform before modeling. An example transformation could be taking the natural log of a variable.

3. Model data using methods robust to collinearity, e.g. Random Forest models.

The R code can be found on both Github and RPubs. The Python code can be found on Github.

References:

[1]. Winston, W. L. (2014). Marketing analytics: Data-driven techniques with Microsoft Excel. Wiley. pp. 170–174.

[2]. Chapman, C. and McDonnell Feit, E., (2015). R for marketing research and analytics. Cham: Springer, pp.95–109, 162–191.

Anita Owens

Analytics engineer, mentor and lecturer in analytics. The glue person bridging the gap between data and the business. https://www.linkedin.com/in/anitaowens/