Facebook Advertising Analysis


Is our company’s Facebook advertising even worth the effort?

QUESTION:

A company would like to know whether their advertising is effective. Before you ask: yes, Facebook does provide analytics for users who actually use its advertising platform. Our client does not. Their “advertisements” are ordinary posts on their page’s feed and are not promoted by Facebook.

DATA:

Data is from the client’s POS system and their Facebook feed.

MODEL:

KISS. A simple linear model will suffice.

First, we need to obtain our data. We can use the facebook_scraper package to pull the client’s recent posts in a usable format.

#install & load scraper
!pip install facebook_scraper

from facebook_scraper import get_posts
import pandas as pd

Let’s first scrape the posts from the client’s Facebook page. Requesting up to 200 pages of results is more than enough to capture the entire feed (which, as we’ll see, contains only 38 posts).

#scrape
post_list = []

for post in get_posts('clients_facebook_page', pages=200):
  post_list.append(post)
#View the data 
print(post_list[0].keys())
print("Number of Posts: {}".format(len(post_list)))
## dict_keys(['post_id', 'text', 'post_text', 'shared_text', 'time', 'image', 'video', 'video_thumbnail', 'likes', 'comments', 'shares', 'post_url', 'link'])
## Number of Posts: 38

Let’s clean up the list, keeping only Time, Image, Likes, Comments, and Shares.

post_list_cleaned = []

for post in post_list:
  #keep only the values for the keys we care about
  temp = []
  keys_to_keep = ['time', 'image', 'likes', 'comments', 'shares']
  for key in keys_to_keep:
    temp.append(post[key])
  post_list_cleaned.append(temp)

#Replace the image URL with a 0/1 indicator and truncate the timestamp to a date
for post in post_list_cleaned:
  if post[1] is None:
    post[1] = 0
  else:
    post[1] = 1
  post[0] = post[0].date()

We now need to combine the Facebook data with data from the company’s POS system.

#turn into a DataFrame
fb_posts_df = pd.DataFrame(post_list_cleaned)
fb_posts_df.columns = ['Date', 'image', 'likes', 'comments', 'shares']

#import our POS data
daily_sales_df = pd.read_csv('daily_sales.csv')

#merge both sets of data on Date
#(note: both Date columns must share the same type/format, or the rows won't line up)
combined_df = pd.merge(daily_sales_df, fb_posts_df, on='Date', how='outer')

Finally, let’s export the data to a CSV. We’ll do our modeling in R.

combined_df.to_csv('data.csv', index=False)  #drop the index so Date stays the first column

R Analysis

First, let’s import the data we exported from Python. We then need to ensure the variables are cast appropriately (i.e., dates are actual date fields and not just strings). Finally, we’re only concerned with data since the start of 2019.

library(readr)
library(ggplot2)
library(gvlma)


data <- read.table("data.csv", header = TRUE, sep = ",")
data <- as.data.frame(data)

  #the first column name gets mangled on import (byte-order mark), so rename it to 'Date'
names(data)[1] <- c("Date")

  #set data types

data$Sales <- as.numeric(data$Sales)
data$Date <- as.Date(data$Date, "%m/%d/%Y")
data$Image <- as.factor(data$Image)
data$Post <- as.factor(data$Post)

  #create a set of only 2019+ data
data_PY = data[data$Date >= '2019-01-01',]

head(data)
##         Date    Sales Post Image Likes Comments Shares
## 1 2017-01-02 3274.132    0     0     0        0      0
## 2 2017-01-03 3216.918    0     0     0        0      0
## 3 2017-01-04 3367.194    0     0     0        0      0
## 4 2017-01-05 3313.034    0     0     0        0      0
## 5 2017-01-06 3281.882    1     1     2        0      0
## 6 2017-01-07 3206.144    1     1     2        0      0

head(data_PY)
##           Date    Sales Post Image Likes Comments Shares
## 610 2019-01-02 3284.688    0     0     0        0      0
## 611 2019-01-03 3380.694    0     0     0        0      0
## 612 2019-01-04 3604.768    0     0     0        0      0
## 613 2019-01-05 3665.378    0     0     0        0      0
## 614 2019-01-07 3296.450    0     0     0        0      0
## 615 2019-01-08 3315.912    0     0     0        0      0

summary(data_PY)
##       Date                Sales      Post    Image       Likes        
##  Min.   :2019-01-02   Min.   :3181   0:281   0:287   Min.   :  0.000  
##  1st Qu.:2019-04-12   1st Qu.:3370   1: 64   1: 58   1st Qu.:  0.000  
##  Median :2019-07-24   Median :3456                   Median :  0.000  
##  Mean   :2019-07-24   Mean   :3495                   Mean   :  3.983  
##  3rd Qu.:2019-11-02   3rd Qu.:3606                   3rd Qu.:  0.000  
##  Max.   :2020-02-15   Max.   :4432                   Max.   :115.000  
##     Comments           Shares 
##  Min.   : 0.0000   Min.   :0  
##  1st Qu.: 0.0000   1st Qu.:0  
##  Median : 0.0000   Median :0  
##  Mean   : 0.3101   Mean   :0  
##  3rd Qu.: 0.0000   3rd Qu.:0  
##  Max.   :19.0000   Max.   :0

Now that our data’s in, let’s review our summary. We can see our data starts on Jan. 2, 2019 (as we hoped), but we do see one slight problem. When we look at the Post variable, we see it’s highly imbalanced. We have 281 days with no posts and only 64 days with posts. We should re-balance our dataset before doing more analysis to ensure our results aren’t skewed.

I’ll rebalance our data by sampling from the larger group (days with no posts), known as undersampling. I’ll also set a random seed so that our numbers are reproducible.

set.seed(15)
zeros = data_PY[data_PY$Post == 0,]
  #sample 64 of the 281 no-post days (64 = 345 total days - 281 = number of days with a post)
samples = sample(281, size = (345-281), replace = FALSE)
zeros = zeros[samples, ]
balanced = rbind(zeros, data_PY[data_PY$Post == 1,])

summary(balanced$Post)
##  0  1 
## 64 64

Perfect, our data is now balanced. We should also do some EDA on our dependent variable (Daily Sales). It’s a good idea to know what our distribution looks like and whether we have outliers we should address.

hist(balanced$Sales)
boxplot(balanced$Sales)

(Histogram and boxplot of daily sales for the balanced data.)

We can see our data is slightly skewed. Sadly, real-world data never follows a perfectly normal distribution. Luckily, though, the boxplot shows no outliers we need to address.
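If we want more than an eyeball check, we can quantify the skew as well. Below is an optional sketch (not part of the original analysis) that computes a simple moment-based skewness statistic and runs a Shapiro-Wilk normality test, using base R only.

  #optional: quantify the skew instead of just eyeballing the histogram
x <- balanced$Sales
mean((x - mean(x))^3) / sd(x)^3   #moment-based sample skewness; values > 0 indicate a right skew
shapiro.test(x)                   #formal normality test on daily sales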

Now we can begin modeling. Since we’re interested in understanding the dynamics of the system and not actually classifying or predicting, we’ll use a standard regression model.

model1 <- lm(data=balanced, Sales ~ Post)
summary(model1)
## 
## Call:
## lm(formula = Sales ~ Post, data = balanced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -316.22 -114.73  -29.78  111.17  476.49 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3467.1       20.5 169.095  < 2e-16 ***
## Post1           77.9       29.0   2.687  0.00819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164 on 126 degrees of freedom
## Multiple R-squared:  0.05418,    Adjusted R-squared:  0.04667 
## F-statistic: 7.218 on 1 and 126 DF,  p-value: 0.008193
gvlma(model1)
## 
## Call:
## lm(formula = Sales ~ Post, data = balanced)
## 
## Coefficients:
## (Intercept)        Post1  
##      3467.1         77.9  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model1) 
## 
##                         Value p-value                   Decision
## Global Stat         8.351e+00 0.07952    Assumptions acceptable.
## Skewness            6.187e+00 0.01287 Assumptions NOT satisfied!
## Kurtosis            7.499e-01 0.38651    Assumptions acceptable.
## Link Function      -1.198e-13 1.00000    Assumptions acceptable.
## Heteroscedasticity  1.414e+00 0.23435    Assumptions acceptable.

Using a standard linear model, we obtain a result that says a Facebook post increases daily sales by $77.90 on average. Based on the t-statistic and p-value, this result is highly statistically significant. We can use the gvlma package to check whether our model satisfies the underlying OLS assumptions. Here we pass every check except skewness, which we already flagged as a potential problem with our data.
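As a quick sanity check on that interpretation, we can ask the model for its fitted values for the two groups. This is a sketch, not part of the original write-up: the intercept is the expected daily sales on a no-post day, and adding the Post1 coefficient gives the expected sales on a day with a post.

  #sanity check (sketch): fitted daily sales without and with a post
predict(model1, newdata = data.frame(Post = factor(c(0, 1), levels = c(0, 1))))
  #roughly 3467 (no post) vs. 3545 (post), i.e. a lift of about $78 per day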

A common correction for skewness is a log transformation. Let’s transform our dependent variable and see if it helps. Note that this model (a log-lin model) will produce coefficients with different interpretations than our last model.

model2 <- lm(data=balanced, log(Sales) ~ Post)
summary(model2)
## 
## Call:
## lm(formula = log(Sales) ~ Post, data = balanced)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.092228 -0.032271 -0.007508  0.032085  0.129686 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 8.150154   0.005777 1410.673  < 2e-16 ***
## Post1       0.021925   0.008171    2.683  0.00827 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04622 on 126 degrees of freedom
## Multiple R-squared:  0.05406,    Adjusted R-squared:  0.04655 
## F-statistic: 7.201 on 1 and 126 DF,  p-value: 0.008266
gvlma(model2)
## 
## Call:
## lm(formula = log(Sales) ~ Post, data = balanced)
## 
## Coefficients:
## (Intercept)        Post1  
##     8.15015      0.02193  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model2) 
## 
##                         Value p-value                   Decision
## Global Stat         7.101e+00 0.13063    Assumptions acceptable.
## Skewness            4.541e+00 0.03309 Assumptions NOT satisfied!
## Kurtosis            1.215e+00 0.27030    Assumptions acceptable.
## Link Function      -6.240e-14 1.00000    Assumptions acceptable.
## Heteroscedasticity  1.345e+00 0.24614    Assumptions acceptable.
plot(model2)
hist(log(balanced$Sales))

(Diagnostic plots for model 2 and a histogram of log daily sales.)

Our second model produces another highly significant coefficient for the Facebook post variable. Here we see that each post is associated with an average 2.19% increase in daily sales. Unfortunately, even the log transformation was unable to fully correct the skewness in our data. We’ll have to note this when presenting our findings later.
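A note on that 2.19% figure: it uses the common shortcut of reading a log-lin coefficient directly as a percentage (100 × the coefficient). For a coefficient this small, the exact effect, 100 × (exp(coefficient) - 1), is nearly identical. A quick check (not part of the original output):

  #exact percentage effect implied by the log-lin coefficient
(exp(coef(model2)['Post1']) - 1) * 100
  #roughly 2.22%, so reading the coefficient as ~2.19% is fine at this scale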

Let’s now determine how much the 2.19% is actually worth (since we saw in model 1 that a post was worth $77.90).

mean_sales_no_post <- mean(balanced$Sales[balanced$Post == 0]) 
mean_sales_with_post <- mean(balanced$Sales[balanced$Post == 1]) 

mean_sales_no_post * model2$coefficients['Post1']
##    Post1 
## 76.02

Very close. Model 2’s coefficient equates to $76.02 per post, which is very similar to the $77.90 from model 1.

Let’s now run another test to see if we get similar results. In analytics, it’s always helpful to arrive at the same conclusion via different means, if possible. This helps solidify our results.

Here we can run a standard t-test. Yes, yes, for the other analysts reading: a t-test is built on the same statistic the OLS coefficient test uses (hence the t-statistic it produces). Here, however, let’s run it on the full, unbalanced dataset to make sure we didn’t miss anything when sampling our data (perhaps we happened to draw unusually good or bad days, which would skew our results).

t_test <- t.test(data_PY$Sales[data_PY$Post == 1],data_PY$Sales[data_PY$Post == 0] )
t_test
## 
##  Welch Two Sample t-test
## 
## data:  data_PY$Sales[data_PY$Post == 1] and data_PY$Sales[data_PY$Post == 0]
## t = 2.5407, df = 89.593, p-value = 0.01278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   13.3264 108.9259
## sample estimates:
## mean of x mean of y 
##  3544.970  3483.844
summary(t_test)
##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    2      -none- numeric  
## null.value  1      -none- numeric  
## stderr      1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character
ggplot(data = data_PY, aes(Post, Sales, color = Post)) + geom_boxplot() + geom_jitter() 

(Boxplot of daily sales by Post, with jittered points, on the full 2019+ data.)

Again, we receive a promising result. On the full dataset, our t-statistic is 2.54 (p = 0.013), so we reject the null hypothesis that the difference in means between the two groups is zero. The t-test produces a 95% confidence interval of [13.33, 108.93] for that difference, which contains our earlier estimates of $77.90 and $76.02, once again adding to our confidence.

So now we know our Facebook posts are actually benefiting the business by generating additional daily sales. However, we also saw earlier that the company hasn’t been posting regularly (which is why we had to rebalance the data).

length(data_PY$Sales[data_PY$Post == 0])
## [1] 281
ggplot(data = data.frame(post = as.factor(c('No Post', 'Post')),
                         m = c(sum(data_PY$Post == 0), sum(data_PY$Post == 1))),
       aes(x = post, y = m)) + geom_bar(stat = 'identity', fill = 'dodgerblue3')

(Bar chart of the number of days with and without a Facebook post.)

Let’s create a function that takes two inputs: 1) the percentage of the no-post days we would have advertised on, and 2) the percentage of those advertisements that would have been effective.

The reason for the second argument is that it’s unrealistic to assume all of our ads are effective. Indeed, we likely face diminishing returns with more posts (people probably tire of seeing them, or hide them in their feed if they become too frequent). Limiting effectiveness gives us a more reasonable estimate of the lost revenue.

Another benefit of creating a custom function is that we can quickly re-run the calculations if management desires.

  #construct a 90% confidence interval around the Post1 coefficient
conf_interval = confint(model2, "Post1", .90)

  #missed revenue = (no-post days advertised) x (share effective) x (baseline daily sales) x (per-post lift)
  #280 is (roughly) the number of 2019+ days without a post
missed_revenue <- function(pct_addlt_adv, pct_effective){
  min = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * conf_interval[1]
  mean = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * model2$coefficients['Post1']
  max = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * conf_interval[2]
  
  print(paste(pct_addlt_adv * 280, "additional days of advertising"))
  sprintf("$%.2f -- $%.2f -- $%.2f", max(min, 0), mean, max)  #lower bound floored at $0
}

#missed_revenue(% of additional days advertised, % of advertisements that were effective)

missed_revenue(.5, .7)
## [1] "140 additional days of advertising"
## [1] "$2849.38 -- $7449.57 -- $12049.75"

So if our company had advertised on half of the days it didn’t, and only 70% of those ads had been effective, we estimate it missed out on an average of $7,449.57 in additional revenue.
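As a rough cross-check of that headline number, here is a back-of-envelope sketch using the rounded per-post value from earlier: half of the roughly 280 no-post days is 140 extra posting days, and at 70% effectiveness that works out to 98 “effective” post days.

  #back-of-envelope: 98 effective post days at roughly $76 of extra sales each
140 * 0.7 * 76.02
  #roughly 7,450, in line with the $7,449.57 reported above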
