Is our company’s Facebook advertising even worth the effort?

### QUESTION:

A company would like to know if their advertising is effective. Before you start, yes…. Facebook does have analytics for users who actually utilize their advertising platform. Our customer does not. Their “advertisements” are posts on their feed and are not marketed by Facebook.

### DATA:

Data is from the client’s POS system and their Facebook feed.

### MODEL:

KISS. A simple linear model will suffice.

**First, we need to obtain our data. We can use a nice Facebook scraper to scrape the last posts in a usable format. **

#install & load scraper !pip install facebook_scraper from facebook_scraper import get_posts import pandas as pd

**Lets first scrape the posts from the first 200 posts of their Facebook page.**

#scrape post_list = [] for post in get_posts('clients_facebook_page', pages=200): post_list.append(post)

#View the data print(post_list[0].keys()) print("Number of Posts: {}".format(len(post_list)))

## dict_keys(['post_id', 'text', 'post_text', 'shared_text', 'time', 'image', 'video', 'video_thumbnail', 'likes', 'comments', 'shares', 'post_url', 'link']) ## Number of Posts: 38

**Lets clean up the list, keeping only Time, Image, Likes, Comments, Shares.**

post_list_cleaned = [] for post in post_list: #create a list of indexes to keep temp = [] indexes_to_keep = ['time', 'image', 'likes', 'comments', 'shares'] for key in indexes_to_keep: temp.append(post[key]) post_list_cleaned.append(temp) #Remove image hyperlink, replace with 0, 1 & recast date for post in post_list_cleaned: if post[1] == None: post[1] = 0 else: post[1] = 1 post[0] = post[0].date

**We now need to combine the Facebook data with data from the company’s POS system. **

#turn into a DataFrame fb_posts_df = pd.DataFrame(post_list_cleaned) fb_posts_df.columns = ['Date', 'image', 'likes', 'comments', 'shares'] #import our POS data daily_sales_df = pd.read_csv('daily_sales.csv') #merge both sets of data combined_df = pd.merge(daily_sales_df, fb_posts_df, on='Date', how='outer')

**Finally, lets export the data to a csv. We’ll do our modeling in R.**

combined_df.to_csv('data.csv')

### R Analysis

**First, lets import our data from python. We then need to ensure the variables are cast appropriate (ie, dates are actual datetime fields and not just strings’). Finally, we are only conserned with data since the start of 2019.**

library(readr) library(ggplot2) library(gvlma) #set a seed to be reproducable data <- read.table("data.csv", header = TRUE, sep = ",") data <- as.data.frame(data) #rename 'i..Date' to 'Date' names(data)[1] <- c("Date") #set data types data$Sales <- as.numeric(data$Sales) data$Date <- as.Date(data$Date, "%m/%d/%Y") data$Image <- as.factor(data$Image) data$Post <- as.factor(data$Post) #create a set of only 2019+ data data_PY = data[data$Date >= '2019-01-01',] head(data)

Date<date> | Sales<dbl> | Post<fctr> | Image<fctr> | Likes<int> | Comments<int> | Shares<int> | |
---|---|---|---|---|---|---|---|

1 | 2017-01-02 | 3274.132 | 0 | 0 | 0 | 0 | 0 |

2 | 2017-01-03 | 3216.918 | 0 | 0 | 0 | 0 | 0 |

3 | 2017-01-04 | 3367.194 | 0 | 0 | 0 | 0 | 0 |

4 | 2017-01-05 | 3313.034 | 0 | 0 | 0 | 0 | 0 |

5 | 2017-01-06 | 3281.882 | 1 | 1 | 2 | 0 | 0 |

6 | 2017-01-07 | 3206.144 | 1 | 1 | 2 | 0 | 0 |

6 rows

head(data_PY)

Date<date> | Sales<dbl> | Post<fctr> | Image<fctr> | Likes<int> | Comments<int> | Shares<int> | |
---|---|---|---|---|---|---|---|

610 | 2019-01-02 | 3284.688 | 0 | 0 | 0 | 0 | 0 |

611 | 2019-01-03 | 3380.694 | 0 | 0 | 0 | 0 | 0 |

612 | 2019-01-04 | 3604.768 | 0 | 0 | 0 | 0 | 0 |

613 | 2019-01-05 | 3665.378 | 0 | 0 | 0 | 0 | 0 |

614 | 2019-01-07 | 3296.450 | 0 | 0 | 0 | 0 | 0 |

615 | 2019-01-08 | 3315.912 | 0 | 0 | 0 | 0 | 0 |

6 rows

summary(data_PY)

## Date Sales Post Image Likes ## Min. :2019-01-02 Min. :3181 0:281 0:287 Min. : 0.000 ## 1st Qu.:2019-04-12 1st Qu.:3370 1: 64 1: 58 1st Qu.: 0.000 ## Median :2019-07-24 Median :3456 Median : 0.000 ## Mean :2019-07-24 Mean :3495 Mean : 3.983 ## 3rd Qu.:2019-11-02 3rd Qu.:3606 3rd Qu.: 0.000 ## Max. :2020-02-15 Max. :4432 Max. :115.000 ## Comments Shares ## Min. : 0.0000 Min. :0 ## 1st Qu.: 0.0000 1st Qu.:0 ## Median : 0.0000 Median :0 ## Mean : 0.3101 Mean :0 ## 3rd Qu.: 0.0000 3rd Qu.:0 ## Max. :19.0000 Max. :0

**Now that our data’s in, let’s review our summary. We can see our data starts on Jan. 2, 2019 (as we hoped), but we do see one slight problem. When we look at the Post variable, we see it’s highly imbalanced. We have 281 days with no posts and only 64 days with posts. We should re-balance our dataset before doing more analysis to ensure our results aren’t skewed.**

**I’ll rebalance our data by sampling from the larger group (days with no posts), known as undersampling. I’ll also set a random seed so that our numbers are reproducible.**

set.seed(15) zeros = data_PY[data_PY$Post == 0,] samples = sample(281, size = (345-281), replace = FALSE) zeros = zeros[samples, ] balanced = rbind(zeros, data_PY[data_PY$Post == 1,]) summary(balanced$Post)

## 0 1 ## 64 64

**Perfect, now our data is balanced. We should also do some EDA on our dependent variable (Daily Sales). It’s a good idea to know what our distribution looks like and if we have outliers we should address.**

hist(balanced$Sales)

boxplot(balanced$Sales)

**We can see our data is slightly skewed. Sadly, with real-world data our data is never a perfect normal distribution… Luckily though, we appear to have no outliers in our boxplot.**

**Now we can begin modeling. Since we’re interested in understanding the dynamics of the system and not actually classifying or predicting, we’ll use a standard regression model.**

model1 <- lm(data=balanced, Sales ~ Post) summary(model1)

## ## Call: ## lm(formula = Sales ~ Post, data = balanced) ## ## Residuals: ## Min 1Q Median 3Q Max ## -316.22 -114.73 -29.78 111.17 476.49 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3467.1 20.5 169.095 < 2e-16 *** ## Post1 77.9 29.0 2.687 0.00819 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 164 on 126 degrees of freedom ## Multiple R-squared: 0.05418, Adjusted R-squared: 0.04667 ## F-statistic: 7.218 on 1 and 126 DF, p-value: 0.008193

gvlma(model1)

## ## Call: ## lm(formula = Sales ~ Post, data = balanced) ## ## Coefficients: ## (Intercept) Post1 ## 3467.1 77.9 ## ## ## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS ## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM: ## Level of Significance = 0.05 ## ## Call: ## gvlma(x = model1) ## ## Value p-value Decision ## Global Stat 8.351e+00 0.07952 Assumptions acceptable. ## Skewness 6.187e+00 0.01287 Assumptions NOT satisfied! ## Kurtosis 7.499e-01 0.38651 Assumptions acceptable. ## Link Function -1.198e-13 1.00000 Assumptions acceptable. ## Heteroscedasticity 1.414e+00 0.23435 Assumptions acceptable.

**Using a standard linear model, we obtain a result that says, on average, a FaceBook post increases daily sales by $77.90. We can see, based on the t-statistic and p-value that this result is highly statistically significant. We can use the GVLMA feature to ensure our model passes the underlying OLS assumptions. Here, we see we pass on all levels except skewness. We already identified earlier that skewness may be a problem with our data.**

**A common correction for skewness is a log transformation. Let’s transform our dependent variable and see if it helps. Note that this model (a log-lin model) will produce coefficients with different interpretations than our last model.**

model2 <- lm(data=balanced, log(Sales) ~ Post) summary(model2)

## ## Call: ## lm(formula = log(Sales) ~ Post, data = balanced) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.092228 -0.032271 -0.007508 0.032085 0.129686 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.150154 0.005777 1410.673 < 2e-16 *** ## Post1 0.021925 0.008171 2.683 0.00827 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.04622 on 126 degrees of freedom ## Multiple R-squared: 0.05406, Adjusted R-squared: 0.04655 ## F-statistic: 7.201 on 1 and 126 DF, p-value: 0.008266

gvlma(model2)

## ## Call: ## lm(formula = log(Sales) ~ Post, data = balanced) ## ## Coefficients: ## (Intercept) Post1 ## 8.15015 0.02193 ## ## ## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS ## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM: ## Level of Significance = 0.05 ## ## Call: ## gvlma(x = model2) ## ## Value p-value Decision ## Global Stat 7.101e+00 0.13063 Assumptions acceptable. ## Skewness 4.541e+00 0.03309 Assumptions NOT satisfied! ## Kurtosis 1.215e+00 0.27030 Assumptions acceptable. ## Link Function -6.240e-14 1.00000 Assumptions acceptable. ## Heteroscedasticity 1.345e+00 0.24614 Assumptions acceptable.

plot(model2)

hist(log(balanced$Sales))

**Our second model produces another highly significant coefficient for the FaceBook post variable. Here we see that each post is associated with an average 2.19% increase in daily sales. Unfortunately, even our log transformation was unable to correct for the skewness in our data. We’ll have to note this when presenting our findings later.**

**Let’s now determine how much the 2.19% is actually worth (since we saw in model 1 that a post was worth $77.90).**

mean_sales_no_post <- mean(balanced$Sales[balanced$Post == 0]) mean_sales_with_post <- mean(balanced$Sales[balanced$Post == 1]) mean_sales_no_post * model1$coefficients['Post1']

## Post1 ## 270086.1

**Very close. Model 2’s coefficient equates to $76.02, which is very similar to our 77.90.**

**Let’s now run another test to see if we get similar results. In analytics, it’s always helpful to arrive at the same conclusion via different means, if possible. This helps solidify our results.**

**Here we can run a standard T-test. Yes, yes, for those other analyst reading, a t-test is the same metric used in the OLS (hence the t-statistic it produces). Here, however, let’s run it on the unbalanced dataset to ensure we didn’t miss anything in sampling our data (perhaps we sampled really good or really bad data that will skew our results).**

t_test <- t.test(data_PY$Sales[data_PY$Post == 1],data_PY$Sales[data_PY$Post == 0] ) t_test

## ## Welch Two Sample t-test ## ## data: data_PY$Sales[data_PY$Post == 1] and data_PY$Sales[data_PY$Post == 0] ## t = 2.5407, df = 89.593, p-value = 0.01278 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 13.3264 108.9259 ## sample estimates: ## mean of x mean of y ## 3544.970 3483.844

summary(t_test)

## Length Class Mode ## statistic 1 -none- numeric ## parameter 1 -none- numeric ## p.value 1 -none- numeric ## conf.int 2 -none- numeric ## estimate 2 -none- numeric ## null.value 1 -none- numeric ## stderr 1 -none- numeric ## alternative 1 -none- character ## method 1 -none- character ## data.name 1 -none- character

ggplot(data = data_PY, aes(Post, Sales, color = Post)) + geom_boxplot() + geom_jitter()

**Again, we receive a promising result. On all of the data, our t-statistic was 2.54, meaning we reject the null hypothesis that the difference in the means between the two groups is 0. Our T-test produces a confidence interval of [13.33, 89.59]. This interval includes our previous findings, once again giving us some added confidence.**

**So now we know our FaceBook posts are actually benefiting the business by generating additional daily sales. However, we also saw earlier that our company hasn’t been posting reguarly (which is why we had to rebalance the data).**

length(data_PY$Sales[data_PY$Post == 0])

## [1] 281

ggplot(data = data.frame(post=as.factor(c('No Post','Post')), m=c(280, 54)) ,aes(x=post, y=m)) + geom_bar(stat='identity', fill='dodgerblue3')

**Let’s create a function that takes two inputs: 1) % of additional days advertised, 2) % of advertisements that were effective.**

**The reason for the second argument is that it’s likely unrealistic to assume all over our ads are effective. Indeed, it’s likely we have diminishing returns with more posts (as people probably tire of seeing them or block them in their feed if they become too frequent). Limiting effectiveness will give us some sense of a more reasonable estimate of lost revenue.**

**Another benefit of creating a custom function is that we can quickly re-run the calculations if management desires.**

#construct a 95% confidence interval around Post1 coefficient conf_interval = confint(model2, "Post1", .90) missed_revenue <- function(pct_addlt_adv, pct_effective){ min = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * conf_interval[1] mean = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * model2$coefficients['Post1'] max = pct_addlt_adv * pct_effective * 280 * mean_sales_no_post * conf_interval[2] print(paste(pct_addlt_adv * 280, "additional days of advertising")) sprintf("$%.2f -- $%.2f -- $%.2f",max(min, 0), mean, max } #Missed_revenue(% of additional days advertised, % of advertisements that were effective) missed_revenue(.5, .7)

## [1] "140 additional days of advertising"

## [1] "$2849.38 -- $7449.57 -- $12049.75"

**So if our company had advertised half of the days they didn’t, and only 70% of those adds were effective, we’d have missed out on an average of $7,449.57.**