Leveraging Power of Reinforcement Learning in Digital Marketing

Sourish Dey · Published in DataDrivenInvestor · 11 min read · May 22, 2020


In my previous article, I discussed an advanced analytics solution to increase campaign ROI or Return on Marketing Investment (ROMI) through propensity modeling techniques. While supervised learning is the most widely used method in the predictive analytics industry (at least so far), it has a few limitations.

Firstly, supervised learning uses a static list of attributes, which, if not refreshed at regular intervals, becomes irrelevant over time. These techniques are not capable of adaptive learning/self-learning (discovering new intelligence beyond the rules learned during training), i.e. when the characteristics of the independent variables (IVs) or the population change in new data, the algorithm’s performance degrades. We tend to see that models are most predictive the closer they are launched to the learning period. Over time, external factors (such as shifts in customers’ purchasing behavior) reduce the predictive power of the model, leading to the need for an offline rebuild process. Secondly, it is very hard to build a great model with bad data: ‘garbage in, garbage out’.

Here reinforcement learning/online learning (a foundation stone of AI) solves these problems to a great extent. While a supervised model is trained once to predict a class, a reinforcement learning algorithm learns from reward/punishment and keeps updating itself over time. So while supervised learning is stateless, RL has a temporal dimension: it looks not only at past states but also at the current state. Because the algorithm is updated in near real time with dynamic attribute lists, optimizations occur close to real time, minimizing the impact of external factors. The performance characteristic is therefore just the opposite of supervised learning: initially performance is sub-optimal, but as the algorithm sees more data and learns from the reward/punishment structure, its predictive power increases with an updated guiding principle.

Fig: Performance characteristics over time, Supervised Model vs. Reinforcement Learning (Source: Image by Author)

In this article we will solve a business problem in the digital marketing arena using the Upper Confidence Bound (UCB) algorithm, a member of the Reinforcement Learning family.

Business scenario and RL framework

As credit card companies and banks aspire and compete to be fully digital, with limited marketing budgets and deadlines to realize a tangible dollar impact, it is imperative to shift from the conventional test-and-target strategy to a next-gen approach. So far, in digital marketing and web analytics, A/B testing has been the prevalent method to compare digital campaigns, choose the winning ad and decide the targeting strategy.

Although it is easy to design and implement, this classical hypothesis-based approach has a few drawbacks.

1) No memory: Each test is independent and not linked with the knowledge gained from earlier campaigns. Previous knowledge (a prior probability) generally strengthens the confidence in the posterior probability and hence the decision-making process.

2) Slow and expensive: It is a two-step process: first it explores the opportunities through a test campaign, and then, based on the comparative analysis, it exploits the winning digital ad or creative to reap the dollar benefit. A reinforcement learning algorithm, on the other hand, acts as a self-learning engine and optimizes exploration and exploitation at the same time.

3) No guiding principle: There is no self-learning and feedback system to continually update outcomes. Just because creative X performed better than creative Y one year ago does not mean that it will still perform better now.

In the next part of the article we will discuss the business case and a reinforcement learning-based solution, including implementation in Python and R.

Digital Campaign case study:

A multinational bank is planning to acquire customers through a new campaign in the digital space, providing a special spend offer on movie tickets for one of its newly launched credit card products. A detailed cost-benefit analysis has been done and decisions about the offer structure have been made. However, the digital marketing team is exploring six different creatives, and based on the reward/return (click-through rate, CTR) in the test campaign, the winning ad will be shown on the home page, social media and other partner sites. One option is to run multiple A/B tests with large enough samples and choose the best ad, which is both time-consuming and expensive, and as a pure exploration method it offers no simultaneous exploitation of the best ad.

Let’s see how the Upper Confidence Bound (UCB) algorithm, a reinforcement learning technique closer to the field of Artificial Intelligence (AI), helps us do exploration and exploitation at the same time to maximize our returns.

Fig- Creative versions for Digital Campaign

So in a nutshell our problem has the following steps:

i) We have six creative versions (d) for a digital campaign, and each time a user/visitor connects to the web page we display one version of this ad.

ii) Each time a user connects to this web page counts as a round (n).

iii) At each round n, creative d gets a reward rd(n) ϵ {0,1}: rd(n) = 1 if the user clicks on the ad, 0 if the user didn’t.

iv) Our goal is to maximize the total reward we get over many rounds (n), as formalized below.
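For completeness, the objective in step iv) can be written compactly as a formula (the symbol d_n, denoting the creative shown in round n, is introduced here only for this formulation):

\max \sum_{n=1}^{N} r_{d_n}(n), \qquad r_{d}(n) \in \{0, 1\}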

UCB algorithm and Multi-Armed Bandits problem:

The intuition behind the UCB algorithm comes from the working principle of casino slot machines, in the context of the famous Multi-Armed Bandits problem (choosing the best slot machine, among machines with different probabilities of success, to maximize profit). Clearly our business problem is no different from the Multi-Armed Bandits problem, and the following are the steps of a simple version of the UCB algorithm:

Fig 2: UCB Algorithm and Multi-armed Bandit (Source:Image by Author)
Pic source: https://cxl.com/blog/bandit-tests/

In simple words, the algorithm initially assumes some prior information (or forms an assumption after a few rounds) about each option: the probability distribution of customer preference and the expected return of that distribution. During the iterations the algorithm gains intelligence from user actions, reward (click) or punishment (no click), and in each round (iteration) it updates the expected value (average reward) as specified in step 2. The second thing that happens in step 2 is that, with additional information about each option after every round, it becomes more confident about the expected value; as the sample size increases it continually updates the confidence interval (refer to step 2) by shrinking the confidence bound. After gathering a large sample over thousands of iterations, when it is more confident about the overall picture, the confidence intervals narrow further around the expected values and the algorithm finally finds the best option. Then it starts exploiting, by showing only the optimal ad, to maximize ROI. This is exactly how you find the best slot machine in the Multi-Armed Bandits problem and then exploit that machine to make money.
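Before jumping into the full implementation, here is a minimal sketch of the quantity computed in step 2, written as a stand-alone Python function. The function name and the 3/2 constant follow the version used in the code later in this article; they are illustrative rather than the only valid choice:

import math

def ucb_bound(sum_reward_j, n_selected_j, round_n):
    # Average observed reward for ad version j (exploitation term)
    avg_reward = sum_reward_j / n_selected_j
    # Confidence radius (exploration term): shrinks as ad j is shown more often
    delta_j = math.sqrt(3 / 2 * math.log(round_n + 1) / n_selected_j)
    return avg_reward + delta_j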

Implementation

Now it’s time to do some coding to follow the above steps and understand the entire process more clearly. The data set for this exercise can be downloaded from here (this data set is prepared purely for academic interest and has no linkage to any business data). Here we will see, from only 10k user actions, how UCB chooses the optimal ad and starts exploiting that option to minimize marketing cost (the more ads shown on social media, the more cost is incurred). Six different ad versions are shown to the users and users’ actions are tagged (a click means reward = 1, otherwise reward = 0). [Please note this is a simulated data set for illustration purposes. In real life we start from scratch (no data), and on the basis of user actions over the course of the iterations the algorithm decides and updates its strategy for which ad version to show. That is why reinforcement learning is also called online learning/interactive learning.]

As there is no ready-made package for this algorithm, we will build it from scratch. This is a simple version of the implementation, and of course there are other ways to implement UCB.

Coding in Python

First let’s see what the reward is without adopting any strategy (random selection), and then we will compare the performance of UCB with this random display strategy.

#importing necessary libraries
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import random
import math
#Dataset reading
os.chdir('C://practice/Reinforcement/UCB')
os.getcwd()
ucb = pd.read_csv('ucb_data.csv')
# check out first 10 data points
ucb.head(10)

The dataset maps user actions, e.g. user 6 clicks on ad2 but doesn’t click on any other ad.
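If you cannot download the file, a comparable simulated dataset can be generated locally as a stand-in. The column names and per-ad click probabilities below are pure assumptions for illustration, not the actual campaign numbers:

import numpy as np
import pandas as pd

np.random.seed(42)
true_ctr = [0.10, 0.13, 0.25, 0.11, 0.09, 0.12]  # hypothetical CTR for each ad version
sim = pd.DataFrame({'ad' + str(j + 1): np.random.binomial(1, p, size=10000)
                    for j, p in enumerate(true_ctr)})
sim.insert(0, 'user', list(range(1, 10001)))  # user id column, dropped later in the code
sim.to_csv('ucb_data.csv', index=False)  # same file name as read above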

We randomly choose ads in different rounds and save the total clicks in the ‘total_reward_rand’ variable; the ‘ad_selected’ list stores the different ad versions selected over the rounds.

#Initialization
user = ucb.user
ucb = ucb.drop(['user'], axis = 1) #Dropping user field
N = 10000 # no of iterations
d = 6 # no of ad versions
total_reward_rand = 0 #variable to store total rewards or clicks throughout the iterations

#Loop through the iterations and random selection of ad
ad_selected = [] #List to store diff ad versions selected over iterations
for i in range(0,N): #For loop from 0 to 9999
    ad = random.randrange(d) #Pick one of the d ad versions at random
    ad_selected.append(ad)
    reward = ucb.values[i,ad] #User action (click/no click) for the selected ad at round i
    total_reward_rand = total_reward_rand + reward

Let’s see how many clicks we get without any strategy:

print('total clicks without any strategy: {}'.format(total_reward_rand))
#plot- Distribution of ad_selected
plt.hist(ad_selected)
plt.title('Distribution of ads')
plt.xlabel('ad_no')
plt.ylabel('count')
plt.show()

total clicks without any strategy: 1287

As expected, with random selection the different ads are selected almost uniformly and the click rate is approximately 12.87% (1287/10000). [In Python indices start from zero, so Ad1 is represented as 0 and so on.] Also, as it’s a random process, the result might be slightly different in another run.

Let’s see whether the UCB algorithm improves the click rate by exploiting the ad that has the highest probability of being clicked. The code lines are commented and largely self-explanatory; however, feel free to ask.

#Initialization
N = 10000 # no of iterations
d = 6 # no of ad versions
ad_selected = [] #List to store diff ad versions selected over iterations
no_selection = [0] * d #List to store no of times diff ad versions selected
sum_reward = [0] * d #List to store no of clicks diff ad versions get upto round N
total_reward_ucb = 0 # To store Total No clicks upto round N

As mentioned in Fig 2, for each iteration the inner loop calculates two numbers: 1) the average reward for ad version j, and 2) the upper confidence bound for that ad version. The variable ‘ad’ stores the ad version with the highest UCB and is updated whenever upper_bnd > max_upper_bnd; max_upper_bnd is updated as well.

#loop through no of rounds
for i in range(0,N):
    max_upper_bnd = 0 # Variable to store value of max UCB and resets at each round
    ad = 0 #temporary variable to store ad version having highest UCB and starts with ad1
    for j in range(0,d): #Here loop starts with 0 index till index 5
        if (no_selection[j] > 0):
            avg_reward = sum_reward[j]/no_selection[j]
            delta_j = math.sqrt(3/2 * math.log(i + 1)/no_selection[j]) #Confidence radius, refer Step 2 of Fig 2
            upper_bnd = avg_reward + delta_j #UCB calculation for ad version j for iteration i
        else:
            upper_bnd = 1e5 #High value assigned during initial rounds
        if upper_bnd > max_upper_bnd:
            max_upper_bnd = upper_bnd
            ad = j
    ad_selected.append(ad) #List updates with ad selected at each round
    no_selection[ad] = no_selection[ad] + 1
    reward = ucb.values[i,ad] #Picks the user action- click/no click for selected ad at round i
    sum_reward[ad] = sum_reward[ad] + reward #List updates to store rewards for diff ad versions
    total_reward_ucb = total_reward_ucb + reward #Total rewards/clicks updates after every round

However, initially we have no_selection = 0 for every ad version j, so for the first 6 rounds the strategy is simply to select each ad version once: in round 1 ad0 is displayed, in round 2 ad1, and so on. Once we have some information about the rewards (clicks) for each ad version from these initial rounds, the UCB algorithm proper starts working. The initial if-else clause shows ads according to the UCB formula only when ad version j has been selected at least once; otherwise upper_bnd is assigned a very high value (1e5), so that the sequence of ad selection is maintained during the initial rounds (any ad version not yet selected wins the round, e.g. ad2 in round 3).
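A quick sanity check of this behaviour, assuming the loop above has just been run:

print(ad_selected[:d]) # expected: [0, 1, 2, 3, 4, 5], each version shown once first
print(no_selection)    # after all rounds, every ad version has been selected at least once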

That’s it! The UCB algorithm is implemented, and now it’s time to see the result:

print('total clicks with UCB Algorithm: {}'.format(total_reward_ucb))
#plot- Distribution of ad_selected
plt.hist(ad_selected)
plt.title('Distribution of ads')
plt.xlabel('ad_no')
plt.ylabel('count')
plt.show()

total clicks with UCB Algorithm: 2028

The result is exciting! With UCB we get a click rate of 20.28%, a significant lift of 1.58 times over the random strategy. It is also evident from the histogram of the ads selected that Ad 3 (‘MORE SCI-FI FOR LESS’) is the clear winner. That means UCB is able to show the ad version that has the highest probability of being clicked.

Let’s further explore the two lists ad_selected and sum_reward.

#How many times each ad was selected from round 1000 till the end
no_displayed = pd.Series(ad_selected[1000:N]).value_counts()
select_rate = no_displayed/len(ad_selected[1000:N])
#Let's also look at the click-through rate of the diff ad versions, aligned by ad index
click_probability = pd.Series(sum_reward)/pd.Series(no_selection)
ucb_output = pd.DataFrame({'Ad Version':no_displayed.index, 'No Displayed' : no_displayed.values, 'Display Rate' :select_rate.values, 'Click probability(CTR)':click_probability[no_displayed.index].values})
ucb_output

So from the output, even after only 1k rounds the algorithm has fixed its strategy of displaying the winning ad version. We can see that in almost 98% of instances it shows Ad 3 (index 2 as per the Python indexing standard), which has the highest click probability (CTR); the algorithm starts exploiting the best ad to get an increased number of clicks and subsequently more conversions. This is just a miniature data set. In real life the data set will be humongous, and so will be the realization of the dollar benefit.
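One optional way to visualise when the algorithm locks in on the winner is to plot the rolling share of rounds in which the winning ad (index 2, as found above) is shown; the 500-round window is an arbitrary choice:

#Rolling share of rounds in which the winning ad (index 2) was displayed
winner_share = (pd.Series(ad_selected) == 2).astype(int).rolling(window=500).mean()
plt.plot(winner_share)
plt.title('Rolling share of rounds showing the winning ad (window = 500)')
plt.xlabel('round')
plt.ylabel('share')
plt.show()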

Implementation in R

You can find the R code for the UCB implementation here. The logic is the same; run it and see whether the results match.

Road Ahead:

This is just a baby step towards building an intelligent digital marketing ecosystem, and there are immense opportunities to utilize the untapped potential of reinforcement learning. The power of big data combined with efficient analytics is inspiring a whole new wave of intelligent automation that the human race is just beginning to tap into. Digital marketing is one of the many fields this wave is disrupting. In my next article I will introduce another AI-based approach, Thompson Sampling, to see whether it offers a better solution to this problem. Until then, ‘stay hungry, stay foolish’.



Sourish has 9+ years of experience in Data Science, Machine Learning, Business Analytics and Consulting across domains, and is currently based in Berlin with dunnhumby gmbh.