Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu*, Tao Xu*, Yanxin Shi*,
Antoine Atallah*, Ralf Herbrich*, Stuart Bowers, Joaquin Quiñonero Candela
Facebook
1601 Willow Road, Menlo Park, CA, United States
{panjunfeng, oujin, joaquinq, sbowers}@fb.com
ABSTRACT
Online advertising allows advertisers to only bid and pay
for measurable user responses, such as clicks on ads. As a
consequence, click prediction systems are central to most on-
line advertising systems. With over 750 million daily active
users and over 1 million active advertisers, predicting clicks
on Facebook ads is a challenging machine learning task. In
this paper we introduce a model which combines decision
trees with logistic regression, outperforming either of these
methods on its own by over 3%, an improvement with sig-
nificant impact on the overall system performance. We then
explore how a number of fundamental parameters impact
the final prediction performance of our system. Not surpris-
ingly, the most important thing is to have the right features:
those capturing historical information about the user or ad
dominate other types of features. Once we have the right
features and the right model (decision trees plus logistic re-
gression), other factors play small roles (though even small
improvements are important at scale). Picking the optimal
handling for data freshness, learning rate schedule and data
sampling improves the model slightly, though much less than
adding a high-value feature, or picking the right model to
begin with.
1. INTRODUCTION
Digital advertising is a multi-billion dollar industry and is
growing dramatically each year. In most online advertising
platforms the allocation of ads is dynamic, tailored to user
interests based on their observed feedback. Machine learn-
ing plays a central role in computing the expected utility
of a candidate ad to a user, and in this way increases the
* BL now works at Square; TX and YS now work at Quora;
AA now works at Twitter; RH now works at Amazon.
ADKDD'14, August 24–27, 2014, New York, NY, USA.
Copyright 2014 ACM 978-1-4503-2999-6/14/08 $15.00.
http://dx.doi.org/10.1145/2648584.2648589
efficiency of the marketplace.
The 2007 seminal papers by Varian [11] and by Edelman et
al. [4] describe the bid and pay per click auctions pioneered
by Google and Yahoo! That same year Microsoft was also
building a sponsored search marketplace based on the same
auction model [9]. The efficiency of an ads auction depends
on the accuracy and calibration of click prediction. The
click prediction system needs to be robust and adaptive, and
capable of learning from massive volumes of data. The goal
of this paper is to share insights derived from experiments
performed with these requirements in mind and executed
against real world data.
In sponsored search advertising, the user query is used to
retrieve candidate ads, which are explicitly or implicitly
matched to the query. At Facebook, ads are not associated
with a query, but instead specify demographic and interest
targeting. As a consequence of this, the volume of ads that
are eligible to be displayed when a user visits Facebook can
be larger than for sponsored search.
In order to tackle a very large number of candidate ads per
request, where a request for ads is triggered whenever a user
visits Facebook, we would first build a cascade of classifiers
of increasing computational cost. In this paper we focus on
the last-stage click prediction model of the cascade, that is,
the model that produces predictions for the final set of
candidate ads.
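To make the cascade idea concrete, here is a schematic sketch (entirely illustrative; the stage scorers, counts, and feature shapes are our assumptions, not Facebook's system): cheap models prune the candidate set so that only a small remainder ever reaches the expensive final model.

```python
import numpy as np

def cascade(candidates, stages):
    """Score candidates with models of increasing cost, keeping only the
    top-k survivors after each stage. stages: (score_fn, keep_k) pairs,
    cheapest first."""
    for score_fn, keep_k in stages:
        scores = score_fn(candidates)
        keep = np.argsort(scores)[::-1][:keep_k]  # indices of the best scores
        candidates = candidates[keep]
    return candidates

# Toy usage: a cheap scorer that reads few features prunes 100k candidate
# ads to 1k; a costlier scorer that uses all features picks the final 50.
rng = np.random.default_rng(0)
ads = rng.normal(size=(100_000, 16))
w = rng.normal(size=16)
cheap = lambda X: X[:, :4] @ w[:4]
full = lambda X: X @ w
finalists = cascade(ads, [(cheap, 1_000), (full, 50)])  # shape (50, 16)
```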
We find that a hybrid model which combines decision trees
with logistic regression outperforms either of these methods
on their own by over 3%. This improvement has significant
impact on the overall system performance. A number of
fundamental parameters impact the final prediction perfor-
mance of our system. As expected, the most important thing
is to have the right features: those capturing historical in-
formation about the user or ad dominate other types of fea-
tures. Once we have the right features and the right model
(decision trees plus logistic regression), other factors play
small roles (though even small improvements are important
at scale). Picking the optimal handling for data freshness,
learning rate schedule and data sampling improves the model
slightly, though much less than adding a high-value feature,
or picking the right model to begin with.
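As a rough open-source sketch of this hybrid (scikit-learn stands in for the production stack; dataset and model sizes are arbitrary): fit boosted trees, re-represent each input by the index of the leaf it reaches in every tree, and train a logistic regression on the one-hot encoding of those leaf indices. Separate data is used for the two stages so the linear model is not fit on leaves the trees have already memorized.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# Stage 1: boosted decision trees, used here as a feature transform.
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
gbt.fit(X_tree, y_tree)

# apply() yields, per example, the index of the leaf reached in each tree.
leaves = gbt.apply(X_lr)[:, :, 0]

# Stage 2: logistic regression over the one-hot-encoded leaf indices.
lr = LogisticRegression(max_iter=1000)
lr.fit(OneHotEncoder(handle_unknown="ignore").fit_transform(leaves), y_lr)
```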
We begin with an overview of our experimental setup in Sec-
tion 2. In Section 3 we evaluate different probabilistic linear
classifiers and diverse online learning algorithms. In the con-
text of linear classification we go on to evaluate the impact
of feature transforms and data freshness. Inspired by the
practical lessons learned, particularly around data freshness
and online learning, we present a model architecture that in-
corporates an online learning layer, whilst producing fairly
compact models. Section 4 describes a key component re-
quired for the online learning layer, the online joiner, an
experimental piece of infrastructure that can generate a live
stream of real-time training data.
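Section 4 gives the details; as a loose preview of what joining means here, a toy version might buffer impressions and label them by whether a click arrives within a waiting window. The window length and data structures below are our assumptions, not the paper's.

```python
import time
from collections import OrderedDict

CLICK_WINDOW_S = 600     # assumed waiting window, not the paper's value
pending = OrderedDict()  # impression_id -> (features, arrival_time)

def on_impression(imp_id, features):
    """Buffer an impression until a click arrives or the window expires."""
    pending[imp_id] = (features, time.time())

def on_click(imp_id, emit):
    """A click inside the window turns the buffered impression into a positive."""
    if imp_id in pending:
        features, _ = pending.pop(imp_id)
        emit(features, 1)

def flush_expired(emit):
    """Impressions that outlive the window are emitted as negative examples."""
    now = time.time()
    while pending:
        imp_id, (features, t0) = next(iter(pending.items()))
        if now - t0 < CLICK_WINDOW_S:
            break  # OrderedDict preserves arrival order, so the rest are newer
        del pending[imp_id]
        emit(features, 0)
```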
Lastly we present ways to trade accuracy for memory and
compute time and to cope with massive amounts of training
data. In Section 5 we describe practical ways to keep mem-
ory and latency contained for massive scale applications and
in Section 6 we delve into the tradeoff between training data
volume and accuracy.
2. EXPERIMENTAL SETUP
In order to achieve rigorous and controlled experiments, we
prepared offline training data by selecting an arbitrary week
of the 4th quarter of 2013. In order to maintain the same
training and testing data under different conditions, we pre-
pared offline training data which is similar to that observed
online. We partition the stored offline data into training and
testing and use them to simulate the streaming data for on-
line training and prediction. The same training/testing data
are used as testbed for all the experiments in the paper.
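A minimal sketch of such a partition, under assumed names (the file and column names are ours; the paper does not publish its schema): sort the week's logged impressions by time and hold out the tail as the test stream, so that training always precedes testing, as it would online.

```python
import pandas as pd

# Hypothetical log: one row per ad impression with a timestamp and click label.
impressions = pd.read_parquet("impressions_2013q4_week.parquet")
impressions = impressions.sort_values("timestamp")

# Hold out the tail of the week as the simulated online test stream.
cutoff = int(len(impressions) * 0.8)  # assumed 80/20 split
train, test = impressions.iloc[:cutoff], impressions.iloc[cutoff:]
```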
Evaluation metrics: Since we are most concerned with
the impact of the factors on the machine learning model,
we use the accuracy of prediction instead of metrics directly
related to profit and revenue. In this work, we use Normal-
ized Entropy (NE) and calibration as our major evaluation
metrics.
Normalized Entropy, or more accurately Normalized Cross-
Entropy, is equivalent to the average log loss per impression
divided by what the average log loss per impression would
be if a model predicted the background click through rate
(CTR) for every impression. In other words, it is the pre-
dictive log loss normalized by the entropy of the background
CTR. The background CTR is the average empirical CTR
of the training data set. It would perhaps be more descrip-
tive to refer to the metric as the Normalized Logarithmic
Loss. The lower the value, the better the prediction made
by the model. The reason for this normalization is
that the closer the background CTR is to either 0 or 1, the
easier it is to achieve a better log loss. Dividing by the en-
tropy of the background CTR makes the NE insensitive to
the background CTR. Assume a given training data set has
$N$ examples with labels $y_i \in \{-1, +1\}$ and estimated
probability of click $p_i$ where $i = 1, 2, \dots, N$. Denote
the average empirical CTR as $p$.
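Written out, the definition above gives the following, with the numerator being the average log loss per impression and the denominator the entropy of the background CTR:

$$
\mathrm{NE} = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log(p_i) + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)}
$$

For concreteness, a minimal NumPy sketch of the same computation (our own helper, not code from the production system):

```python
import numpy as np

def normalized_entropy(y, p):
    """NE: average log loss of predictions p for labels y in {-1, +1},
    divided by the entropy of the background (average empirical) CTR."""
    y01 = (1 + y) / 2                # map {-1, +1} labels to {0, 1}
    log_loss = -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
    ctr = y01.mean()                 # background CTR of the data set
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return log_loss / background
```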