Naive Bayes classifiers deserve their place in Machine Learning 101 as one of the simplest and fastest algorithms for classification. This post explains a very straightforward implementation in TensorFlow that I created as part of a larger system. You can find the code here.
The textbook application of Naive Bayes (NB) classifiers is spam filtering, where word frequency counts are used to classify whether a given message is spam or not. In this example, however, we're going to use continuous data instead. More specifically, we'll be classifying flowers based on measurements of their petal sizes.
Just a quick refresher on the NB algorithm: as with any classifier, the training data is a set of training examples $(\mathbf{x}, y)$, where $\mathbf{x} = (x_1, \dots, x_n)$ is a vector of $n$ features and $y$ is the class label, for each of the available samples. The goal is to estimate $P(y \mid x_1, \dots, x_n)$, the probability of each class given the observed features.
The math develops as follows. First, we write down Bayes' theorem:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
By the definition of conditional probability, the numerator is just the joint probability distribution $P(y, x_1, \dots, x_n)$, which can be factored with the chain rule:

$$P(y, x_1, \dots, x_n) = P(y)\, P(x_1 \mid y)\, P(x_2 \mid y, x_1) \cdots P(x_n \mid y, x_1, \dots, x_{n-1})$$
By the naive assumption, the conditioning of each feature on the other features can be dropped. In other words, we assume that the value taken by each feature depends on the class only, and not on the values taken by the other features. This lets us simplify the previous formula quite a bit:

$$P(y, x_1, \dots, x_n) = P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Going back to Bayes' theorem, we observe that we can discard the denominator, since it's just a normalization factor that doesn't depend on the class. We get:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
The last formula shows why NB is so convenient: all we need to estimate are the class priors $P(y)$ and the univariate conditionals $P(x_i \mid y)$, one per class and feature, which is much easier than modeling the full joint distribution of the features. Since our features are continuous, we'll model each $P(x_i \mid y)$ as a Gaussian, i.e. Gaussian Naive Bayes.
To wrap up the algorithm, we need to make a decision based on these conditional probabilities. An intuitive and very common strategy is **Maximum a Posteriori (MAP)**: we simply pick the most likely class:

$$\hat{y} = \underset{y}{\operatorname{argmax}}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
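To make the MAP rule concrete, here is a tiny sketch with made-up numbers (not part of the post's code): three classes, a uniform prior, and two features per sample.

```python
import numpy as np

prior = np.array([1/3, 1/3, 1/3])
likelihood_per_feature = np.array([  # hypothetical P(x_i | y), one row per class
    [0.10, 0.30],
    [0.60, 0.50],
    [0.05, 0.20]])

posterior_unnorm = prior * likelihood_per_feature.prod(axis=1)
y_hat = posterior_unnorm.argmax()    # MAP picks class 1 here
```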
With `sklearn`, loading the dataset and training the classifier is trivial:
```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
gnb = GaussianNB()
gnb.fit(X, y)
```
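From here, predictions are a single call away; as a quick sanity check (not part of the original snippet):

```python
print(gnb.predict(X[:3]))  # predicted labels for the first three samples
print(gnb.score(X, y))     # accuracy on the training set
```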
This is how a plot of the first 3 principal components looks in 3D:
You may want to follow along with the actual code here.
We start by grouping the training samples based on their labeled class, and get an array of shape `(nb_classes, nb_samples, nb_features)`.
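The grouping can be done in plain numpy before handing the data to TensorFlow; a minimal sketch (the variable names are assumptions, and stacking into a single dense array like this relies on every class having the same number of samples, which is true for Iris):

```python
import numpy as np

points_by_class = np.array([
    [x for x, t in zip(X, y) if t == c]
    for c in np.unique(y)])
# shape: (nb_classes, nb_samples, nb_features)
```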
Based on the discussion above, we can fit individual Gaussian distributions to each combination of labeled class and feature. It's important to point out that, even if we're feeding the data in one go, we are fitting a series of univariate distributions, rather than a multivariate one:
```python
mean, var = tf.nn.moments(tf.constant(points_by_class), axes=[1])
```
In this trivial example, we're using `tf.constant` to get the training data inside the TensorFlow graph. In real life, you probably want to use `tf.placeholder` or even more performant alternatives like `tf.data` (see the documentation).
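For reference, a `tf.data` input pipeline in the TensorFlow 1.x style used throughout this post might look roughly like this (just an API sketch, not part of the classifier):

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(150).batch(32)
features, labels = dataset.make_one_shot_iterator().get_next()
# `features` and `labels` can now be used like any other tensors in the graph
```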
We take advantage of TensorFlow's `tf.distributions` module to create a Gaussian distribution with the estimated mean and variance:
```python
self.dist = tf.distributions.Normal(loc=mean, scale=tf.sqrt(var))
```
This distribution is the only thing we need to keep around for inference, and it's luckily pretty compact: the mean and variance tensors each have shape `(nb_classes, nb_features)`.
For inference, it's important to work in log probability space to avoid numerical errors due to the repeated multiplication of small probabilities. We have:

$$\log \left( P(y) \prod_{i=1}^{n} P(x_i \mid y) \right) = \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y)$$
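To see why this matters: multiplying a few hundred small probabilities already underflows in double precision, while the equivalent sum of logs is perfectly representable. A quick illustration (not from the post):

```python
import numpy as np

p = np.full(200, 1e-2)      # 200 small per-feature probabilities
print(np.prod(p))           # 0.0 -- the product underflows
print(np.sum(np.log(p)))    # about -921.0, no numerical trouble
```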
To take care of the first term, we can assume that all classes are equally likely (i.e. uniform prior):
```python
priors = np.log(np.array([1. / nb_classes] * nb_classes))
```
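The Iris classes are perfectly balanced, so a uniform prior is exact here; with imbalanced data you could instead estimate the priors from the label frequencies, e.g. (assuming `y` holds the integer training labels):

```python
priors = np.log(np.bincount(y) / len(y))
```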
To compute the sum in the second term, we duplicate (tile) the feature vectors along a new "class" dimension, so that we can get probabilities from the distribution in a single run:
```python
# (nb_samples, nb_classes, nb_features)
all_log_probs = self.dist.log_prob(
    tf.reshape(
        tf.tile(X, [1, nb_classes]), [-1, nb_classes, nb_features]))
```
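As a side note, the explicit tile can usually be avoided by inserting a singleton class axis and letting the distribution broadcast over its `(nb_classes, nb_features)` batch shape; a sketch of that alternative (not the post's code):

```python
# (nb_samples, 1, nb_features) broadcasts against the (nb_classes, nb_features)
# batch shape, giving (nb_samples, nb_classes, nb_features) log-probabilities
all_log_probs = self.dist.log_prob(tf.expand_dims(X, 1))
```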
The next step is to add up the contributions of each feature to the likelihood of each class. In TensorFlow lingo, this is a reduce operation over the features axis:
```python
# (nb_samples, nb_classes)
cond_probs = tf.reduce_sum(all_log_probs, axis=2)
```
We can then add up the priors and the conditional probabilities to get the posterior distribution of the class label given the features:
```python
joint_likelihood = tf.add(priors, cond_probs)
```
In the derivation, we ignored the normalization factor, so the expression above is not a proper probability distribution: it doesn't add up to 1. We fix that by subtracting a normalization factor in log space, using TensorFlow's `reduce_logsumexp`. Naively computing `log(sum(exp(..)))` won't work here because of numerical issues (see the logsumexp trick).
```python
norm_factor = tf.reduce_logsumexp(
    joint_likelihood, axis=1, keep_dims=True)
log_prob = joint_likelihood - norm_factor
```
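For reference, the trick behind `reduce_logsumexp` is to shift by the maximum before exponentiating, so the intermediate values stay in a safe range; a minimal numpy version might look like this:

```python
import numpy as np

def logsumexp(a, axis=None, keepdims=False):
    m = np.max(a, axis=axis, keepdims=True)               # shift by the max
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)
```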
Finally, we exponentiate to get actual probabilities:
```python
probs = tf.exp(log_prob)
```
By feeding in a grid of points and drawing the contour lines at 0.5 probability, we get the nice plot at the top:
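For completeness, a rough sketch of how such a plot could be produced, assuming the model was trained on just the two petal features and that a hypothetical `predict_fn` evaluates the `probs` tensor for a batch of points (both names are assumptions, not the post's actual code):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical: predict_fn maps an (N, 2) array of points to (N, nb_classes)
# class probabilities, e.g. by running `probs` in a tf.Session.
x0, x1 = np.meshgrid(np.linspace(0, 8, 200), np.linspace(0, 3, 200))
Z = predict_fn(np.c_[x0.ravel(), x1.ravel()])

for c in range(Z.shape[1]):
    plt.contour(x0, x1, Z[:, c].reshape(x0.shape), levels=[0.5])
plt.scatter(X[:, 0], X[:, 1], c=y)   # the training points, colored by class
plt.show()
```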
Building a simple Naive Bayes classifier in TensorFlow is a good learning exercise to get familiar with TensorFlow's probability distributions and practice the less common tensor operations.