Naive Bayes classifiers deserve their place in Machine Learning 101 as one of the simplest and fastest algorithms for classification. This post explains a very straightforward implementation in TensorFlow that I created as part of a larger system. You can find the code here.
The textbook application of Naive Bayes (NB) classifiers is spam filtering, where word frequency counts are used to classify whether a given message is spam or not. In this example, however, we're going to use continuous data instead. More specifically, we'll be classifying flowers based on measurements of their petal size.
Just a quick refresher on the NB algorithm: as with any classifier, the training data is a set of training examples $\mathbf{x}$, each composed of $n$ features $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and labeled with its corresponding class $C_k$, where $k$ is one of $K$ classes. The goal is to learn a conditional probability model:
$$ p(C_k | x_1, x_2, \ldots, x_n) $$
for each of the $K$ classes in the dataset. Later, we're going to see how a simple rule can be used to make a decision on the basis of these conditional probabilities. Intuitively, learning this multivariate distribution will require a lot of data as the number of features grows. However, we can simplify the task if we assume that the features are conditionally independent given the class. While this assumption almost never holds for real data, it results in a simple yet surprisingly effective classifier.
The math develops as follows. First, we write down Bayes' theorem:
$$ p(C_k | \mathbf{x}) = \frac{p(C_k)\,p(\mathbf{x} | C_k)}{p(\mathbf{x})} $$
By definition of conditional probability, the numerator is just the joint probability distribution $p(C_k, \mathbf{x})$, and can be factored using the chain rule:
$$ p(C_k, \mathbf{x}) = p(x_1 | x_2, \ldots, x_n, C_k)\,p(x_2 | x_3, \ldots, x_n, C_k) \cdots p(x_n | C_k)\,p(C_k) $$
By the naive assumption, the mutual conditioning among features can be dropped. In other words, we assume that the value taken by each feature depends only on the class, and not on the values of the other features. This means we can simplify the previous formula quite a bit:
$$ p(C_k, \mathbf{x}) = p(x_1 | C_k)\,p(x_2 | C_k)\cdots p(x_n | C_k)\,p(C_k) $$
Going back to Bayes' theorem, we observe that we can discard the denominator since it's just a normalization factor that doesn't depend on the class. We get:
$$ p(C_k | \mathbf{x}) \propto p(C_k, \mathbf{x}) = p(C_k)\,p(x_1|C_k)\,p(x_2|C_k)\cdots p(x_n|C_k) = p(C_k)\prod_{i=1}^{n}p(x_i | C_k) $$
The last formula shows why NB is so convenient: all $p(x_i | C_k)$ distributions can be learned independently. During training, we can split the examples by label, then learn a univariate distribution for each of the features given the class. For this example, we're going to assume that $p(x_i | C_k)$ are Gaussian distributions whose $\mu$ and $\sigma$ we can estimate easily from the samples.
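Concretely, each class-conditional likelihood is then a univariate Gaussian density, with one mean $\mu_{ik}$ and one standard deviation $\sigma_{ik}$ per feature/class pair (the subscripts are introduced here just to make the pairing explicit):

$$ p(x_i | C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}}\exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right) $$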
To wrap up the algorithm, we need to make a decision based on these conditional probabilities. An intuitive and very common strategy is **Maximum a Posteriori (MAP)**: we simply pick the most likely class:
$$ \underset{k \in \{1,\ldots,K\}}{\arg\max}\ p(C_k) \prod_{i=1}^{n} p(x_i | C_k) $$
With sklearn, loading the dataset and training the classifier is trivial:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)  # the classic Iris flower dataset
gnb = GaussianNB()
gnb.fit(X, y)
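As a quick sanity check (just a sketch; scoring on the training data is only a smoke test, not a proper evaluation), we can ask the fitted model for its training accuracy:

```python
# Mean training-set accuracy of the fitted Gaussian NB model.
print(gnb.score(X, y))
```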
This is what a plot of the first 3 principal components looks like in 3D:
You may want to follow along with the actual code here.
We start by grouping the training samples based on their labeled class, and get a (nb_classes, nb_samples, nb_features) array.
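The snippet below doesn't show how points_by_class is built; a minimal NumPy sketch of the grouping could look like the following (it assumes every class has the same number of samples, as in Iris, so the per-class chunks stack into a regular array):

```python
import numpy as np

# Split the training set by class label and stack the groups into a
# single (nb_classes, nb_samples_per_class, nb_features) array.
points_by_class = np.array([X[y == c] for c in np.unique(y)])
```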
Based on the discussion above, we can fit individual Gaussian distributions to each combination of labeled class and feature. It's important to point out that, even though we're feeding in the data in one go, we are fitting a series of univariate distributions rather than a single multivariate one:
# Per-class, per-feature mean and variance, each of shape (nb_classes, nb_features).
mean, var = tf.nn.moments(tf.constant(points_by_class), axes=[1])
In this trivial example, we're using tf.constant to get the training data inside the TensorFlow graph. In real life, you probably want to use tf.placeholder or even more performant alternatives like the tf.data API (see the documentation).
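For example, a hypothetical placeholder-based variant of the same moment computation (TF 1.x API) might look like this, with the actual samples fed in at session run time:

```python
import tensorflow as tf

# Placeholder for the grouped training data; the middle dimension is left
# dynamic so the number of samples per class doesn't have to be fixed.
points_ph = tf.placeholder(tf.float32, shape=[nb_classes, None, nb_features])
mean, var = tf.nn.moments(points_ph, axes=[1])
```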
We take advantage of TensorFlow's tf.distributions module to create a Gaussian distribution with the estimated mean and variance:
self.dist = tf.distributions.Normal(loc=mean, scale=tf.sqrt(var))
This distribution is the only thing we need to keep around for inference, and it's luckily pretty compact, since the mean and variance are only of shape (nb_classes, nb_features).
For inference, it's important to work in the log probability space to avoid numerical errors due to repeated multiplication of small probabilities. We have:
$$ \log p(C_k | \mathbf{x}) = \log p(C_k) + \sum_{i=1}^n \log p(x_i | C_k) $$
To take care of the first term, we can assume that all classes are equally likely (i.e. uniform prior):
priors = np.log(np.array([1. / nb_classes] * nb_classes))
To compute the sum in the second term, we duplicate (tile) the feature vectors along a new "class" dimension, so that we can get probabilities from the distribution in a single run:
# (nb_samples, nb_classes, nb_features)
all_log_probs = self.dist.log_prob(
    tf.reshape(
        tf.tile(X, [1, nb_classes]), [-1, nb_classes, nb_features]))
The next step is to add up the contributions of each feature to the likelihood of each class. In TensorFlow lingo, this is a reduce operation over the features axis:
# (nb_samples, nb_classes)
cond_probs = tf.reduce_sum(all_log_probs, axis=2)
We can then add up the priors and the conditional probabilities to get the posterior distribution of the class label given the features:
joint_likelihood = tf.add(priors, cond_probs)
In the derivation, we ignored the normalization factor, so the expression above is not a proper probability distribution because it doesn't add up to 1. We fix that by subtracting a normalization factor in log space using TensorFlow's reduce_logsumexp. Naively computing log(sum(exp(..))) won't work here because of numerical issues (see the logsumexp trick).
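For reference, the trick shifts the largest exponent to zero before exponentiating, which keeps every term in a safe numerical range:

$$ \log\sum_k e^{a_k} = m + \log\sum_k e^{a_k - m}, \quad \text{where } m = \max_k a_k $$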
norm_factor = tf.reduce_logsumexp(
    joint_likelihood, axis=1, keep_dims=True)
log_prob = joint_likelihood - norm_factor
Finally, we exponentiate to get actual probabilities:
probs = tf.exp(log_prob)
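If all we need is the predicted label rather than the full posterior, the MAP rule from earlier is just an argmax over the class axis (a small sketch reusing the tensors defined above):

```python
# MAP decision: index of the most likely class for each sample.
predictions = tf.argmax(log_prob, axis=1)  # shape: (nb_samples,)
```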
By feeding in a grid of points and drawing the contour lines at 0.5 probability, we get the nice plot at the top of the post.
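A rough sketch of how such a plot could be produced (assuming a model trained on only two features, and a hypothetical predict_probs() helper that evaluates the graph above and returns class probabilities of shape (nb_points, nb_classes)):

```python
import numpy as np
import matplotlib.pyplot as plt

# Build a dense grid covering the two features used for training.
x0, x1 = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100))
grid = np.c_[x0.ravel(), x1.ravel()].astype(np.float32)

Z = predict_probs(grid)  # hypothetical helper wrapping the TF graph above

# One decision boundary per class, drawn at probability 0.5.
for k in range(Z.shape[1]):
    plt.contour(x0, x1, Z[:, k].reshape(x0.shape), levels=[0.5])
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```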
Building a simple Naive Bayes classifier in TensorFlow is a good learning exercise to get familiar with TensorFlow's probability distributions and practice the less common tensor operations.