
This is the IPDC Python+data Data-Science Algorithms extension unit activity for the Naïve Bayes Classifier algorithm.

Mathematics behind Gaussian naïve Bayes

Bayes’ Theorem can be stated as:

\[P(A\vert B) = \frac{P(A) P(B\vert A)}{P(B)}\]

or:

\[\text{posterior} = \frac{\text{prior}\times\text{likelihood}}{\text{evidence}}\]

where:

  • $A$ and $B$ are events.
  • $P(A\vert B)$ is the posterior probability of the class ($A$, the target) given the predictor ($B$, the attributes).
  • $P(A)$ is the prior probability of the class.
  • $P(B\vert A)$ is the likelihood: the probability of the predictor given the class.
  • $P(B)$ is the prior probability of the predictor, with $P(B)\ne0$.
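
With these definitions in place, Bayes’ theorem can be checked numerically. The values below are a hypothetical illustration, not taken from the Titanic data:

```python
# Hypothetical example: 40% of passengers survived (prior), a given
# attribute is seen in 70% of survivors (likelihood) and in 50% of
# all passengers (evidence).
prior = 0.4        # P(A)
likelihood = 0.7   # P(B|A)
evidence = 0.5     # P(B)

posterior = prior * likelihood / evidence  # P(A|B)
print(posterior)  # 0.56
```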

Gaussian naïve Bayes further assumes (as described on Kaggle):

  • The features are independent, so that $P\left(x_{i}\vert y, x_{1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{n}\right) = P\left(x_{i}\vert y\right)$ and the joint probability is their product.
  • The feature values are normally distributed with mean $\mu = \frac{\sum_{i=1}^{N}x_{i}}{N}$ and variance $\sigma^{2} = \frac{\sum_{i=1}^{N}x_{i}^{2}}{N} -\mu^{2}$, so that the probability density of $v$ given a class $C_{k}$ can be computed by $p\left(x = v\vert C_{k}\right) = \frac{1}{\sqrt{2\pi\sigma_{k}^{2}}}e^{-\frac{\left(v - \mu_{k}\right)^{2}}{2\sigma_{k}^{2}}}$, as in the sketch below.
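
As a sketch of the second assumption, the per-class mean, variance, and Gaussian density can be computed directly. The feature values here are made up for illustration:

```python
import math

def gaussian_pdf(v, mu, var):
    """Probability density of v under a normal with mean mu and variance var."""
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical Age values for one class C_k.
x = [22.0, 38.0, 26.0, 35.0, 29.0]
n = len(x)
mu = sum(x) / n                            # mean
var = sum(v * v for v in x) / n - mu ** 2  # variance, population form as above

print(gaussian_pdf(30.0, mu, var))  # p(Age = 30 | C_k), about 0.068 here
```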

Using data to build a classifier model

  • The Titanic survivability data are available on Kaggle.
  • Given .CSV headers ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'train'], the relevant features are Age, Embarked, Fare, Parch, Sex, SibSp. The target is Survived. (The training data are labeled with train.)
  • The features must be verified to meet the Gaussian naïve Bayes criteria (as in Dimitre Oliveira’s solution).
  • The data must be encoded numerically (categorical values converted to numbers) and cleaned of missing values, or with missing values imputed with the mean where appropriate, as in the sketch below.
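
A minimal cleaning sketch under those constraints, assuming the Kaggle train.csv sits in the working directory (the file name and the categorical encodings are assumptions):

```python
import pandas as pd

# Assumes the Kaggle Titanic train.csv is in the working directory.
df = pd.read_csv("train.csv")

features = ["Age", "Embarked", "Fare", "Parch", "Sex", "SibSp"]

# Encode the categorical features numerically.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# Impute missing values with the column mean.
for col in features:
    df[col] = df[col].fillna(df[col].mean())

X = df[features]
y = df["Survived"]
```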

Following the Wikipedia article and its worked example yields the following approach:

  • For each of the 6 features $f_{i}$, calculate the mean $\mu$ and variance $\sigma^{2}$ over the training data for each of the two values of $C_k$: survived $S$ and not survived $\overline{S}$.
  • The prior probability is $P(C_{k})$ and the likelihood is $\prod_{i}P({f_{i}\vert C_{k}})$, so by the independence assumption the joint probability is: \(P\left(C_{k}, f_{1}, \dots, f_{n}\right) = P\left(C_{k}\right) \prod_{i}P\left({f_{i}\vert C_{k}}\right)\)
  • The total evidence is $Z = P(S) \prod_{i}{P(f_{i}\vert S)} + P(\overline{S}) \prod_{i}{P(f_{i}\vert\overline{S})}$, so the posterior probabilities of the two values of $C_k$ are: \(P(S\vert f_{1},\dots,f_{6}) = \frac{P\left(S\right) \prod_{i}P\left({f_{i}\vert S}\right)}{Z}\) \(P(\overline{S}\vert f_{1},\dots,f_{6}) = \frac{P(\overline{S}) \prod_{i}P({f_{i}\vert \overline{S}})}{Z}\)
  • If $P(\overline{S}\vert f_{1},\dots,f_{6}) \lt P(S\vert f_{1},\dots,f_{6})$ for a passenger, then that passenger is predicted to have survived.

Sample code
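
A minimal sketch of the approach above, continuing from the cleaned X and y of the earlier snippet. The function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def fit(X, y):
    """For each class C_k, store the prior P(C_k) and the per-feature
    mean and variance used by the Gaussian likelihood."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),               # prior P(C_k)
                     Xc.mean(axis=0).values,         # mu per feature
                     Xc.var(axis=0, ddof=0).values)  # sigma^2 per feature
    return params

def predict(params, x):
    """Return the class with the larger posterior P(C_k | f_1, ..., f_6)."""
    joint = {}
    for c, (prior, mu, var) in params.items():
        pdf = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint[c] = prior * np.prod(pdf)  # P(C_k) * prod_i P(f_i | C_k)
    Z = sum(joint.values())              # total evidence
    return max(joint, key=lambda c: joint[c] / Z)

params = fit(X, y)
print(predict(params, X.iloc[0].values))  # 1 predicts survived, 0 not
```

Since $Z$ is the same for both classes, comparing the joint probabilities is equivalent to comparing the posteriors, which reproduces the decision rule above.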

#data #Python #IPDC