A commonly used baseline for text classification competitions on Kaggle is NB-SVM, introduced by Sida Wang and Chris Manning in the 2012 paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification". With deep-learning-based models dominating the field of NLP, it is nice to have a bag-of-words model that trains with a fraction of the resources (time and compute) but performs only slightly worse.

1. Abstract

  • Naive Bayes (NB) and Support Vector Machine (SVM) are widely used as baselines in text-related tasks but their performance varies significantly across variants, features and datasets.
  • Word bigrams are useful for sentiment analysis, but not so much for topical text classification tasks.
  • NB does better than SVM for short snippet sentiment tasks, while SVM outperforms NB for longer documents.
  • An SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets.

2. Methods

  • The main model is formulated as a linear classifier

$$y = \text{sign}(\textbf{w}^T\textbf{x} + b)\tag{1}$$

  • Let $\textbf{f}^{(i)}$ be the feature count vector for training case $i$ with binary label $y^{(i)} \in \{-1, 1\}$. Define the two count vectors $\textbf{p}$ and $\textbf{q}$ as

$$\textbf{p} = \alpha + \sum_{i: y^{(i)} = 1} \textbf{f}^{(i)}\tag{2}$$

$$\textbf{q} = \alpha + \sum_{i: y^{(i)} = -1} \textbf{f}^{(i)}\tag{3}$$

  • $\alpha$ is the smoothing parameter, added to every count
  • The log-count ratio is then defined as

$$\textbf{r} = \text{log}\left( \frac{\textbf{p} / ||\textbf{p}||_1 }{ \textbf{q} / ||\textbf{q}||_1 } \right)\tag{4}$$
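
Eqs. (2) to (4) translate almost line by line into NumPy. The following is a minimal sketch, not the paper's code: the function name `log_count_ratio` and the toy count matrix are made up for illustration, and the features are assumed to be non-negative counts so that the L1 norm is just the sum.

```python
import numpy as np

def log_count_ratio(F, y, alpha=1.0):
    """Smoothed log-count ratio r of Eq. (4).

    F: (n_samples, n_features) array of non-negative feature counts.
    y: (n_samples,) array of labels in {-1, +1}.
    alpha: smoothing parameter added to every count.
    """
    F = np.asarray(F, dtype=float)
    y = np.asarray(y)
    p = alpha + F[y == 1].sum(axis=0)    # Eq. (2): counts over positive cases
    q = alpha + F[y == -1].sum(axis=0)   # Eq. (3): counts over negative cases
    # p and q are strictly positive, so their L1 norms equal their sums.
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy data (hypothetical): 4 documents, 3 vocabulary terms.
F = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
y = np.array([1, 1, -1, -1])
r = log_count_ratio(F, y)
```

A positive entry of `r` marks a term that is relatively more frequent in the positive class, a negative entry one that is more frequent in the negative class.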

2.1 Multinomial Naive Bayes (MNB)

  • For MNB, the feature vectors represent the frequencies with which events are generated by a multinomial distribution with event probabilities $(p_1, \cdots, p_n)$ (not to be confused with the count vector $\textbf{p}$ of Eq. (2)).
  • The feature vector $\textbf{x} = (x_1, \cdots, x_n)$ is a histogram, where $x_k$ is the number of times event $k$ was observed in a particular instance.
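  • With the multinomial assumption and the Naive Bayes assumption, the likelihood of $\textbf{x}$ conditional on $y$ is given by $$p(\textbf{x} | y) = \frac{\left(\sum_k x_k\right) !}{\prod_k x_k !} \prod_k p_{yk}^{x_k}$$ where $p_{yk}$ is the probability of event $k$ under class $y$. By Bayes' rule, $$p(y | \textbf{x}) = \frac{p(y) p(\textbf{x} | y)}{p(\textbf{x})} \propto p(y) p(\textbf{x} | y) $$ Taking logs and collecting the terms that do not depend on $y$ (the multinomial coefficient and $p(\textbf{x})$) into a constant gives $$ \begin{align} \text{log} \ p(y | \textbf{x}) &= \text{log} \ \left[ p(y) \prod_k p_{yk}^{x_k} \right] + \text{const} \\ &= \text{log} \ p(y) + \sum_k x_k \text{log} \ p_{yk} + \text{const} \tag{5} \end{align} $$
  • The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space, since $Eq. (5)$ can be rewritten in the form of $Eq. (1)$ $$ \begin{align} y &= \underset{y}{\text{argmax}} \ \left[ \text{log} \ p(y) + \sum_k x_k \ \text{log} \ p_{yk} \right] \\ &= \text{sign} \left( \text{log} \ \frac{p(y = 1)}{p(y = -1)} + \sum_k x_k \ \text{log} \ \frac{p_{1k}}{p_{-1k}} \right) \\ &= \text{sign} \left( \text{log} \ \frac{N_+}{N_-} + \textbf{r}^T\textbf{f} \right) \end{align} $$ Here $N_+$ and $N_-$ are the numbers of positive and negative training cases, and the last step estimates $p_{1k}$ by $p_k / ||\textbf{p}||_1$ and $p_{-1k}$ by $q_k / ||\textbf{q}||_1$, so that $\text{log} \ (p_{1k} / p_{-1k}) = r_k$. MNB therefore corresponds to $\textbf{x} = \textbf{f}$, $\textbf{w} = \textbf{r}$ and $b = \text{log} \ (N_+ / N_-)$ in $Eq. (1)$.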

  • Metsis et al. showed that binarizing $\textbf{f}$ is better. Hence we define $\hat{\textbf{f}} = \textbf{1}(\textbf{f} > 0)$ and use $\hat{\textbf{f}}$ in place of $\textbf{f}$ to compute $\hat{\textbf{p}}, \hat{\textbf{q}}, \hat{\textbf{r}}$.
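
To make the linear form of MNB concrete, here is a small sketch of the binarized variant in NumPy. The helper names `fit_mnb` and `predict_mnb` and the toy data are hypothetical, and mapping a score of exactly zero to the positive class is an arbitrary tie-breaking assumption.

```python
import numpy as np

def fit_mnb(F, y, alpha=1.0):
    """Return (r_hat, b) for binarized MNB viewed as a linear classifier."""
    F_hat = (np.asarray(F) > 0).astype(float)   # f_hat = 1(f > 0)
    y = np.asarray(y)
    p_hat = alpha + F_hat[y == 1].sum(axis=0)
    q_hat = alpha + F_hat[y == -1].sum(axis=0)
    r_hat = np.log((p_hat / p_hat.sum()) / (q_hat / q_hat.sum()))
    b = np.log((y == 1).sum() / (y == -1).sum())  # log(N+ / N-)
    return r_hat, b

def predict_mnb(F, r_hat, b):
    """sign(r_hat^T f_hat + b); a zero score is mapped to +1 (assumption)."""
    F_hat = (np.asarray(F) > 0).astype(float)
    return np.where(F_hat @ r_hat + b >= 0, 1, -1)

# Toy data (hypothetical): 4 documents, 3 vocabulary terms.
F = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
y = np.array([1, 1, -1, -1])
r_hat, b = fit_mnb(F, y)
preds = predict_mnb(F, r_hat, b)
```

Because the classes are balanced in this toy set, the bias `b` is zero and the prediction depends only on which terms are present in a document.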

2.2 Support Vector Machine (SVM)

  • For SVM, we set $\textbf{x} = \hat{\textbf{f}}$ and obtain $\textbf{w}$ and $b$ by minimizing the L2-regularized squared hinge loss

$$L(\textbf{w}, b) = \textbf{w}^T \textbf{w} + C \sum_i \text{max}\left(0, 1 - y^{(i)}(\textbf{w}^T \hat{\textbf{f}}^{(i)} + b)\right)^{2} $$
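
The objective above can be evaluated directly, which is handy for checking a solver's output. This is only a sketch that computes the loss value under the formula as written; it does not perform the minimization, and the name `l2_svm_loss` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def l2_svm_loss(w, b, F_hat, y, C=1.0):
    """w^T w + C * sum_i max(0, 1 - y_i (w^T f_hat_i + b))^2."""
    w = np.asarray(w, dtype=float)
    margins = np.asarray(y) * (np.asarray(F_hat, dtype=float) @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero for cases beyond the margin
    return float(w @ w + C * np.sum(hinge ** 2))

# Toy binarized features (hypothetical): 4 documents, 3 terms.
F_hat = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0]])
y = np.array([1, 1, -1, -1])

loss_zero = l2_svm_loss(np.zeros(3), 0.0, F_hat, y)             # every case violates the margin
loss_sep = l2_svm_loss(np.array([1.0, -1.0, 0.0]), 0.0, F_hat, y)  # every case sits on the margin
```

With all-zero weights each of the four cases contributes a squared hinge of 1, while the separating weights leave only the regularization term.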

2.3 SVM with NB features (NBSVM)

  • For NBSVM, we scale the binarized features by the log-count ratios and set $\textbf{x} = \hat{\textbf{r}} \circ \hat{\textbf{f}}$, where $\circ$ is the elementwise product. The model is otherwise identical to the SVM
  • While this works well for longer documents, an interpolation between MNB and SVM performs well for all documents

$$\textbf{w}' = (1 - \beta)\bar{w} + \beta \textbf{w}$$

  • $\bar{w} = ||\textbf{w}||_1 / |V|$ is the mean magnitude of $\textbf{w}$ (with $|V|$ the number of features) and $\beta \in [0, 1]$ is the interpolation parameter
  • The interpolation is a form of regularization: trust NB unless the SVM is very confident
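
The NBSVM feature map and the interpolation step can be sketched in a few lines of NumPy. This assumes the SVM weights `w` and bias `b` have already been fitted on the scaled features by some L2-SVM solver, which is not shown; the function names and the default `beta=0.25` are illustrative choices, not the reference implementation.

```python
import numpy as np

def nb_features(F, r_hat):
    """Elementwise product of r_hat with the binarized features of each row."""
    F_hat = (np.asarray(F) > 0).astype(float)
    return F_hat * np.asarray(r_hat, dtype=float)   # broadcasts r_hat over rows

def interpolate_weights(w, beta=0.25):
    """w' = (1 - beta) * w_bar + beta * w, with w_bar = ||w||_1 / |V|."""
    w = np.asarray(w, dtype=float)
    w_bar = np.abs(w).sum() / w.size   # mean magnitude of w (a scalar)
    return (1.0 - beta) * w_bar + beta * w

def predict_nbsvm(F, r_hat, w, b, beta=0.25):
    """sign(w'^T x + b) with x = r_hat * f_hat; a zero score maps to +1 (assumption)."""
    X = nb_features(F, r_hat)
    return np.where(X @ interpolate_weights(w, beta) + b >= 0, 1, -1)

# Hypothetical inputs: r_hat from a binarized toy corpus, w from some SVM fit.
r_hat = np.log([3.0, 1.0 / 3.0, 1.0])
w = np.array([2.0, -4.0, 0.0])
w_half = interpolate_weights(w, beta=0.5)
w_nb = interpolate_weights(w, beta=0.0)
```

With `beta=0` every weight equals the scalar $\bar{w}$, so the score is proportional to $\hat{\textbf{r}}^T \hat{\textbf{f}}$ and the decision reduces to the MNB one (up to the bias); with `beta=1` the SVM weights are used unchanged.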

3. Results

  • MNB is better at snippets while SVM is better at full-length reviews
  • NBSVM performs well on both snippets and longer documents, for sentiment, topic and subjectivity classification. NBSVM is a very strong baseline for sophisticated methods aiming to beat a bag of features.
  • In sentiment classification there are gains from adding bigrams because they can capture modified verbs and nouns.

4. Implementation

A well-written implementation that follows the scikit-learn estimator API and includes the interpolation parameter $\beta$ can be found here