A commonly used baseline for text classification competitions on Kaggle is NB-SVM, introduced by Sida Wang and Chris Manning in their 2012 paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. With deep-learning-based models dominating the field of NLP, it is nice to have a bag-of-words model that trains with a fraction of the resources (time and compute) but performs only slightly worse.
1. Abstract
- Naive Bayes (NB) and Support Vector Machine (SVM) are widely used as baselines in text-related tasks but their performance varies significantly across variants, features and datasets.
- Word bigrams are useful for sentiment analysis, but not so much for topical text classification tasks
- NB does better than SVM for short snippet sentiment tasks, while SVM outperforms NB for longer documents
- An SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets
2. Methods
- The main model is formulated as a linear classifier \(y^{(k)} = \text{sign} \left( \textbf{w}^T \textbf{x}^{(k)} + b \right) \tag{1}\)
- Let $\textbf{f}^{(i)}$ be the feature vector for training case $i$ with binary label $y^{(i)} \in \{-1, 1\}$. Define the two count vectors $\textbf{p}$ and $\textbf{q}$ as \(\textbf{p} = \alpha + \sum_{i: y^{(i)} = 1} \textbf{f}^{(i)}, \quad \textbf{q} = \alpha + \sum_{i: y^{(i)} = -1} \textbf{f}^{(i)}\)
- $\alpha$ is the smoothing parameter
- The log-count ratio is defined as \(\textbf{r} = \text{log} \left( \frac{\textbf{p} / \lVert \textbf{p} \rVert_1}{\textbf{q} / \lVert \textbf{q} \rVert_1} \right)\) (computed in the sketch below)
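As a concrete illustration, here is a minimal sketch (not the paper's code) of computing the count vectors and the log-count ratio with NumPy; the variable names `F`, `y`, and `alpha` are illustrative assumptions.

```python
import numpy as np

# Illustrative toy data: document-term count matrix F (n_docs x n_features)
# and binary labels y in {-1, +1}.
F = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
alpha = 1.0  # smoothing parameter

# Count vectors: p accumulates features over positive cases, q over negative cases
p = alpha + F[y == 1].sum(axis=0)
q = alpha + F[y == -1].sum(axis=0)

# Log-count ratio r = log((p / ||p||_1) / (q / ||q||_1))
r = np.log((p / np.abs(p).sum()) / (q / np.abs(q).sum()))
```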
2.1 Multinomial Naive Bayes (MNB)
- For MNB, the feature vectors represent the frequencies with which events are generated by a class-conditional multinomial distribution with parameters $\textbf{p}_y = (p_{y1}, \cdots, p_{yn})$.
- The feature vector $\textbf{x} = (x_1, \cdots, x_n)$ is a histogram, where $x_k$ is the number of times event $k$ was observed in a particular instance.
- With the multinomial assumption and the Naive Bayes assumption, the likelihood of $\textbf{x}$ conditional on $y$ is given by \(p(\textbf{x} | y) = \frac{\left( \sum_k x_k \right)!}{\prod_k x_k !} \prod_k p_{yk}^{x_k}\) \(p(y | \textbf{x}) = \frac{p(y) p(\textbf{x} | y)}{p(\textbf{x})} \propto p(y) p(\textbf{x} | y)\) \(\begin{align} \text{log} \ p(y | \textbf{x}) &\propto \text{log} \ \left[ p(y) \prod_k p_{yk}^{x_k} \right] \\\\ &= \text{log} \ p(y) + \sum_k x_k \ \text{log} \ p_{yk} \tag{4} \end{align}\)
- The multinomial Naive Bayes classifier becomes a linear classifier when expressed in log-space, since $Eq. (4)$ can be rewritten in the form of $Eq. (1)$ \(\begin{align} y &= \underset{y}{\text{argmax}} \ \text{log} \ p(y) + \sum_k x_k \ \text{log} \ p_{yk} \\\\ &= \text{sign} \left( \text{log} \ \frac{p(y = 1)}{p(y = -1)} + \sum_k x_k \ \text{log} \ \frac{p_{1k}}{p_{-1k}} \right) \\\\ &= \text{sign} \left( \text{log} \ \frac{N_+}{N_-} + \textbf{r}^T\textbf{f} \right) \end{align}\) where $N_+$ and $N_-$ are the numbers of positive and negative training cases
- Metsis et al. showed that binarizing $\textbf{f}$ works better. Hence we define $\hat{\textbf{f}} = \textbf{1}(\textbf{f} > 0)$ and use $\hat{\textbf{f}}$ to compute $\hat{\textbf{p}}, \hat{\textbf{q}}, \hat{\textbf{r}}$, as in the sketch below
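Putting Section 2.1 together, here is a minimal sketch of MNB as a linear classifier, continuing from the toy `F`, `y`, and `r` defined in the sketch above (all names are illustrative).

```python
import numpy as np

# Binarize the features (Metsis et al.); the toy F above is already binary,
# so r computed from F equals r_hat computed from F_hat.
F_hat = (F > 0).astype(float)

# Class prior log-ratio b = log(N+ / N-)
N_pos, N_neg = (y == 1).sum(), (y == -1).sum()
b = np.log(N_pos / N_neg)

# MNB prediction as a linear classifier: y = sign(r^T f_hat + b)
preds = np.sign(F_hat @ r + b)
```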
2.2 Support Vector Machine (SVM)
- For SVM, we set $\textbf{x} = \hat{\textbf{f}}$ and obtain $\textbf{w}, b$ by minimizing the L2-regularized L2-loss \(\textbf{w}^T \textbf{w} + C \sum_i \text{max} \left( 0, 1 - y^{(i)} (\textbf{w}^T \hat{\textbf{f}}^{(i)} + b) \right)^2\) (see the sketch below)
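A minimal sketch of this step with scikit-learn, continuing from the `F_hat` and `y` above. `LinearSVC` with the squared hinge loss corresponds to an L2-regularized L2-loss linear SVM; the value of `C` here is illustrative, not tuned.

```python
from sklearn.svm import LinearSVC

# L2-regularized L2-loss (squared hinge) linear SVM on binarized features
svm = LinearSVC(C=1.0, penalty="l2", loss="squared_hinge")
svm.fit(F_hat, y)

w, b_svm = svm.coef_.ravel(), svm.intercept_[0]
```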
2.3 SVM with NB features (NBSVM)
- For NBSVM, we use the log-count ratios to scale the features and set $\textbf{x} = \hat{\textbf{r}} \circ \hat{\textbf{f}}$, where $\circ$ denotes the elementwise product
- While this works well for longer documents, an interpolation between MNB and SVM performs well across all documents: \(\textbf{w}' = (1 - \beta) \bar{w} + \beta \textbf{w}\) (sketched after this list)
- $\bar{w} = \lVert \textbf{w} \rVert_1 / |V|$ is the mean magnitude of $\textbf{w}$ and $\beta \in [0, 1]$ is the interpolation parameter
- The interpolation is a form of regularization: trust NB unless the SVM is very confident
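A minimal sketch of NBSVM with interpolation, continuing from the `F_hat`, `r`, and `y` above; the value of `beta` here is illustrative, not a tuned one.

```python
import numpy as np
from sklearn.svm import LinearSVC

# NB-scaled features: x = r_hat ∘ f_hat (elementwise product)
X = F_hat * r

svm = LinearSVC(C=1.0, loss="squared_hinge")
svm.fit(X, y)
w = svm.coef_.ravel()

# Interpolate: w' = (1 - beta) * w_bar + beta * w,
# with w_bar the mean magnitude of w (||w||_1 / |V|)
w_bar = np.abs(w).sum() / w.size
beta = 0.25  # illustrative interpolation value
w_prime = (1 - beta) * w_bar + beta * w

# Predict with the interpolated weights
preds = np.sign(X @ w_prime + svm.intercept_[0])
```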
3. Results
- MNB is better at snippets while SVM is better at full-length reviews
- NBSVM performs well on both snippets and longer documents, for sentiment, topic and subjectivity classification. NBSVM is a very strong baseline for sophisticated methods aiming to beat a bag of features.
- In sentiment classification there are gains from adding bigrams because they can capture modified verbs and nouns.
4. Implementation
A well-written implementation that follows the scikit-learn estimator API and exposes the interpolation parameter $\beta$ can be found here
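For reference, a hypothetical minimal version of the NB feature scaling wrapped as a scikit-learn transformer might look like the sketch below (this is not the linked implementation; the class and parameter names are made up). It could then be combined with a linear classifier, e.g. `make_pipeline(NBFeaturer(), LinearSVC())`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NBFeaturer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer applying the NB log-count-ratio scaling.

    Expects a dense document-term matrix X and binary labels y in {-1, +1}.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Binarize, then compute smoothed count vectors and the log-count ratio
        X_bin = (X > 0).astype(float)
        p = self.alpha + X_bin[y == 1].sum(axis=0)
        q = self.alpha + X_bin[y == -1].sum(axis=0)
        self.r_ = np.log((p / np.abs(p).sum()) / (q / np.abs(q).sum()))
        return self

    def transform(self, X):
        # Scale binarized features by the learned log-count ratio: r_hat ∘ f_hat
        return (X > 0).astype(float) * self.r_
```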