A commonly used baseline for text classification competitions on Kaggle is NB-SVM, introduced by Sida Wang and Chris Manning in their 2012 paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. With deep-learning-based models dominating the field of NLP, it is nice to have a bag-of-words model that trains with a fraction of the resources (time and compute) but performs only slightly worse.
1. Abstract
- Naive Bayes (NB) and Support Vector Machine (SVM) are widely used as baselines in text-related tasks but their performance varies significantly across variants, features and datasets.
- Word bigrams are useful for sentiment analysis, but not so much for topical text classification tasks
- NB does better than SVM for short snippet sentiment tasks, while SVM outperforms NB for longer documents
- An SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets
2. Methods
- The main model is formulated as a linear classifier \(y^{(k)} = \text{sign} \left( \textbf{w}^T \textbf{x}^{(k)} + b \right) \tag{1}\)
- Let $\textbf{f}^{(i)}$ be the feature vector for training case $i$ with binary label $y^{(i)} \in \{-1, 1\}$. Define the two count vectors $\textbf{p}$ and $\textbf{q}$ as \(\textbf{p} = \alpha + \sum_{i: y^{(i)} = 1} \textbf{f}^{(i)}, \quad \textbf{q} = \alpha + \sum_{i: y^{(i)} = -1} \textbf{f}^{(i)}\)
- $\alpha$ is the smoothing parameter
- The log-count ratio is defined as \(\textbf{r} = \text{log} \left( \frac{\textbf{p} / \lVert \textbf{p} \rVert_1}{\textbf{q} / \lVert \textbf{q} \rVert_1} \right)\) (computed in the sketch below)
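As a concrete illustration, here is a minimal sketch (not the paper's code) of computing the count vectors and the log-count ratio with NumPy; the variable names `F`, `y`, and `alpha` are illustrative assumptions.

```python
import numpy as np

# Illustrative toy data: document-term count matrix F (n_docs x n_features)
# and binary labels y in {-1, +1}.
F = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
alpha = 1.0  # smoothing parameter

# Count vectors: p accumulates features over positive cases, q over negative cases
p = alpha + F[y == 1].sum(axis=0)
q = alpha + F[y == -1].sum(axis=0)

# Log-count ratio r = log((p / ||p||_1) / (q / ||q||_1))
r = np.log((p / np.abs(p).sum()) / (q / np.abs(q).sum()))
```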
2.1 Multinomial Naive Bayes (MNB)
- For MNB, the feature vectors represent the frequencies with which events are generated by a class-conditional multinomial distribution with parameters $\textbf{p}_y = (p_{y1}, \cdots, p_{yn})$.
- The feature vector $\textbf{x} = (x_1, \cdots, x_n)$ is a histogram, where $x_k$ is the number of times event $k$ was observed in a particular instance.
- With the multinomial assumption and the Naive Bayes assumption, the likelihood of $\textbf{x}$ conditional on $y$ is given by \(p(\textbf{x} | y) = \frac{\left( \sum_k x_k \right)!}{\prod_k x_k !} \prod_k p_{yk}^{x_k}\) \(p(y | \textbf{x}) = \frac{p(y) p(\textbf{x} | y)}{p(\textbf{x})} \propto p(y) p(\textbf{x} | y)\) \(\begin{align} \text{log} \ p(y | \textbf{x}) &\propto \text{log} \ \left[ p(y) \prod_k p_{yk}^{x_k} \right] \\\\ &= \text{log} \ p(y) + \sum_k x_k \ \text{log} \ p_{yk} \tag{4} \end{align}\)
- The multinomial Naive Bayes classifier becomes a linear classifier when expressed in log-space, since $Eq. (4)$ can be rewritten in the form of $Eq. (1)$ \(\begin{align} y &= \underset{y}{\text{argmax}} \ \text{log} \ p(y) + \sum_k x_k \ \text{log} \ p_{yk} \\\\ &= \text{sign} \left( \text{log} \ \frac{p(y = 1)}{p(y = -1)} + \sum_k x_k \ \text{log} \ \frac{p_{1k}}{p_{-1k}} \right) \\\\ &= \text{sign} \left( \text{log} \ \frac{N_+}{N_-} + \textbf{r}^T\textbf{f} \right) \end{align}\) where $N_+$ and $N_-$ are the numbers of positive and negative training cases
- Metsis et al. showed that binarizing $\textbf{f}$ works better. Hence we define $\hat{\textbf{f}} = \textbf{1}(\textbf{f} > 0)$ and use $\hat{\textbf{f}}$ to compute $\hat{\textbf{p}}, \hat{\textbf{q}}, \hat{\textbf{r}}$, as in the sketch below
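Putting Section 2.1 together, here is a minimal sketch of MNB as a linear classifier, continuing from the toy `F`, `y`, and `r` defined in the sketch above (all names are illustrative).

```python
import numpy as np

# Binarize the features (Metsis et al.); the toy F above is already binary,
# so r computed from F equals r_hat computed from F_hat.
F_hat = (F > 0).astype(float)

# Class prior log-ratio b = log(N+ / N-)
N_pos, N_neg = (y == 1).sum(), (y == -1).sum()
b = np.log(N_pos / N_neg)

# MNB prediction as a linear classifier: y = sign(r^T f_hat + b)
preds = np.sign(F_hat @ r + b)
```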
2.2 Support Vector Machine (SVM)
- For SVM, we set $\textbf{x} = \hat{\textbf{f}}$ and obtain $\textbf{w}, b$ by minimizing the L2-regularized L2-loss \(\textbf{w}^T \textbf{w} + C \sum_i \text{max} \left( 0, 1 - y^{(i)} (\textbf{w}^T \hat{\textbf{f}}^{(i)} + b) \right)^2\) (see the sketch below)
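A minimal sketch of this step with scikit-learn, continuing from the `F_hat` and `y` above. `LinearSVC` with the squared hinge loss corresponds to an L2-regularized L2-loss linear SVM; the value of `C` here is illustrative, not tuned.

```python
from sklearn.svm import LinearSVC

# L2-regularized L2-loss (squared hinge) linear SVM on binarized features
svm = LinearSVC(C=1.0, penalty="l2", loss="squared_hinge")
svm.fit(F_hat, y)

w, b_svm = svm.coef_.ravel(), svm.intercept_[0]
```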
2.3 SVM with NB features (NBSVM)
- For NBSVM, we use the log-count ratios to scale the features and set $\textbf{x} = \hat{\textbf{r}} \circ \hat{\textbf{f}}$, where $\circ$ denotes the elementwise product
- While this works well for longer documents, an interpolation between MNB and SVM performs well across all documents: \(\textbf{w}' = (1 - \beta) \bar{w} + \beta \textbf{w}\) (sketched after this list)
- $\bar{w} = \lVert \textbf{w} \rVert_1 / |V|$ is the mean magnitude of $\textbf{w}$ and $\beta \in [0, 1]$ is the interpolation parameter
- The interpolation is a form of regularization: trust NB unless the SVM is very confident
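A minimal sketch of NBSVM with interpolation, continuing from the `F_hat`, `r`, and `y` above; the value of `beta` here is illustrative, not a tuned one.

```python
import numpy as np
from sklearn.svm import LinearSVC

# NB-scaled features: x = r_hat ∘ f_hat (elementwise product)
X = F_hat * r

svm = LinearSVC(C=1.0, loss="squared_hinge")
svm.fit(X, y)
w = svm.coef_.ravel()

# Interpolate: w' = (1 - beta) * w_bar + beta * w,
# with w_bar the mean magnitude of w (||w||_1 / |V|)
w_bar = np.abs(w).sum() / w.size
beta = 0.25  # illustrative interpolation value
w_prime = (1 - beta) * w_bar + beta * w

# Predict with the interpolated weights
preds = np.sign(X @ w_prime + svm.intercept_[0])
```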
3. Results
- MNB is better at snippets while SVM is better at full-length reviews
- NBSVM performs well on both snippets and longer documents, for sentiment, topic and subjectivity classification. NBSVM is a very strong baseline for sophisticated methods aiming to beat a bag of features.
- In sentiment classification there are gains from adding bigrams because they can capture modified verbs and nouns.
4. Implementation
A well-written implementation that follows the scikit-learn estimator API and exposes the interpolation parameter $\beta$ can be found here
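For reference, a hypothetical minimal version of the NB feature scaling wrapped as a scikit-learn transformer might look like the sketch below (this is not the linked implementation; the class and parameter names are made up). It could then be combined with a linear classifier, e.g. `make_pipeline(NBFeaturer(), LinearSVC())`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NBFeaturer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer applying the NB log-count-ratio scaling.

    Expects a dense document-term matrix X and binary labels y in {-1, +1}.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Binarize, then compute smoothed count vectors and the log-count ratio
        X_bin = (X > 0).astype(float)
        p = self.alpha + X_bin[y == 1].sum(axis=0)
        q = self.alpha + X_bin[y == -1].sum(axis=0)
        self.r_ = np.log((p / np.abs(p).sum()) / (q / np.abs(q).sum()))
        return self

    def transform(self, X):
        # Scale binarized features by the learned log-count ratio: r_hat ∘ f_hat
        return (X > 0).astype(float) * self.r_
```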