Wide & Deep Learning for Recommendation Systems

Recently I started working on applications of text representations in recommendations. Since I don’t have much background in RecSys beyond the few lectures I took in graduate school, I will be reading some classic RecSys papers, starting with Wide & Deep Learning for Recommender Systems, the de facto standard baseline for industry recommender systems!

[Figure: Wide & Deep model]

Wide & Deep Learning for Recommender Systems

  • Memorization learns the frequent co-occurrence of items or features and exploits the correlations available in the training data
  • Generalization explores new feature combinations that have never or rarely occurred in the past via transitivity of correlation
  • Memorization results in recommendations that are more directly relevant to the user's history, while generalization tends to improve the diversity of the recommended items. [Sijun] this dichotomy sounds like explore/exploit
  • Memorization through cross-product transformations requires manual feature engineering and does not generalize to query-item feature pairs that never appeared in the training data
  • Embedding-based models can generalize to previously unseen feature pairs by learning a low-dimensional dense embedding for each query and item. However, it is difficult to learn effective embeddings when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with narrow appeal; the embeddings over-generalize and recommend less relevant items. In this case, linear models with cross-product features can memorize these "exception rules" with far fewer parameters. [Sijun] this is an interesting point of view on embeddings in RecSys. I think this is more about matrix factorization than word2vec. Coming from NLP, I rarely think about whether niche words are over-generalized. The toy sketch below illustrates the contrast.
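
A toy contrast in plain numpy. The app names mirror the paper's AND(user_installed_app=netflix, impression_app=pandora) example; everything else here is made up, and the random embeddings only illustrate that a dot product yields *some* score for any pair, not learned behavior:

```python
import numpy as np

# Wide / memorization: a binary cross-product feature fires only for an exact,
# hand-engineered combination of feature values seen in the training data.
def cross_feature(example, combo):
    return float(all(example.get(k) == v for k, v in combo.items()))

combo = {"installed_app": "netflix", "impression_app": "hulu"}
seen = {"installed_app": "netflix", "impression_app": "hulu"}
unseen = {"installed_app": "netflix", "impression_app": "pandora"}
print(cross_feature(seen, combo), cross_feature(unseen, combo))  # 1.0 0.0 -> no signal for unseen pairs

# Deep / generalization: dense embeddings give some score to any pair,
# which helps coverage but can over-generalize for niche queries/items.
rng = np.random.default_rng(0)
emb = {app: rng.normal(size=8) for app in ["netflix", "hulu", "pandora"]}
print(emb["netflix"] @ emb["pandora"])  # nonzero even though the pair was never observed
```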

Wide & Deep Learning

[Figure: Wide & Deep model]

  • Wide component is a GLM $y=\textbf{w}^{T}\textbf{x}+b$. One of the most important transformations is the cross-product transformation, defined as $\phi_{k}(\textbf{x}) = \prod_{i=1}^{d} x_{i}^{c_{ki}}, \ \ c_{ki} \in \{0, 1\}$, which is 1 only when all of its constituent binary features are 1
  • Deep component is a feed-forward NN where sparse, high-dimensional categorical features are first converted into low-dimensional dense embeddings. The embeddings are initialized randomly and trained jointly with the rest of the model
  • The wide and deep components are combined at the output layer (a weighted sum of their output logits), which is then fed to a logistic loss function for joint training
  • The wide component is optimized by FTRL with L1 regularization, while the deep part is optimized by AdaGrad. [Sijun]: The FTRL algorithm was originally designed to train Google-scale GLMs. It induces sparsity, which is needed in the wide part of the model, as most of the cross-product features are not meaningful. The deep part is trained with AdaGrad, a standard adaptive optimizer for NNs. See the sketch below this list.
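
A minimal TensorFlow sketch of the joint training step. The FTRL-for-wide / AdaGrad-for-deep split and the single logistic loss come from the paper; the feature sizes, the single categorical input, and the small MLP are toy assumptions, nothing like the production setup:

```python
import tensorflow as tf

# Toy sizes (assumed); the real model uses a much larger cross-product feature space.
num_cross_features = 1000
vocab_size, emb_dim = 500, 32

# Wide component: linear model over sparse (binarized) cross-product features.
wide = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# Deep component: one categorical feature -> embedding -> small MLP -> logit.
deep = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

ftrl = tf.keras.optimizers.Ftrl(learning_rate=0.1, l1_regularization_strength=0.01)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.05)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(x_wide, x_cat, y):
    with tf.GradientTape(persistent=True) as tape:
        logits = wide(x_wide) + deep(x_cat)   # combined at the output layer
        loss = bce(y, logits)                 # one common logistic loss -> joint training
    # Separate optimizers for the two parameter groups, as in the paper.
    ftrl.apply_gradients(zip(tape.gradient(loss, wide.trainable_variables),
                             wide.trainable_variables))
    adagrad.apply_gradients(zip(tape.gradient(loss, deep.trainable_variables),
                                deep.trainable_variables))
    return loss

# A random batch just to show the step runs end to end.
x_wide = tf.cast(tf.random.uniform((8, num_cross_features)) < 0.01, tf.float32)
x_cat = tf.random.uniform((8, 1), maxval=vocab_size, dtype=tf.int32)
y = tf.cast(tf.random.uniform((8, 1), maxval=2, dtype=tf.int32), tf.float32)
print(float(train_step(x_wide, x_cat, y)))
```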

System Implementation

Data

  • Build vocabularies, which map categorical feature strings to integer IDs, during data generation. Categorical features are required to have a minimum number of occurrences to be included.
  • Continuous features are normalized to [0, 1] and divided into $n_q$ quantiles: a value in the $i$-th quantile is mapped to $(i-1)/(n_q-1)$. [Sijun]: I hypothesize they normalize to ensure equal feature scale and use quantization to reduce the noise in real-valued features. They kept the quantized feature as a continuous feature instead of a categorical feature to account for its ordinal nature. A sketch of both steps follows this list.
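
A rough sketch of both data-generation steps in plain Python/numpy. The quantile mapping follows the paper's $(i-1)/(n_q-1)$ formula; the min-count threshold, the OOV bucket at id 0, and the fake feature values are my own assumptions:

```python
import numpy as np
from collections import Counter

# Vocabulary: keep categorical values seen at least `min_count` times; rarer values
# fall back to an out-of-vocabulary bucket (id 0 here, an assumption).
def build_vocab(values, min_count=10):
    counts = Counter(values)
    return {v: i + 1 for i, (v, c) in enumerate(counts.most_common()) if c >= min_count}

def lookup(vocab, value):
    return vocab.get(value, 0)  # 0 = OOV bucket

apps = ["netflix", "pandora"] * 20 + ["rare_app"]
vocab = build_vocab(apps)
print(vocab, lookup(vocab, "rare_app"))  # rare_app is below the threshold -> OOV

# Continuous features: find the quantile index i of a value among n_q quantiles
# (boundaries estimated from training data), then emit (i - 1) / (n_q - 1) in [0, 1].
def quantile_normalize(train_values, x, n_q=10):
    boundaries = np.quantile(train_values, np.linspace(0, 1, n_q + 1)[1:-1])
    i = np.searchsorted(boundaries, x) + 1   # quantile index in 1..n_q
    return (i - 1) / (n_q - 1)

train_impressions = np.random.lognormal(mean=3.0, size=10_000)  # fake continuous feature
print(quantile_normalize(train_impressions, 5.0), quantile_normalize(train_impressions, 500.0))
```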

Model Training

[Figure: Wide & Deep model structure]

  • A 32-dimensional embedding vector is learned for each categorical feature. All embeddings, along with the dense features, are concatenated and fed into an MLP with ReLU activations (see the sketch below). [Sijun]: I wonder if the dense features are under-utilized compared with the categorical features that are represented by embeddings
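
A sketch of the deep tower wiring with the Keras functional API. The 32-dimensional embeddings and the (1024, 512, 256) ReLU stack follow the paper's figure; the feature names, cardinalities, and dense-feature count are placeholders:

```python
import tensorflow as tf

# Deep tower: one 32-dim embedding per categorical feature, concatenated with the
# raw dense features, then a stack of ReLU layers ending in a single logit.
cat_features = {"user_demographics": 1000, "device_class": 50, "impression_app": 10_000}
num_dense = 16  # e.g. age, #app installs, engagement stats (placeholder count)

cat_inputs, embedded = [], []
for name, cardinality in cat_features.items():
    inp = tf.keras.Input(shape=(1,), dtype="int32", name=name)
    emb = tf.keras.layers.Embedding(cardinality, 32)(inp)   # 32-dim embedding
    embedded.append(tf.keras.layers.Flatten()(emb))
    cat_inputs.append(inp)

dense_input = tf.keras.Input(shape=(num_dense,), name="dense_features")
x = tf.keras.layers.Concatenate()(embedded + [dense_input])  # dense features enter here as-is
for units in (1024, 512, 256):                               # ReLU stack from the paper's figure
    x = tf.keras.layers.Dense(units, activation="relu")(x)
deep_logit = tf.keras.layers.Dense(1)(x)

deep_model = tf.keras.Model(inputs=cat_inputs + [dense_input], outputs=deep_logit)
deep_model.summary()
```

One thing the wiring makes visible: the handful of raw dense features sit next to hundreds of embedding dimensions in the concatenated vector, which is exactly the imbalance the [Sijun] note above wonders about.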