Reflecting on My Time at Twitter Cortex

A few weeks ago I left Twitter Cortex, where I worked for the last 3 years. Flipping through my blog, I found an old post written in November 2018 announcing my new journey at Twitter. Looking back at the past 3 years, I had a great run - I accomplished everything I set out to accomplish when I joined and more. To wrap everything up, I decided to dust off the cobwebs on my blog and write a post reflecting on 4 projects that I worked on at Twitter and the lessons I learned from them.

NER and My First End-to-End ML Project

One of the reasons why I joined Twitter is that I wanted to work on production NLP models end-to-end and iterate with user feedback. And I got a taste of that immediately, as my first project was to work on the NER system at Twitter. Since Twitter didn’t have a mature ML infrastructure for deep learning back then, we worked on everything from data sampling and labeling to model training and serving. In hindsight, this was the best starting project that I could ask for. On one hand, I was already familiar with the tech stack (Python and TensorFlow for modeling, Scala for data and serving work) so I was able to hit the ground running with minimal ramp-up time. On the other hand, the project took me through the entire ML lifecycle at Twitter and exposed me to the various “raw” elements of production ML systems in the industry (folks who joined a year later didn’t have the same luxury, since the internal ML infrastructure had matured and a lot of abstractions had been adopted).

I worked on the NER system on and off throughout my time at Twitter (as an engineer and later as the lead). Our tech started out quite far behind, but we gradually caught up with the NLP industry standard over the 3 years. In a way, our progress reflected not only our own effort but also the NLP industry moving forward. We started with language-specific models based on BiLSTM-CRF and moved to BERT as it took over the industry. Later we worked on multilingual models, cross-lingual models, distantly-supervised training techniques, model distillation and more.

[Figure: NER]
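For readers unfamiliar with the modeling side, here is a minimal sketch of the BERT-style token-classification setup that replaced our BiLSTM-CRF models, written against the open-source Hugging Face transformers API as a stand-in for our internal TensorFlow stack; the checkpoint, tag set and labels are illustrative, not what we actually shipped:

```python
# Minimal sketch of BERT-based NER as token classification (illustrative,
# not Twitter's internal stack).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical tag set

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One training step on a single toy example.
enc = tokenizer("Jack lives in London", return_tensors="pt")
# Dummy subword-level labels; -100 masks positions (e.g. [CLS]/[SEP]) from
# the loss. A real pipeline aligns word-level gold tags to subword pieces.
tags = torch.full(enc["input_ids"].shape, -100)
tags[0, 1] = labels.index("B-PER")  # tag the first content token as a person
loss = model(**enc, labels=tags).loss
loss.backward()
```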

About a year and a half into the job, I became less interested in further improving the NER system. Beyond the fact that I was getting a bit tired of sequence labeling models, the key issue with NER (and other NLP systems at Twitter at the time) was that it was too far away from the product surfaces and the users. Like its peers, Twitter tracks its metrics religiously through A/B testing. The closer an ML model is to the users, the more causal effect it has on the core metrics that Twitter is after, and the more incentive there is for people to improve that model. Unfortunately, the NER models sit quite low in the modeling stack, with layers of aggregations and models on top of them before you reach the actual products. Doing frequent A/B testing on them means controlling many variables: difficult, expensive, and probably not worth it.

Self-Supervised Learning and Finding a Product Fit

The biggest challenge I faced when working on NER was the reliance on human-labeled data. This was exacerbated by having to support many languages, where each language needs its own labeled data. As we moved from popular languages to low-resource languages, the impact went down but the cost of acquiring labels went up. Learning from my experience working on NER, I set out to look for problems with little to no reliance on human-labeled data and with natural multilingual support.

Self-supervised learning quickly jumped out as the solution to my problem. Since the labels are generated in a self-supervised fashion, we could generate an almost unlimited amount of data in any given language. I started with DeepMoji, where millions of emoji occurrences are used to pretrain models to learn a rich emotional representation, and then moved on to similar ideas applied to other entities on Twitter (e.g. hashtags, user handles, URLs) for various semantic representations.

[Figure: self-supervised training]
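The core trick is that the training labels fall out of the data itself. Here is a minimal sketch of DeepMoji-style label generation, where the emoji in a tweet becomes the label and the emoji-free text becomes the input; the regex covers only one emoji block and the tweets are made up for illustration:

```python
# Self-supervised label generation: no human annotation needed.
import re

EMOJI = re.compile("[\U0001F600-\U0001F64F]")  # basic emoticon block only

def emoji_examples(tweets):
    """Yield (text_without_emoji, emoji_label) pairs for pretraining."""
    for tweet in tweets:
        found = EMOJI.findall(tweet)
        if len(found) == 1:  # keep tweets with exactly one emoji for a clean label
            yield EMOJI.sub("", tweet).strip(), found[0]

tweets = ["just aced my exam 😄", "missed the bus again 😞", "no emoji here"]
print(list(emoji_examples(tweets)))
# [('just aced my exam', '😄'), ('missed the bus again', '😞')]
```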

Since data was readily available, we were able to build the models and services in no time. The challenge, however, was that we didn’t know how to use the models. While the models contained rich semantic knowledge about the labels they were trained on (e.g. emoji for emotional representations, hashtags for topical/event representations), we had no immediate product application. For over a year, we made numerous attempts to find the right product fit. Some were successful but most turned out to be failures. Eventually, we were able to produce value by using the embeddings for candidate generation and as features in ranking models (you can read more about it in this post), but the entire project was very much like pushing a rock uphill.
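For the candidate generation use case, the idea is simply to retrieve the items whose embeddings are closest to a query embedding. A minimal sketch with cosine similarity follows; real systems would use an approximate nearest-neighbor index, and the vectors below are random placeholders:

```python
# Embedding-based candidate generation via cosine similarity (sketch only).
import numpy as np

def top_k_candidates(query_vec, candidate_vecs, k=2):
    """Return indices of the k candidates most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

query = np.random.rand(64)             # e.g. a hashtag embedding
candidates = np.random.rand(1000, 64)  # e.g. tweet embeddings
print(top_k_candidates(query, candidates))
```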

In hindsight, I made the classic error of building a hammer first and then looking for nails. As a technologist, I started with the technology that I was personally interested in, not a product problem that I wanted to solve.

Tech Lead

I was the second NLP engineer to join Twitter Cortex. Over the three years, my team grew to almost 20 NLP engineers and scientists. Two years in, I was trusted to become the Tech Lead of my team. At Twitter, becoming Tech Lead is not a promotion but a role change. The most obvious change compared to being an individual contributor (IC) engineer is that, beyond my own work, I had to care about other people’s work. The role change was quite drastic, as I immediately started to lose 30-50% of my time to other responsibilities such as setting up technical processes and directions, helping teammates with their projects, etc. The increase in meeting load also made it difficult to find the long stretches of focus time that were key for my hands-on work. Towards the end of my time at Twitter, I no longer worked on any significant project as the primary engineer. Instead, I “worked” on every project on my team by reviewing, facilitating and giving technical mentorship and guidance.

While becoming a Tech Lead is a natural step of career progression, I struggled with it at the beginning. Beyond the change in daily schedule described above, the most fundamental change was the indirect nature of the work and the loss of control. When working on a project as an IC, I got to drive and control the project in a very direct fashion. But as the Tech Lead, I could only influence a project indirectly by reviewing and giving recommendations. Other Tech Lead responsibilities are even more indirect, such as setting up technical processes and infrastructure. As I read some books on the topic (such as The Manager’s Path and Staff Engineer) and spent more time in the role, I started to understand it. As a Tech Lead, the work is no longer about me working on the most technically sophisticated project (doing addition) but about how I empower my team (doing multiplication). Becoming a Tech Lead is the most common way to reach the staff level because it develops a skillset outside of writing code, such as project management and people skills. These are necessary skills for engineers to grow beyond a certain point (among staff+ engineers, domain-expert ICs who work on key components do exist, but Tech Leads far outnumber them). As engineers expand their circle of influence, it’s natural to trade depth for breadth.

Building ML Tooling

As part of my Tech Lead responsibilities, I started looking at improving the ML development tooling and processes of my team. The state of ML was in flux at Twitter, as a major migration from TensorFlow 1 to TensorFlow 2 was underway. Various modern ML tools such as Jupyter Notebooks, TFX and GCP offerings had just been introduced to Twitter and were at different points on the adoption curve. In a way, working on ML was harder than before: previously we had one inefficient but proven way to do things, whereas now we had many productive yet unfamiliar options. Outside Twitter, the NLP industry was also rapidly moving forward. By early 2021, BERT and various transformer-based pre-trained models had become the norm. With its fast tokenizers, many pre-trained models and easy-to-use API, HuggingFace had become the new favorite of the industry.

Through a separate project, we discovered it was surprisingly difficult to finetune a pre-trained model at Twitter. This was partially due to the lack of ML tooling at Twitter, but also because of TensorFlow 1 and the TF Estimator interface. Fortunately, all of these restrictions were being lifted and it was time for Twitter to catch up with the industry standard. We asked around and there was obviously a lot of interest in a HuggingFace-esque ML development experience. The opportunity was apparent: a one-liner experience to share, load, finetune and serve pre-trained models at Twitter. For various infrastructure reasons, HuggingFace wasn’t the best fit for us, so we ended up developing our own lightweight library optimized for Twitter’s ML needs. Most of the work was talking to our intended customers and testing the various platforms we needed to support. The end product adds up to less than 200 lines of code, but it quickly became the go-to tool for anybody doing text modeling at Twitter. It not only made it convenient for our modelers to get better model performance by finetuning pre-trained models, but also established an incentive for our researchers to keep experimenting with pre-training techniques.
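To make the “one-liner” idea concrete, here is a hypothetical sketch of what such a thin wrapper might look like, delegating to the open-source transformers library; the registry, model names and `load` function are invented for illustration and are not Twitter’s actual internal API:

```python
# Hypothetical sketch of a thin "one-liner" model-loading wrapper.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Invented registry mapping short shared names to checkpoints.
_REGISTRY = {"twitter-bert-base": "bert-base-multilingual-cased"}

def load(name, num_labels=2):
    """One-liner load: resolve a shared model name to a ready-to-finetune model."""
    checkpoint = _REGISTRY[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model

tokenizer, model = load("twitter-bert-base")
```

Most of the value of a wrapper like this lies not in the code but in the shared registry of checkpoints it standardizes, which is how such a small library can still become the go-to tool.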

In hindsight, this was my favorite project at Twitter. The project made sense technologically (the ML tech stack was ripe for it) and strategically (it unlocked future workstreams). Once we had figured out exactly what to build by talking among ourselves and to our customers, the engineering itself wasn’t all that important.