“The 3 ingredients to our success.” | Winners dish on their solution to Google’s QUEST Q&A Labeling | Kaggle Winner’s Interview

Kaggle Team
Published in Kaggle Blog · 10 min read · Mar 4, 2020

First place foursome, ‘Bibimorph’ share their winning approach to the QUEST Q&A Labeling competition by Google, and more!

Photo by Anna Vander Stel on Unsplash

Congratulations to the (four!) first-place winners of the Google QUEST Q&A Labeling competition, Dmitriy Danevskiy, Yury Kashnitsky, Oleg Yaroshevskiy, and Dmitry Abulkhanov, who make up the team “Bibimorph”!

In the QUEST Q&A Labeling competition by Google, participants were challenged to build predictive algorithms for different subjective aspects of question-answering. The provided dataset contained several thousand question-answer pairs, mostly from StackExchange. These pairs were human-labeled to reflect whether the question was well-written, whether the answer was relevant, helpful, satisfactory, contained clear instructions, etc. Results from the competition will hopefully foster the development of Q&A systems, contributing to them becoming more human-like. In this winner interview, we catch up with team Bibimorph to learn more about their approach to solving this unique challenge:

Hi guys. Let’s get to know you! What would you like to share about your backgrounds, including how you got started with Kaggle?

Dmitriy Danevskiy

Dmitriy: My background is in applied math and physics. Four years ago, I started doing machine learning, primarily for industrial applications. For two years, I worked for a small AI services company, where I was responsible for designing and implementing complex deep learning solutions for time series segmentation, face recognition, and speech processing. Now I work for a startup called Respeecher, where I’m one of the lead engineers doing cutting-edge research on audio synthesis.

I’ve got a long track record of participating in Kaggle competitions without constraining myself to any particular problem or domain. I think modern deep learning methods are quite universal and can be applied to almost any unstructured or structured data. This might sound controversial, but I’ve proven it empirically by earning gold medals in image, text, audio, and tabular data competitions. Winning the Google QUEST Q&A Labeling competition was of particular importance for me: it earned me the honorable Competitions Grandmaster tier.

Yury Kashnitsky

Yury: My background is also in physics and applied math (Ph.D.). Keen on model aircraft as a kid, I entered the Moscow Institute of Physics and Technology, where I studied aviation. At that time, I was only starting to program, and I learned Python while working with databases for virtual reality applications. After a couple of years of databases and Business Intelligence, I switched to academia, entering a full-time Ph.D. program in applied math. Then came my first Data Scientist position at the Russian IT giant Mail.Ru Group, and currently I live and work in the Netherlands, doing R&D mostly in NLP. For the last three years, I’ve been leading mlcourse.ai, an open ML course with a heavy emphasis on Kaggle competitions.

As for Kaggle, I’ve got a long story of learning, suffering, and learning again. After early experiments with AutoML and entering every competition I could reach, I started competing seriously, mostly in NLP competitions, some two years ago. There was a long period of imposter syndrome when my mlcourse.ai students would win gold medals in one competition after another, while I was only lucky enough to climb to the top of the silver zone. Winning the Google QUEST Q&A Labeling competition finally brought me that long-awaited Competitions Master tier.

And here’s Elmo with the Nvidia Quadro P6000 card:

Oleg Yaroshevskiy

Oleg: Hi! I’ve got a background in applied statistics and Computer Science. As a student, I was interested in the impact of technology on society and in the works of the cybernetics pioneers Norbert Wiener and Victor Glushkov. Encouraged by Andrej Karpathy’s famous article “The Unreasonable Effectiveness of Recurrent Neural Networks,” I decided to switch from software engineering to machine learning.

Since day one as a research engineer, I’ve been designing deep models for speech processing, machine translations, machine comprehension, and other NLP tasks. In July 2017, I learned about transformers, and that changed everything in my career. Passionate about literature and written arts, I hope one day to see AI-generated plays or even to create one of them.

Today I am a research engineering consultant and an active Kaggler. By letting you fit hundreds of architectures, I believe, Kaggle helps build deep intuition for training deep neural networks and for pattern recognition. I encourage others to give data science competitions a try and to join this fast-growing community of Kaggle enthusiasts.

Dmitry Abulkhanov

Dmitry: I’ve got a background in math and physics as well; I studied at the Moscow Institute of Physics and Technology and at the Yandex School of Data Analysis. As a student, I participated in many data science hackathons, where I came to understand that there are no unsolvable problems, only a lack of available time. I believe that competing builds the expertise you need to tackle all kinds of data science problems.

Currently, I work as an NLP researcher at Huawei.

What a crew! How did your team form? And how did you help one another succeed in this competition?

Yury: Three of the four of us entered the Google QUEST Q&A Labeling competition right after the TensorFlow 2.0 Question Answering competition, where we narrowly missed gold medals and therefore wanted revenge!

What helped was that this competition had the same format as our two previous ones (a Code Competition), so we felt ready! Most of the scoring errors and peculiarities that drove other participants mad in the first 2–3 weeks of the Google QUEST Q&A Labeling competition were easy for us to handle.

So Dmitriy D., Oleg, and I merged with Dmitry A., who came up with a powerful technique for language model pretraining with StackExchange data.

Oleg started with a simple PyTorch baseline based on one of the public Notebooks (this one, kudos to Nirjhar Roy). He also trained the BART model. Dmitry A. and Yury mostly worked on pretraining language models. Dmitriy D. led the model training and set up the validation and model blending schemes.

We think competing as a team was the best part of the whole experience, and it let us win this competition by quite a good margin.

What was your team’s most important finding?

In two words: transfer learning. Considering we had a pretty small labeled dataset in this competition, leveraging large amounts of unlabeled data turned out to be crucial.

But actually we have three major “secret” ingredients of our solution to share:

  1. Language model pre-training
  2. Pseudo-labeling
  3. Post-processing predictions

Secret ingredient #1: language model pre-training

We used an additional dump of ~7 million StackExchange questions to fine-tune the BERT language model with a Masked Language Modeling objective (MLM; see the BERT paper for details) and an additional Sentence Order Prediction (SOP) task (refer to the ALBERT paper).

Apart from that, we also built auxiliary targets: while fine-tuning the LM, we also predicted five targets (question_score, question_view_count, question_favorite_count, answer_score, answers_count) that we engineered from the StackExchange data.

We used a customized, extended cased vocabulary for a simple reason: StackExchange questions often contain not just plain natural language but also math and code. Extending the vocabulary with LaTeX symbols, pieces of math formulae, and parts of code snippets helped capture this.
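For illustration, here is a minimal sketch of how such domain-adaptive MLM fine-tuning with an extended vocabulary can be set up with the Hugging Face transformers library. It is a simplified sketch rather than our exact training code: the SOP head and the auxiliary StackExchange targets are omitted, and the file name and token list are just placeholders.

```python
# Simplified illustration of domain-adaptive MLM fine-tuning with an
# extended vocabulary (SOP head and auxiliary targets omitted).
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Extend the cased vocabulary with domain tokens (LaTeX commands, code
# fragments, etc.); this token list is purely illustrative.
tokenizer.add_tokens(["\\frac", "\\begin{equation}", "np.array", "print("])
model.resize_token_embeddings(len(tokenizer))

# "stackexchange.txt" is a placeholder file with one question/answer per line.
dataset = load_dataset("text", data_files={"train": "stackexchange.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens, the standard MLM setup.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_stackexchange_mlm",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```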

In general, this additional LM pretraining step played a crucial role in improving our models for two reasons:

  • Transfer learning. Our models had “seen” roughly 10x more data before they were trained on the competition data.
  • Domain adaptation. Owing to the customized vocabulary and the auxiliary targets used for LM fine-tuning, our pretrained models were much better adapted to the data at hand.

Secret ingredient #2: pseudo-labeling

Pseudo-labeling was once a hot topic on Kaggle; by now it’s a well-known and commonly used technique.

Image credit: “Pseudo-labeling a simple semi-supervised learning method” tutorial by Vinko Kodžoman

The idea is summarized in the figure above; you can refer to the mentioned tutorial for details. In a couple of words, one can use model predictions for some unlabeled dataset as “pseudo-labels” (meaning they are not the actual ground-truth labels) to extend the labeled training dataset at hand.

We used pseudo-labels with 20k and 100k samples from a StackExchange questions dump to improve three out of four trained models.
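Schematically, one round of pseudo-labeling looks like the snippet below. This is a simplified, framework-agnostic sketch rather than our actual pipeline; train_model_fn is a placeholder for whatever training routine you use.

```python
import numpy as np

def pseudo_label_round(train_model_fn, X_labeled, y_labeled, X_unlabeled):
    """One round of pseudo-labeling.

    train_model_fn(X, y) is a placeholder for your training routine and
    must return a fitted model with a .predict(X) method.
    """
    # 1. Train a "teacher" model on the labeled data only.
    teacher = train_model_fn(X_labeled, y_labeled)

    # 2. Predict soft pseudo-labels for the unlabeled samples
    #    (keeping raw scores preserves the teacher's uncertainty).
    y_pseudo = teacher.predict(X_unlabeled)

    # 3. Retrain a "student" model on labeled + pseudo-labeled data.
    X_all = np.concatenate([X_labeled, X_unlabeled], axis=0)
    y_all = np.concatenate([y_labeled, y_pseudo], axis=0)
    return train_model_fn(X_all, y_all)
```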

Secret ingredient #3: post-processing predictions

The metric chosen for the competition is the Spearman correlation. For each of the 30 target labels, the Spearman correlation between predictions and ground truth is calculated; the average of these 30 correlation coefficients is the final metric.
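In code, the metric can be computed roughly as follows (an illustrative snippet, not the official evaluation script):

```python
import numpy as np
from scipy.stats import spearmanr

def competition_metric(y_true, y_pred):
    """Mean column-wise Spearman correlation; y_true and y_pred have
    shape [n_samples, 30]."""
    per_column = [spearmanr(y_true[:, col], y_pred[:, col]).correlation
                  for col in range(y_true.shape[1])]
    return np.mean(per_column)
```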

As observed in this post on Kaggle, Spearman correlation is quite sensitive to whether some of the predictions are tied (exactly equal) or not:

The toy example above shows that a vector of predictions b can be “thresholded” to produce b2 and thus increase its Spearman correlation with a (ground truth) from 0.89 to 1.

Actually, that was one of the drawbacks of the whole competition: the target metric was a bit too sensitive to hacks like thresholding predictions. Many teams applied various thresholding heuristics as post-processing, often for each of the 30 target columns separately. We clearly recognized that as overfitting. However, we still applied some post-processing to our model predictions.

Instead of thresholding predictions, we discretized them into buckets following the distribution in the training set. The idea is to make the distribution of predictions for a particular target column match the distribution of the corresponding column in the training dataset. For additional details, you can refer to the solution code we shared, namely, this step.
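The snippet below sketches this distribution-matching idea for a single target column. It is a simplified illustration of the approach, not the exact code from our repository.

```python
import numpy as np

def match_train_distribution(preds, train_values):
    """Map raw predictions onto the discrete values seen in training so that
    the predicted distribution follows the training distribution
    (rank-preserving, for one target column)."""
    # Cumulative fraction of training rows at or below each distinct value.
    values, counts = np.unique(train_values, return_counts=True)
    cum_fraction = np.cumsum(counts) / counts.sum()

    # Rank-transform predictions into [0, 1) and look up the training
    # bucket that each rank falls into.
    ranks = preds.argsort().argsort() / len(preds)
    bucket = np.searchsorted(cum_fraction, ranks, side="right")
    bucket = np.clip(bucket, 0, len(values) - 1)
    return values[bucket]
```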

Can you show us what your final solution looks like?

Baseline model

Our baseline model was pretty much vanilla BERT with a linear layer on top of average-pooled hidden states. As input, we passed only the question title, question body, and answer body, separated with special tokens.

Some more tricks (apart from the three “secrets” described above) include softmax-normalized weights for the hidden states from all BERT layers (ELMo-like) and multi-sample dropout.
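A rough PyTorch sketch of such a head is shown below. It is a simplified reconstruction of the architecture described above rather than our exact model code; the dropout rate and the number of dropout samples are placeholders.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class QuestBaseline(nn.Module):
    """Sketch of the baseline head: softmax-weighted mix of all BERT hidden
    layers (ELMo-like), mean pooling over tokens, multi-sample dropout, and
    a linear layer producing the 30 target scores."""

    def __init__(self, model_name="bert-base-uncased", n_targets=30, n_dropout=5):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        n_layers = self.bert.config.num_hidden_layers + 1  # + embedding layer
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.dropouts = nn.ModuleList([nn.Dropout(0.3) for _ in range(n_dropout)])
        self.fc = nn.Linear(self.bert.config.hidden_size, n_targets)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).hidden_states
        # Softmax-normalized mix of the hidden states from all layers.
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, hidden_states))
        # Average-pool over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (mixed * mask).sum(dim=1) / mask.sum(dim=1)
        # Multi-sample dropout: average the outputs over several dropout masks.
        return torch.stack([self.fc(drop(pooled)) for drop in self.dropouts]).mean(dim=0)
```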

Final blending

The final solution is a blend of out-of-fold predictions from four models (two BERT base models, one RoBERTa base, and one BART large), combined with the three “secret” ingredients described above: pretrained language models (“pretrained” superscript), pseudo-labeling (“pl” superscript), and post-processed predictions.
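For illustration, the blend itself can be as simple as averaging the out-of-fold prediction matrices before the post-processing step; the arrays below are random placeholders rather than real model outputs.

```python
import numpy as np

# Placeholder out-of-fold prediction matrices, one per model, shape [n_samples, 30].
rng = np.random.default_rng(0)
n_samples = 1000  # placeholder
bert1_oof, bert2_oof, roberta_oof, bart_oof = (rng.random((n_samples, 30)) for _ in range(4))

# Equal-weight blend of the four models; the result is then post-processed
# with the distribution matching described above.
blend = np.mean([bert1_oof, bert2_oof, roberta_oof, bart_oof], axis=0)
```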

What learnings (if any) are you four taking away from this competition?

Oh, plenty of learnings!

  • Don’t enter a competition too early; let the technical problems be resolved first.
  • With a small training dataset, focus on leveraging additional large datasets in a proper way.
  • Transfer learning is a big deal in NLP indeed, not only in Computer Vision tasks.
  • With a small training dataset, be especially careful with validation and don’t apply hacks that would only overfit your solution to the Public Leaderboard.
  • Look for teammates who can add diversity to the final solution, in terms of skills, approaches, models, etc.

Do you have advice for those just getting started in data science?

We can summarize Yury’s advice from the video “How to jump into Data Science” (with slides shared here).

There are 8 major steps:

  1. Python. Learn the basics of the language through Kaggle Learn, Dataquest, Codecademy, or similar. Advanced Python is hardly needed for a Junior Data Scientist, but it’s good to keep progressing with Python at work.
  2. SQL. Learn the basics (again, Kaggle Learn will do) and refresh your SQL skills before interviews; the rest you’ll pick up at work.
  3. Math. The basics of calculus, linear algebra, optimization, and statistics are essential to understand the toolset you’re going to use. MIT OpenCourseWare is probably the single best resource for that.
  4. Algorithms. To what extent these are needed is a controversial question, but there are two classic courses by R. Sedgewick and T. Roughgarden, and LeetCode practice would also help.
  5. DevOps. A background in software engineering is highly appreciated. The term “ML engineer” is actually much hotter now than “Data Scientist” (the business is not run on Jupyter notebooks; you’ll have to deploy stuff to production). Anyway, you’d better know how to at least use git and Docker.
  6. Machine learning. Basic ML is covered in mlcourse.ai. Some Coursera specializations would also be a good entry point. As for Deep Learning, Stanford’s cs231n or fast.ai are two good options.
  7. Pet projects and/or competitions. It’s good to show that you’ve built a minimum viable product with ML running somewhere under the hood. Through pet projects, you can learn a lot. Competitions are a nice alternative, but be careful with the gamification part and maximize the knowledge that you gain on Kaggle.
  8. Interviews. Don’t just sit and study. Get some practice with interviews. Try. Fail. Learn. Iterate. And one day you’ll succeed.

Did you like this interview?
Did you learn something?
If yes, let us know with some 👏👏👏
