
Transformers: the era of BERT

by Erika Torres // Nov 18, 2020

A new era of transfer learning in NLP

Credit: Image designed by vectorpocket / Freepik

“Do not forget what you have learned of our past, Rodimus. From its lessons the future is forged.” Optimus Prime

This story started in 2018, when Google published a paper called “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, presenting a new way to solve several NLP problems with a single tool. BERT is a method for pre-training language representations that can then be fine-tuned for different tasks by making only small changes to the architecture. BERT is built on the Transformer architecture, which is why models of this type are commonly called Transformers.

At first, BERT was conceived mainly to solve “Question Answering”, though it was later adapted to perform text classification, sentiment analysis and named entity recognition, among other tasks. The key aspect of this solution is the use of embeddings: pre-trained language representations that serve as the starting point for solving other problems. Around the same time, other pre-training approaches emerged, such as ULMFiT, GPT and ELMo, which are all unidirectional or shallowly bidirectional.

The ML community was very excited about the implications of this new technique, and with good reason. First, using pre-trained models reduces the amount of task-specific data needed to solve a myriad of problems; the need for large datasets had been a major blocker to building NLP solutions with real usability. Second, the models can be repurposed with minor modifications. And third, the results obtained with this new approach were far better than those of its predecessors.
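To make "repurposed with minor modifications" concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `fake_encoder` that stands in for the pre-trained model by returning one random vector per token; the names `fake_encoder` and `LinearHead` are illustrative and not part of any BERT release. The point is only that the task-specific part can be as small as one linear layer: sentence-level tasks read the first ([CLS]) vector, token-level tasks read every vector.

```python
import math
import random

random.seed(0)
HIDDEN = 768  # hidden size of BERT-Base


def fake_encoder(num_tokens):
    """Hypothetical stand-in for a pre-trained encoder: one HIDDEN-sized vector per token."""
    return [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(num_tokens)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """The only task-specific part: a single linear layer followed by softmax."""

    def __init__(self, num_labels):
        self.weights = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)]
                        for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, vector):
        scores = [sum(w * x for w, x in zip(row, vector)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(scores)


hidden_states = fake_encoder(num_tokens=6)  # position 0 plays the role of [CLS]

# Sentence-level task (e.g. sentiment): classify from the first vector only.
sentiment_head = LinearHead(num_labels=2)
sentence_probs = sentiment_head(hidden_states[0])

# Token-level task (e.g. named entity recognition): classify every token vector.
ner_head = LinearHead(num_labels=5)
token_probs = [ner_head(vector) for vector in hidden_states]

print(len(sentence_probs), len(token_probs), len(token_probs[0]))
```

In real fine-tuning the encoder weights are updated together with the head; the sketch leaves training out and only shows how little of the architecture changes from one task to the next.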

This method is the result of Google’s research team working on models that process complete sentences and learn the relationships between the words in a text, whereas earlier approaches represented each word on its own. BERT’s language model is trained bidirectionally, which gives it a deeper sense of language context than a single-direction model. “Single-direction” means that each word is contextualized using only the words to its left or only the words to its right. BERT, in contrast, represents each word using both its left and its right context, starting from the very bottom of the neural network.
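The difference between single-direction and bidirectional context can be sketched as a visibility mask, where row i marks the positions that token i is allowed to use. This is a simplified illustration in plain Python, not BERT's actual implementation; the function name `context_mask` and the example sentence are assumptions made for the sketch.

```python
def context_mask(seq_len, direction):
    """Return a seq_len x seq_len 0/1 matrix; row i marks the positions token i may use."""
    if direction == "bidirectional":
        return [[1] * seq_len for _ in range(seq_len)]
    if direction == "left-to-right":
        return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    if direction == "right-to-left":
        return [[1 if j >= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    raise ValueError(f"unknown direction: {direction}")


tokens = ["the", "bank", "of", "the", "river"]
target = 1  # index of "bank"

for direction in ("left-to-right", "right-to-left", "bidirectional"):
    mask = context_mask(len(tokens), direction)
    visible = [tok for tok, allowed in zip(tokens, mask[target]) if allowed]
    print(f"{direction:>14}: 'bank' is contextualized with {visible}")
```

Only the bidirectional row lets "bank" see both "the" on its left and "river" on its right, which is the context needed to tell a river bank from a financial one.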

Google has released two pre-trained models from the paper: BERT-Base and BERT-Large. The difference in size lets you choose the one that best suits the resources you have at hand. Fine-tuning these models requires a GPU such as a Titan X or GTX 1080.
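To get a feel for the size difference, the rough calculation below counts the encoder weights of both models. The layer counts and hidden sizes (12 layers with hidden size 768 for BERT-Base, 24 layers with hidden size 1024 for BERT-Large) come from the paper. The vocabulary size of 30,522, the 512 positions and the feed-forward width of 4x the hidden size are assumptions taken from the uncased English release, and the function name `bert_parameter_count` is illustrative. Treat the result as an estimate that ignores the pre-training output layers.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Estimate the number of weights in a BERT encoder (pre-training heads excluded)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by one layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each hidden x hidden plus bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward block: hidden -> 4*hidden -> hidden, each with bias.
    intermediate = 4 * hidden
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm
    # Pooler: one dense layer applied to the first token's vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
print(f"Float32 weights alone: {base * 4 / 2**20:.0f} MiB vs {large * 4 / 2**20:.0f} MiB")
```

These estimates land close to the 110M and 340M parameters reported in the paper. The roughly threefold gap, multiplied again by gradients and optimizer state during fine-tuning, is what makes the choice of model depend on the hardware available.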

There are also BERT models trained on domain-specific corpora, such as BioBERT (biomedical text), SciBERT (scientific publications) and ClinicalBERT (clinical notes). Models trained within a specific domain have shown better performance on tasks in that domain. All these models and more are publicly available on GitHub and other sites.

This story will continue…


I hope that this brief introduction to transformers makes you curious enough to go deeper and do a bit more research about them. Thanks for reading!


  1. Repository for BERT:
  2. BERT paper: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”