
Figure: Model sizes of different pretrained language models

Hot techniques for compressing big models like BERT are pruning and distillation. Several approaches have recently come out to make big Transformer models more efficient (a minimal sketch of the distillation objective most of them share follows this list):


-Distilling BERT Models with spaCy:

Multilingual BERT, fine-tuned on a sentiment analysis dataset, is distilled into spaCy’s convolutional neural networks.


-DistilBERT:

A smaller language model that performs comparably on downstream tasks while being faster. Pretraining it, however, still requires a lot of compute.


-Multilingual MiniBERT:

A smaller (3-layer) BERT model obtained by distilling multilingual BERT.


-Adaptive attention span:

Facebook researchers propose an adaptive attention span that makes it more efficient to scale Transformers to long sequences (see the mask sketch below).
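

All three distillation entries above share the same basic recipe: a small student network is trained to match the softened output distribution of a large teacher (here, a fine-tuned BERT), in addition to the usual hard labels. Below is a minimal sketch of that objective in PyTorch; the function name and the temperature/alpha defaults are illustrative assumptions, not taken from any of the linked write-ups.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: a weighted sum of
    (1) the KL divergence to the teacher's temperature-softened distribution
    and (2) the ordinary cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients are comparable to the hard term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The adaptive attention span, in turn, multiplies each head's attention weights with a soft mask that is 1 within a learned span and ramps down to 0 beyond it, so heads that only need local context stop attending far into the past. Here is a rough sketch of that masking function, assuming attention weights ordered from the most distant to the most recent position; the class and parameter names are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft span mask: 1 within the learned span, 0 beyond it,
    with a linear ramp of width `ramp_size` in between."""
    def __init__(self, max_span, ramp_size=32):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # Learned fraction of the maximum span (a single scalar here).
        self.span_frac = nn.Parameter(torch.zeros(1))

    def forward(self, attn_weights):
        # attn_weights: (..., span), ordered from distance `span` down to 1.
        span = attn_weights.size(-1)
        distance = torch.arange(span, 0, -1,
                                device=attn_weights.device,
                                dtype=attn_weights.dtype)
        z = torch.clamp(self.span_frac, 0, 1) * self.max_span
        mask = torch.clamp((z - distance) / self.ramp_size + 1.0, 0.0, 1.0)
        # In the paper the masked weights are renormalized to sum to 1
        # and the span parameter gets an L1 penalty to keep spans short.
        return attn_weights * mask
```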




 
 
