
Figure: Model sizes of different pretrained language models

Hot techniques for compressing big models like BERT are pruning and distillation. Several approaches have recently come out to make big Transformer models more efficient (a minimal sketch of the distillation objective most of them share follows this list):


-Distilling BERT Models with spaCy:

Multilingual BERT, fine-tuned on a sentiment analysis dataset, is distilled into spaCy’s convolutional neural networks.


-DistilBERT:

A smaller language model that performs comparably on downstream tasks while being faster. Pretraining it, however, still requires a lot of compute.


-Multilingual MiniBERT:

A smaller (3-layer) BERT model obtained by distilling multilingual BERT.


-Adaptive attention span:

Facebook researchers propose an adaptive attention span that makes it more efficient to scale Transformers to long sequences (see the mask sketch below).
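

All three distillation entries above share the same basic recipe: a small student network is trained to match the softened output distribution of a large teacher (here, a fine-tuned BERT), in addition to the usual hard labels. Below is a minimal sketch of that objective in PyTorch; the function name and the temperature/alpha defaults are illustrative assumptions, not taken from any of the linked write-ups.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: a weighted sum of
    (1) the KL divergence to the teacher's temperature-softened distribution
    and (2) the ordinary cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients are comparable to the hard term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The adaptive attention span, in turn, multiplies each head's attention weights with a soft mask that is 1 within a learned span and ramps down to 0 beyond it, so heads that only need local context stop attending far into the past. Here is a rough sketch of that masking function, assuming attention weights ordered from the most distant to the most recent position; the class and parameter names are assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft span mask: 1 within the learned span, 0 beyond it,
    with a linear ramp of width `ramp_size` in between."""
    def __init__(self, max_span, ramp_size=32):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # Learned fraction of the maximum span (a single scalar here).
        self.span_frac = nn.Parameter(torch.zeros(1))

    def forward(self, attn_weights):
        # attn_weights: (..., span), ordered from distance `span` down to 1.
        span = attn_weights.size(-1)
        distance = torch.arange(span, 0, -1,
                                device=attn_weights.device,
                                dtype=attn_weights.dtype)
        z = torch.clamp(self.span_frac, 0, 1) * self.max_span
        mask = torch.clamp((z - distance) / self.ramp_size + 1.0, 0.0, 1.0)
        # In the paper the masked weights are renormalized to sum to 1
        # and the span parameter gets an L1 penalty to keep spans short.
        return attn_weights * mask
```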




 
 
