Pruning and distillation are currently the hottest techniques for compressing big models like BERT. Recently, multiple approaches have come out to make large Transformer models more efficient:
-Distilling BERT Models with spaCy:
Distills multilingual BERT, fine-tuned on a sentiment analysis dataset, into spaCy's convolutional neural networks (see the distillation sketch after this list).
-DistilBERT:
A smaller language model that performs similarly to BERT on downstream tasks while being faster. The model, however, still requires a lot of compute for pretraining.
-Multilingual MiniBERT:
A smaller (3-layer) BERT model obtained by distilling multilingual BERT.
-Adaptive attention span:
Facebook researchers propose an adaptive attention span that makes it more efficient to scale Transformers to long sequences (a sketch of the masking idea also follows below).
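
Most of the items above rely on the same teacher-student recipe: a small student model is trained to match the softened output distribution of the large fine-tuned teacher. Here is a minimal PyTorch sketch of that recipe; the toy models, the temperature, and the loss weighting are illustrative assumptions, not the exact setup of any of the projects listed.

```python
# Minimal knowledge distillation sketch: a small "student" learns to match
# the softened outputs of a large "teacher". Models and hyperparameters
# below are toy assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, FEATURE_DIM, TEMPERATURE, ALPHA = 3, 128, 2.0, 0.5

# Stand-ins for a fine-tuned BERT teacher and a much smaller student.
teacher = nn.Sequential(nn.Linear(FEATURE_DIM, 512), nn.ReLU(), nn.Linear(512, NUM_CLASSES))
student = nn.Sequential(nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=TEMPERATURE, alpha=ALPHA):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients match the hard-label loss
    # Hard targets: ordinary cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
features = torch.randn(32, FEATURE_DIM)        # dummy batch of encoded inputs
labels = torch.randint(0, NUM_CLASSES, (32,))  # dummy sentiment labels

with torch.no_grad():                          # the teacher stays frozen
    teacher_logits = teacher(features)

loss = distillation_loss(student(features), teacher_logits, labels)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

Only the student's parameters are updated; the teacher is run once per batch in inference mode, which is why distillation is cheap relative to pretraining the small model from scratch.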
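The adaptive attention span works differently: each attention head learns how far back it actually needs to look, and attention weights beyond that learned span are softly ramped down to zero. The sketch below shows only this soft masking step; the tensor shapes, ramp width, and the omitted span penalty term are simplifications of the full method.

```python
# Sketch of the soft span mask: a learnable span parameter decides how much
# past context a head attends to, with a linear ramp down to zero beyond it.
# Shapes and the ramp width are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpanMask(nn.Module):
    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # Learnable fraction of the maximum span (a single scalar here).
        self.span_frac = nn.Parameter(torch.tensor(0.5))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, query_len, key_len) raw attention logits,
        # with keys ordered from oldest to most recent.
        key_len = attn_scores.size(-1)
        span = self.span_frac.clamp(0, 1) * self.max_span
        # Distance of each key position from the current query (0 = nearest).
        distance = torch.arange(key_len - 1, -1, -1, device=attn_scores.device)
        # Soft mask: 1 inside the span, linear ramp down to 0 outside it.
        mask = ((span + self.ramp - distance) / self.ramp).clamp(0, 1)
        weights = F.softmax(attn_scores, dim=-1) * mask
        # Renormalise so the masked weights still sum to 1.
        return weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)

scores = torch.randn(2, 1, 512)          # dummy scores: one query over 512 keys
attn = AdaptiveSpanMask(max_span=512)(scores)
print(attn.shape, attn.sum(-1))          # weights sum to ~1 per query
```

Because the mask is differentiable in the span parameter, heads that do not need long context shrink their span during training, which is what reduces memory and compute for long sequences.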
