TY - GEN
T1 - Optimizing Vision Transformers
T2 - 6th IEEE International Conference on AI Circuits and Systems, AICAS 2024
AU - Bich, Philippe
AU - Boretti, Chiara
AU - Prono, Luciano
AU - Pareschi, Fabio
AU - Rovatti, Riccardo
AU - Setti, Gianluca
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Research on Deep Neural Networks (DNNs) continues to improve their performance across a wide spectrum of tasks, driving their adoption in many fields. This creates the need to extend their use to edge devices with limited resources, a task that has become increasingly difficult with the advent of large Transformer-based models. In this context, pruning emerges as a crucial tool for reducing the number of weights in the memory-hungry Fully Connected (FC) layers. This paper explores the use of neurons based on the Multiply-And-Max/min (MAM) operation, an alternative to the conventional Multiply-and-Accumulate (MAC), in a Vision Transformer (ViT). The Max and Min operations enhance the model's prunability. For the first time, many MAM-based FC layers are employed in a large state-of-the-art DNN model and compressed with various pruning techniques available in the literature. Experiments show that MAM-based layers achieve the same accuracy as traditional layers with up to 12 times fewer weights. In particular, when using Global Magnitude Pruning (GMP), the FC layers following the Multi-head Attention block of a ViT-B/16 model fine-tuned on CIFAR-100 retain only 560,000 weights with MAM neurons, compared to the 31.4 million that remain with traditional MAC neurons.
UR - http://www.scopus.com/inward/record.url?scp=85199885876&partnerID=8YFLogxK
U2 - 10.1109/AICAS59952.2024.10595859
DO - 10.1109/AICAS59952.2024.10595859
M3 - Conference contribution
AN - SCOPUS:85199885876
T3 - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
SP - 337
EP - 341
BT - 2024 IEEE 6th International Conference on AI Circuits and Systems, AICAS 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 April 2024 through 25 April 2024
ER -