TY - GEN
T1 - WiP
T2 - 2024 Workshop on Edge and Mobile Foundation Models, EdgeFM 2024
AU - Wang, Liangyu
AU - Wang, Junxiao
AU - Wang, Di
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/6/3
Y1 - 2024/6/3
N2 - Most large language models (LLMs) in everyday use are not deployed locally: users must send relatively private and important data to the LLM provider in order to use them. Handing such data over raises concerns, especially now that many people rely on LLMs for personal and work tasks, and these concerns cannot easily be dispelled by guarantees and agreements. However, LLMs are resource-intensive and computationally demanding, which makes the transition from server-side to device-side difficult: the self-attention module contains a large number of tensor multiplications that are heavy and inefficient on hardware. While previous work has proposed approximate neural operators that enable hardware-efficient, multiplication-less neural networks, these operators introduce significant accuracy loss, making them impractical. In this paper, we examine the problem of light adaptation of LLMs. We propose a new neural operator that allows the adapted LLM to recover its original accuracy with no fine-tuning or only a few fine-tuning steps, while offering high hardware inference efficiency.
AB - Most large language models (LLMs) in everyday use are not deployed locally: users must send relatively private and important data to the LLM provider in order to use them. Handing such data over raises concerns, especially now that many people rely on LLMs for personal and work tasks, and these concerns cannot easily be dispelled by guarantees and agreements. However, LLMs are resource-intensive and computationally demanding, which makes the transition from server-side to device-side difficult: the self-attention module contains a large number of tensor multiplications that are heavy and inefficient on hardware. While previous work has proposed approximate neural operators that enable hardware-efficient, multiplication-less neural networks, these operators introduce significant accuracy loss, making them impractical. In this paper, we examine the problem of light adaptation of LLMs. We propose a new neural operator that allows the adapted LLM to recover its original accuracy with no fine-tuning or only a few fine-tuning steps, while offering high hardware inference efficiency.
KW - large language model
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85197283684&partnerID=8YFLogxK
U2 - 10.1145/3662006.3662065
DO - 10.1145/3662006.3662065
M3 - Conference contribution
AN - SCOPUS:85197283684
T3 - EdgeFM 2024 - Proceedings of the 2024 Workshop on Edge and Mobile Foundation Models
SP - 30
EP - 32
BT - EdgeFM 2024 - Proceedings of the 2024 Workshop on Edge and Mobile Foundation Models
PB - Association for Computing Machinery, Inc
Y2 - 3 June 2024 through 7 June 2024
ER -