TY - GEN
T1 - Unlocking the Power of Inline Floating-Point Operations on Programmable Switches
AU - Yuan, Yifan
AU - Alama, Omar
AU - Fei, Jiawei
AU - Nelson, Jacob
AU - Ports, Dan R.K.
AU - Sapio, Amedeo
AU - Canini, Marco
AU - Kim, Nam Sung
N1 - Publisher Copyright:
© 2022 by The USENIX Association. All Rights Reserved.
PY - 2022
Y1 - 2022
N2 - The advent of switches with programmable dataplanes has enabled the rapid development of new network functionality and provided a platform for accelerating a broad range of application-level functionality. However, existing switch hardware was not designed with application acceleration in mind, so applications requiring operations or datatypes not used in traditional network protocols must resort to expensive workarounds. Applications involving floating-point data, including distributed machine learning training and distributed query processing, are key examples. In this paper, we propose FPISA, a floating-point representation designed to work efficiently in programmable switches. We first implement FPISA on an Intel Tofino switch, but find that it has limitations that impact throughput and accuracy. We then propose hardware changes to address these limitations based on the open-source Banzai switch architecture, and synthesize them in a 15-nm standard-cell library to demonstrate their feasibility. Finally, we use FPISA to implement accelerators for machine learning training as an example application, and evaluate their performance on a switch implementing our changes using emulation. We find that FPISA allows distributed training to use one to three fewer CPU cores and provide up to 85.9% better throughput than SwitchML in a CPU-constrained environment.
UR - http://www.scopus.com/inward/record.url?scp=85137497379&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85137497379
T3 - Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
SP - 683
EP - 700
BT - Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
PB - USENIX Association
T2 - 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
Y2 - 4 April 2022 through 6 April 2022
ER -