TY - GEN
T1 - Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
AU - Liu, Banruo
AU - Ojewale, Mubarak Adetunji
AU - Ding, Yuhan
AU - Canini, Marco
N1 - Publisher Copyright: © 2024 Owner/Author.
PY - 2024/9/4
Y1 - 2024/9/4
AB - We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment and the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
KW - Distributed Deep Learning Training
KW - DNN Training Emulation
KW - Machine Learning Systems
UR - http://www.scopus.com/inward/record.url?scp=85205112835&partnerID=8YFLogxK
U2 - 10.1145/3678015.3680478
DO - 10.1145/3678015.3680478
M3 - Conference contribution
AN - SCOPUS:85205112835
T3 - APSys 2024 - Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems
SP - 88
EP - 94
BT - APSys 2024 - Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems
PB - Association for Computing Machinery, Inc
T2 - 15th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys 2024
Y2 - 4 September 2024 through 5 September 2024
ER -