Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

Banruo Liu*, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulating DNN training workloads. We argue that performance can be observed accurately by executing the training workload on a subset of real nodes while emulating the networked execution environment and the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
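To make the core idea concrete, the following minimal sketch (a hypothetical illustration, not the actual NeuronaBox implementation) shows how a collective operation such as allreduce could be emulated on a single real node: the call returns the mathematically expected result while injecting a delay derived from a simple ring-allreduce cost model. The function name, the bandwidth/latency parameters, and the cost model are all assumptions introduced here for illustration.

```python
import time

def emulated_allreduce(grad, world_size, link_bandwidth_gbps=100.0, latency_s=5e-6):
    """Emulate a sum-allreduce across `world_size` peers on one real node.

    Hypothetical sketch: peers are assumed to contribute identical
    gradients, so the reduced result is the local gradient scaled by
    `world_size`. The communication time follows a simple ring-allreduce
    cost model instead of a real network exchange.
    """
    size_bits = len(grad) * 32  # assume fp32 elements
    # A ring allreduce moves roughly 2*(n-1)/n of the data over each link.
    transfer_s = 2 * (world_size - 1) / world_size * size_bits / (link_bandwidth_gbps * 1e9)
    comm_time = 2 * (world_size - 1) * latency_s + transfer_s
    time.sleep(comm_time)  # stand-in for the emulated network delay
    reduced = [g * world_size for g in grad]
    return reduced, comm_time

if __name__ == "__main__":
    grads, t = emulated_allreduce([0.5, -1.0, 2.0], world_size=4)
    print(grads)  # [2.0, -4.0, 8.0]
```

Under this assumption the real node experiences timing close to a genuine multi-node run, which is the property the emulation approach relies on; a faithful implementation would replace the analytical delay with measurements or a detailed network model.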

Original language: English (US)
Title of host publication: APSys 2024 - Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems
Publisher: Association for Computing Machinery, Inc.
Pages: 88-94
Number of pages: 7
ISBN (Electronic): 9798400711053
State: Published - Sep 4, 2024
Event: 15th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys 2024 - Kyoto, Japan
Duration: Sep 4, 2024 - Sep 5, 2024

Publication series

Name: APSys 2024 - Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems

Conference

Conference: 15th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys 2024
Country/Territory: Japan
City: Kyoto
Period: 9/4/24 - 9/5/24

Keywords

  • Distributed Deep Learning Training
  • DNN Training Emulation
  • Machine Learning Systems

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
