Seminar Announcement

Training Acceleration for Distributed Machine Learning Applications at Scale: A Network-Centric Approach

  • Speaker: Dr. Yonggang Wen
  • Affiliation: Nanyang Technological University (NTU), School of Computer Science and Engineering (SCSE)
  • Date: Friday, August 24, 2018
  • Time: 1:00pm - 2:00pm
  • Location: Room T3 (NVC)

Abstract

Distributed machine-learning (ML) applications play an important role in fueling the emerging artificial intelligence revolution. In this context, the parameter server (PS) framework is widely used to train models at scale in modern ML systems such as Petuum, MXNet, TensorFlow and Factorbird. It tackles the big-data problem by having worker nodes perform data-parallel computation while server nodes maintain globally shared parameters. However, when training large models, worker nodes frequently pull parameters from server nodes and push updates back to them, which often results in high communication overhead. Our investigations show that modern distributed ML applications can spend up to 5 times more time on communication than on computation. To address this problem, we propose an optimized communication layer for the PS framework, called Parameter Flow (PF). PF takes a Swiss-army-knife approach by stacking three complementary techniques. First, we introduce an update-centric communication (UCC) model that exchanges data between worker and server nodes via two operations: broadcast and push. Second, we develop a dynamic value-bounded filter (DVF) that reduces network traffic by selectively dropping updates before transmission. Third, we design a tree-based streaming broadcasting (TSB) system that efficiently broadcasts aggregated updates among worker nodes. Together, these techniques significantly reduce network traffic and communication time. Extensive performance evaluations have shown that PF can speed up popular distributed ML applications by a factor of up to 4.3 in a dedicated cluster, and up to 8.2 in a shared cluster, compared with a generic PS system without PF. The PF framework has been used by a few industry partners.
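
For readers unfamiliar with update filtering, the general idea of withholding small updates before a push can be sketched as follows. This is only an illustration under stated assumptions, not the DVF presented in the talk: the abstract does not say how DVF selects updates, so the fixed magnitude threshold, the local accumulation of withheld updates, and the names filter_updates and push are all hypothetical.

```python
# Illustrative sketch only: a fixed-threshold update filter.
# This is NOT the DVF from the talk; the threshold rule, the residual
# accumulation, and all names here are assumptions for illustration.

def filter_updates(updates, residual, threshold):
    """Worker side: decide which updates are worth transmitting.

    updates:   dict mapping parameter index -> update value for this step
    residual:  dict of previously withheld updates (modified in place)
    threshold: hypothetical magnitude bound below which an update is withheld
    """
    to_send = {}
    for key, delta in updates.items():
        total = residual.pop(key, 0.0) + delta
        if abs(total) >= threshold:
            to_send[key] = total   # large enough: transmit to the server
        else:
            residual[key] = total  # too small: keep locally for a later step
    return to_send


def push(params, to_send):
    """Server side: apply the received updates to the shared parameters."""
    for key, delta in to_send.items():
        params[key] += delta


params = {0: 0.0, 1: 0.0, 2: 0.0}  # globally shared parameters
residual = {}                       # per-worker withheld updates

# Step 1: only parameter 0 has an update large enough to send.
sent_1 = filter_updates({0: 0.5, 1: 0.01, 2: -0.02}, residual, threshold=0.05)
push(params, sent_1)

# Step 2: parameter 1 crosses the threshold once its withheld update is added.
sent_2 = filter_updates({1: 0.05, 2: -0.01}, residual, threshold=0.05)
push(params, sent_2)
```

In this toy run, one of three updates is transmitted in the first step and one of two in the second, which is the kind of traffic reduction the filtering step aims for; how aggressively a real system can filter without hurting convergence is a separate question that this sketch does not address.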

Speaker's Biography

Dr. Yonggang Wen is an associate professor with the School of Computer Science and Engineering (SCSE) at Nanyang Technological University (NTU), Singapore. He is the Associate Dean (Research) at the College of Engineering (CoE) and the Acting Director of the Nanyang Technopreneurship Centre (NTC) at NTU. He received his PhD degree in Electrical Engineering and Computer Science (minor in Western Literature) from the Massachusetts Institute of Technology (MIT), Cambridge, USA, in 2008. Previously, he worked at Cisco, where he led product development for content delivery networks, which had a global revenue impact of 3 billion US dollars. Dr. Wen has published over 180 papers in top journals and prestigious conferences. His systems research has gained global recognition. His work on Multi-Screen Cloud Social TV has been featured by global media (more than 1600 news articles from over 29 countries) and received the ASEAN ICT Award 2013 (Gold Medal). His work on Cloud3DView for Data Centre Life-Cycle Management, the only academic entry, won the 2015 Data Centre Dynamics Awards - APAC (the 'Oscar' award of the data centre industry) and the 2016 ASEAN ICT Awards (Gold Medal). He is the winner of the 2017 Nanyang Award for Innovation and Entrepreneurship, the highest recognition at NTU. He is a co-recipient of Best Paper Awards at 2016 IEEE Globecom, 2016 IEEE Infocom MuSIC Workshop, 2015 EAI Chinacom, 2014 IEEE WCSP, 2013 IEEE Globecom and 2012 IEEE EUC, and a co-recipient of the 2015 IEEE Multimedia Best Paper Award. He serves on the editorial boards of IEEE Communications Surveys & Tutorials, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Wireless Communications, IEEE Transactions on Signal and Information Processing over Networks, IEEE Access and Elsevier Ad Hoc Networks, and was elected Chair of the IEEE ComSoc Multimedia Communications Technical Committee (2014-2016). His research interests include artificial intelligence, blockchain, cloud computing, green data centers, big data analytics, multimedia networks and mobile computing.