RaNNC (Rapid Neural Network Connector), automatic parallelization middleware for deep learning developed jointly by the Data-driven Intelligent System Research Center (DIRECT) of the National Institute of Information and Communications Technology (NICT, President: TOKUDA Hideyuki, Ph.D.) and the University of Tokyo (President: FUJII Teruo), won first place in the PyTorch Developer Tools & Libraries category at the PyTorch Annual Hackathon 2021.
PyTorch is the de facto standard framework for deep learning, and this hackathon is the only worldwide event that both awards PyTorch projects and is officially hosted by Facebook, the leading PyTorch developer (https://pytorch2021.devpost.com/). RaNNC drastically simplifies the training of large-scale neural networks, which has been very difficult with PyTorch's existing functionality. RaNNC is available as open-source software, and anyone can download and use it for free, even for commercial purposes.
The PyTorch Annual Hackathon is an event where PyTorch users come together to develop software or machine learning models using PyTorch. It has been held annually since 2019 and is known as the only worldwide event that both awards PyTorch projects and is officially hosted by Facebook, the leading contributor to PyTorch development. This year, 1,947 people from around the world participated, and 65 projects were submitted.
RaNNC is middleware that automatically partitions large-scale neural networks and parallelizes their training across many GPUs. To train a large-scale neural network, users need to partition it and compute the parts on multiple GPUs, because the parameters of a huge neural network do not fit into the memory of a single GPU. However, partitioning a neural network while accounting for both memory usage and the efficiency of parallel processing is very difficult, even for experts. In contrast, RaNNC takes a description of a neural network that is written to be computed on a single GPU and automatically partitions it so that each sub-network fits into the memory of one GPU and a high training speed is achieved. This drastically simplifies the training of large-scale neural networks.
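As a rough illustration of the memory constraint described above, the following sketch groups consecutive layers into stages so that each stage fits into one GPU. It is a toy greedy heuristic, not RaNNC's actual partitioning algorithm, which also has to account for training speed. The function name `partition_layers`, the per-layer memory figures, and the 16 GB GPU are all hypothetical.

```python
from typing import List, Sequence


def partition_layers(layer_mem_gb: Sequence[float], gpu_mem_gb: float) -> List[List[int]]:
    """Greedily group consecutive layers into stages that each fit in gpu_mem_gb.

    Returns a list of stages; each stage is a list of layer indices.
    """
    stages: List[List[int]] = []
    current: List[int] = []
    used = 0.0
    for idx, mem in enumerate(layer_mem_gb):
        if mem > gpu_mem_gb:
            raise ValueError(f"layer {idx} alone needs {mem} GB, more than one GPU has")
        # Close the current stage when the next layer would overflow the GPU.
        if current and used + mem > gpu_mem_gb:
            stages.append(current)
            current, used = [], 0.0
        current.append(idx)
        used += mem
    if current:
        stages.append(current)
    return stages


# Hypothetical per-layer memory footprints (GB) and a hypothetical 16 GB GPU.
layers = [6.0, 5.0, 7.0, 4.0, 4.0, 9.0]
stages = partition_layers(layers, gpu_mem_gb=16.0)
print(stages)
```

Each inner list is the set of layers assigned to one GPU. A real partitioner would also have to balance computation time across stages, which this sketch ignores.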
RaNNC was first released in March 2021 and was significantly updated for the PyTorch Annual Hackathon 2021. To reduce GPU memory usage, the new feature introduced during the hackathon keeps most of a neural network's parameters in main memory, which is much larger than GPU memory, and moves only the necessary parameters to GPU memory just before the GPU computations that use them. This makes it possible to train large-scale neural networks with much less GPU memory.
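The idea of keeping parameters in main memory and fetching them just in time can be sketched as follows. This is a simulation in plain Python, with dictionaries standing in for host and GPU memory. It is not RaNNC's implementation, and the class `OffloadingRunner` and its methods are hypothetical names chosen for illustration.

```python
from typing import Dict, List


class OffloadingRunner:
    """Toy model of parameter offloading: parameters live in host memory and
    are copied to a simulated GPU only while their layer is being computed."""

    def __init__(self, host_params: Dict[str, List[float]]):
        self.host = host_params                 # stand-in for main memory
        self.gpu: Dict[str, List[float]] = {}   # stand-in for GPU memory
        self.peak_gpu_values = 0                # most values resident on the GPU at once

    def _fetch(self, name: str) -> List[float]:
        self.gpu[name] = list(self.host[name])  # simulated host-to-GPU copy
        resident = sum(len(v) for v in self.gpu.values())
        self.peak_gpu_values = max(self.peak_gpu_values, resident)
        return self.gpu[name]

    def _release(self, name: str) -> None:
        del self.gpu[name]                      # free simulated GPU memory

    def forward(self, x: float) -> float:
        for name in self.host:                  # layers run in insertion order
            weights = self._fetch(name)         # load just before use
            x = sum(w * x for w in weights)     # toy "layer" computation
            self._release(name)                 # evict right after use
        return x


params = {"layer0": [1.0, 2.0], "layer1": [0.5], "layer2": [1.0, 1.0, 1.0]}
runner = OffloadingRunner(params)
output = runner.forward(1.0)
total_values = sum(len(v) for v in params.values())
print(output, runner.peak_gpu_values, total_values)
```

In this toy run, the simulated GPU never holds more than the largest single layer (3 values), even though the whole model has 6. The trade-off in a real system is the cost of the host-to-GPU transfers, which this sketch does not model.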
RaNNC has been developed through collaborative research by NICT and the University of Tokyo. This research group has the following members:
| Name | Position |
|---|---|
| Masahiro Tanaka | Senior Researcher, Data-driven Intelligent System Research Center, Universal Communication Research Institute, NICT |
| Kenjiro Taura | Professor, Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo / Director of Information Technology Center, The University of Tokyo |
| Toshihiro Hanawa | Professor, Supercomputing Research Division, Information Technology Center, The University of Tokyo |
| Kentaro Torisawa | NICT Fellow / Associate Director General, Universal Communication Research Institute, NICT / Distinguished Researcher, Data-driven Intelligent System Research Center, Universal Communication Research Institute, NICT |
We confirmed that RaNNC can automatically parallelize the training of a neural network with 100 billion parameters. In previous work, training such a huge network required human experts to significantly rewrite the description of the neural network to optimize parallel processing. RaNNC, however, can automatically parallelize training and achieve a high training speed given a description of a neural network that is not designed for parallel processing.
In addition, some well-known frameworks designed to train large-scale neural networks are applicable only to specific types of networks, such as Transformers, while RaNNC is in principle applicable to any type of neural network.
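To give a sense of scale for a network with 100 billion parameters, the following back-of-the-envelope calculation estimates its training memory. The figures are assumptions made for illustration and are not taken from the RaNNC experiments: 32-bit floating-point values, an Adam-style optimizer that keeps two extra values per parameter, a hypothetical 32 GB GPU, and no activation memory.

```python
import math

NUM_PARAMS = 100_000_000_000   # 100 billion parameters (from the text above)
BYTES_PER_VALUE = 4            # assumption: 32-bit floating point
COPIES_FOR_TRAINING = 4        # assumption: weights + gradients + 2 Adam moments
GPU_MEM_GB = 32                # assumption: hypothetical GPU with 32 GB of memory


def memory_gb(num_values: int, bytes_per_value: int) -> float:
    """Memory in GB (10^9 bytes) needed to store num_values values."""
    return num_values * bytes_per_value / 1e9


weights_gb = memory_gb(NUM_PARAMS, BYTES_PER_VALUE)
training_gb = weights_gb * COPIES_FOR_TRAINING
min_gpus = math.ceil(training_gb / GPU_MEM_GB)

print(f"weights only: {weights_gb:.0f} GB")
print(f"weights + gradients + optimizer state: {training_gb:.0f} GB")
print(f"GPUs needed just to hold this state: at least {min_gpus}")
```

Under these assumptions the weights alone exceed the memory of any single GPU by more than an order of magnitude, which is why partitioning across GPUs is unavoidable at this scale.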
The source code and usage examples of RaNNC are now available on GitHub (https://github.com/nict-wisdom/rannc). RaNNC is licensed under the MIT License, which allows users to use RaNNC for free, even for commercial purposes.