ML Workloads Allocation for CloudRobotics WASP-NEST
Title: ML Workloads Allocation for CloudRobotics WASP-NEST
DNr: Berzelius-2024-55
Project Type: LiU Berzelius
Principal Investigator: Florian Pokorny <fpokorny@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2024-03-01 – 2024-09-01
Classification: 10207
Homepage: https://wasp-sweden.org/nest-project-cloud-robotics/
Keywords:

Abstract

This application is for the continued use of Project Berzelius-2023-97 (Cloud-Robotics-NEST), which is part of the WASP-NEST initiative titled “CloudRobotics-NEST: Intelligent Cloud Robotics for Real-Time Manipulation at Scale” (https://wasp-sweden.org/nest-project-cloud-robotics/), one of the large Network, Excellence Synergies and Teams research projects funded by WASP (https://wasp-sweden.org). As explained in more detail on the above website, our project addresses fundamental cloud robotics challenges. A core part of this endeavor is the development and training of large-scale machine learning models, for which we seek GPU and storage resources in this joint application. The project is coordinated by Assoc. Prof. Florian Pokorny (KTH) in collaboration with co-PIs Prof. Erik Elmroth, Assist. Prof. Monowar Bhuyan (Umeå) and Prof. Martina Maggio (Lund). The project also employs multiple PhD students (2 at KTH and 1 at Umeå), with another one starting shortly at Lund University, who will be the primary users of the requested GPU resources. Two of the current PhD students (KTH: Shutong Jin, Ruiyu Wang) are working on deep neural networks (DNN) for robotic manipulation, and one PhD student (Umeå: Obaidullah Zaland) is targeting federated machine learning (ML) approaches for robotic manipulation in an edge-cloud setting. Two postdocs (Umeå: Antonio Seo, Chanh Nguyen) are working on resource allocation in the cloud setting. Yde Sinnema, a PhD student at Lund, is studying response delay in robotic control. For the DNN task, the application of deep learning methods such as deep reinforcement learning and transformer networks relies heavily on training data, which the project is able to generate at scale from a parallel robotic system at KTH that currently has 32 robot arms.
Approximately 2 TB of initial training data and initial model architectures have been created and tested, and the project is now at a stage where additional GPU compute resources are required to compete with internationally leading research institutions in this research direction.