By Vibin Vijay, Data and AI Solution Specialist at OCF
OCF has been part of the Cluster Challenge at Computing Insight UK, organised by the Science and Technology Facilities Council (STFC), since its inception. For this year’s challenge we chose to do something slightly different: a challenge to exploit AI techniques with NVIDIA A100 GPUs, reflecting the rapid growth of deep learning and AI on modern compute clusters. We wanted to add value to the challenge by giving students industry-leading software and technology that will have an impact on their technical careers and professional lives. Tackling challenges and finding solutions is in OCF’s DNA; all our employees do it in their everyday tasks, big or small.
As an AI specialist in the organisation, I naturally took on the responsibility. I was asked to come up with a design and execution plan involving various stakeholders. My first instinct was to approach our AI partner NVIDIA: I contacted Martin Piercy and Adam Grzywaczewski, and both were very keen to assist us with the project. We came up with various challenge ideas, from computer vision and data analytics to natural language processing (NLP). Finally, we decided on NVIDIA’s conversational AI, involving NVIDIA Riva and Triton server, which is capable of Automatic Speech Recognition (ASR), Text to Speech (TTS), and Natural Language Understanding (NLU). Having NVIDIA on board was a great first step, but it was only one piece of the puzzle. During the challenge, one of the most time-consuming tasks for organisers is allocating access to 40+ users at various times, whilst keeping track of timings, answering questions and so on. We really needed a smart user management system so that I could dedicate my time to the students rather than administering the environment. In addition, we needed some smart GPU allocation to make the best use of the hardware we had available. That’s when we approached our friendly partner Run:ai for support; not surprisingly, they were very keen to get involved.
We only had 2 weeks to execute our plan. The plan was for our engineers to set up a 2U server with three A100 GPUs and to install a fresh operating system and Kubernetes in the first couple of days. Later we installed Run:ai to manage the orchestration of K8s and NVIDIA packages, leaving us a week to design the challenge between OCF and NVIDIA. Everything was going to plan until we had some issues with GPU operators on RHEL 8. Rather than spending time troubleshooting, we chose to move to Ubuntu 20.04 and install Kubernetes there, as this was a known working solution. This cost us a good 3 to 4 days to rearrange, but with a great team working together, OCF, Run:ai and NVIDIA managed to get the project back on track. We also had some challenges sorting out Kubernetes, DNS, IP addresses (internal and external), spinning up pods and so on, because of the many moving parts in the Kubernetes configuration. I would especially like to thank Erez Kirson from Run:ai, who was very helpful throughout the process. Once we had Kubernetes set up, installing NVIDIA Riva and Triton server was straightforward with Adam’s assistance.
We wanted the challenge to be unique, in that each team had to build an end-to-end solution of their choice. Adam Grzywaczewski is a senior data scientist at NVIDIA who has experience in training scientists via the NVIDIA Deep Learning Institute (DLI), and he had a clear understanding of what was achievable during the 3-hour challenge window.
Overview of the competition
Because there are two competitions (online and onsite), we designed the challenge in two parts. The onsite challenge (Part 2) is the continuation of the online challenge (Part 1). Part 1 focuses on planning and designing the solution, whereas Part 2 covers its execution and implementation.
The online challenge ran from 31st October to 9th November. Team leaders had to contact the OCF personnel in charge to book a 4-hour window: 1 hour for preparation and the remaining 3 hours for the actual challenge. Each team was given Run:ai logins, with which they could spin up a job and the corresponding example Jupyter notebook. The example notebook contained Polish-to-English translation, text summarisation and speech recognition examples. The task was to build on these examples to gain points.
We will meet again in a few weeks during the onsite challenge on Friday 2nd December, 9am-12pm, at CIUK Manchester.
On this day the teams will execute Part 2 of the challenge, which continues the Part 1 online challenge. By the end of the challenge, the teams should have implemented the solution design proposed in Part 1 and demonstrated the working solution via a recorded session (max 5 mins). The solutions are then scored by our experts, based on the complexity of the problem, the technology used and the mode of presentation.
By completing both challenges, the teams should have gained a strong knowledge and understanding of conversational AI (ASR, TTS, and NLP) using NVIDIA Riva with the help of Run:ai. Later in the day, at 2pm, the final winner of the CIUK 2022 Cluster Challenge will be announced. The winner will get the opportunity to compete in the International Supercomputing (ISC 2023) Student Cluster Challenge in Hamburg in June 2023.
During the challenge
We had 6 teams, each with 6 researchers/students, participating in the competition from the University of York, University of Bristol (A and B teams), University of Birmingham, University College London and Durham University. Each team had a team leader, who was responsible for agreeing a 4-hour slot with OCF for the challenge. The 4-hour slot was designed by CIUK in two parts: the first hour for user access and familiarisation with the server, and the latter 3 hours for the actual challenge. In our case, however, 36 user accounts were created in Run:ai before the challenge and grouped by their respective projects to run jobs, so it was only a matter of sharing a username and password, and the researchers could spin up their first Jupyter notebook in less than 5 minutes. The remaining preparation time was used to run the example code and become familiar with the Riva platform. Run:ai also acts as an efficient monitoring tool: all the health parameters, such as GPU and CPU utilisation, were analysed in real time.
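The account setup described above (36 accounts, grouped into one project per team) can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the team list comes from this article, but the project names, the username scheme and the `build_accounts` helper are hypothetical and are not Run:ai's actual naming convention or API.

```python
# Hypothetical sketch: team names come from the article; the "ciuk-" project
# prefix and the "<team>-userN" username scheme are assumptions made for
# illustration, not Run:ai conventions.
TEAMS = ["york", "bristol-a", "bristol-b", "birmingham", "ucl", "durham"]
MEMBERS_PER_TEAM = 6


def build_accounts(teams, members_per_team):
    """Return a mapping of project name -> list of usernames for that team."""
    return {
        f"ciuk-{team}": [
            f"{team}-user{i}" for i in range(1, members_per_team + 1)
        ]
        for team in teams
    }


accounts = build_accounts(TEAMS, MEMBERS_PER_TEAM)
total = sum(len(users) for users in accounts.values())
print(f"{len(accounts)} projects, {total} accounts")
```

Preparing a list like this ahead of time means that, on the day, the organiser only has to hand each team its credentials.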
The students seem to have enjoyed the challenge, as they could develop creative ideas of their choice. There were some problems with user privileges, as some teams wanted to bring in newer Python libraries for voice recording, but we were able to resolve these in less than 10 minutes.
After the challenge
Honestly, our team have been very impressed with the creativity shown by the university teams during the challenge. The solution design ideas are varied, ranging from rap-battling robots to learning tools to help people with disabilities. Some teams really went the extra mile, digging into the depths of the conversational AI technology and testing the models and translation engine. Most teams managed to complete the tasks and look ready to take on the next challenge: building the proposed solution in Part 2, during our onsite challenge at CIUK on 2nd December from 9am to 12pm.
Proposed solutions by the teams
Winners of the online challenge will be announced on 10th November 2022.
Stay tuned and good luck!
Acknowledgements and thanks;