By Vibin Vijay, Product Specialist AI/ML - OCF
Supercomputing has come a long way since its beginnings in the 1960s. Initially, many supercomputers were based on mainframes; however, their cost and complexity were significant barriers to entry for many institutions.
The idea of utilising multiple low-cost PCs over a network to provide a cost-effective form of parallel computing led research institutions along the path of HPC clusters, starting with "Beowulf" clusters in the 1990s.
Beowulf clusters are very much the predecessors of today's HPC clusters. The fundamentals of the Beowulf architecture are still relevant to modern-day HPC deployments; however, the multiple desktop PCs have been replaced with purpose-built, high-density server platforms.
Networking has improved significantly with high-bandwidth, low-latency InfiniBand (or, as a nod to the past, increasingly Ethernet), and high-performance parallel filesystems such as Spectrum Scale, Lustre and BeeGFS have been developed to allow storage to keep up with the compute.
The growth of excellent, often open source, tools for managing high-performance distributed computing has also made adoption much easier.
We've also recently witnessed the advancement of HPC from the original, CPU-based clusters to systems that do the bulk of their processing on GPUs, resulting in the growth of GPU-accelerated computing.
While HPC was scaling up with more compute resources, data was growing at a far faster pace. Since around 2010, there has been an explosion in unstructured data from sources like webchats, cameras, sensors, video communications and so on.
This has presented major challenges for storing, processing and transferring data. Newer technology paradigms such as big data, parallel computing, cloud computing, the Internet of Things (IoT) and artificial intelligence (AI) came into the mainstream to cope with the data onslaught.
What these paradigms all have in common is that they can be parallelised to a high degree. GPU-based parallel computing in HPC has been a real game changer for AI, because it allows all of this data to be processed in a short amount of time.
As workloads have grown, so too has the use of GPU parallel computing for AI and machine learning. Image analysis is a good example of how the power of GPU computing can support an AI project: an imaging deep learning model that takes 72 hours to process on one GPU takes only 20 minutes to run on an HPC cluster with 64 GPUs.
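To give a flavour of how that scaling is achieved in practice, here is a minimal sketch of multi-GPU, data-parallel training using PyTorch's DistributedDataParallel. The model, the random "image" dataset and the environment variables are illustrative assumptions rather than part of the example above; the point is simply that each GPU trains on its own shard of the data and gradients are averaged across all GPUs.

```python
# Minimal sketch of multi-GPU data-parallel training (assumed setup: one
# process per GPU, launched by torchrun or the cluster scheduler, which
# sets RANK, WORLD_SIZE and LOCAL_RANK). Model and data are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # NCCL handles GPU-to-GPU communication across the cluster.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy "image" dataset: 10,000 random 3x64x64 tensors, 10 classes.
    images = torch.randn(10_000, 3, 64, 64)
    labels = torch.randint(0, 10, (10_000,))
    dataset = TensorDataset(images, labels)

    # DistributedSampler gives each GPU a distinct shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Small convolutional classifier wrapped in DistributedDataParallel,
    # which averages gradients across all GPUs on every step.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 10),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimiser.zero_grad()
            loss_fn(model(x), y).backward()  # gradients all-reduced here
            optimiser.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=<gpus> train.py` on each node, the same script runs unchanged whether the cluster allocates one GPU or 64; only the wall-clock time changes.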
The Beowulf principle is still relevant to AI workloads. Storage, networking and processing all matter when AI projects run at scale, and this is where AI can make use of the large-scale, parallel environments that GPU-equipped HPC infrastructure provides to process workloads quickly.
Training an AI model takes far more time than testing one. The value of coupling AI with HPC is that it significantly speeds up this training stage, boosting the accuracy and reliability of the resulting models whilst keeping training time to a minimum.
The right software is needed to support the combination of HPC and AI. Many traditional HPC products and applications are already being used to run AI workloads, since both share the same requirement to aggregate large pools of resources and manage them.
However, everything from the underlying hardware to the schedulers, the Message Passing Interface (MPI) libraries and even the way software is packaged is beginning to shift towards more flexible models, and the rise of hybrid environments is a trend we expect to continue.
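To make the MPI side of this concrete, the sketch below (an assumed example, not a specific OCF deployment) uses the mpi4py bindings to scatter chunks of a dataset across ranks, process each chunk in parallel and gather the results. In practice a scheduler such as Slurm would launch one such process per core or per node.

```python
# Minimal MPI sketch using mpi4py: rank 0 splits the work, every rank
# processes its own chunk, and partial results are combined at the end.
# The per-chunk "work" is a stand-in for any data-parallel step, such as
# preprocessing images before they are fed to a training job.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Pretend dataset: one million values split into equal chunks.
    data = np.arange(1_000_000, dtype=np.float64)
    chunks = np.array_split(data, size)
else:
    chunks = None

# Each rank receives its own chunk of the data.
chunk = comm.scatter(chunks, root=0)

# Stand-in for per-rank computation.
partial = np.sqrt(chunk).sum()

# Combine the partial results on rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(f"Combined result across {size} ranks: {total:.2f}")
```

Run with, for example, `mpirun -np 4 python preprocess.py`, or submitted through the cluster scheduler, the same script spreads its work across however many ranks it is given, which is exactly the kind of flexibility AI workloads are now borrowing from traditional HPC.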
Join me in part 2 of this blog to learn more about how HPC is supporting AI and, conversely, how AI can help with traditional HPC problems. If you'd like more information on how OCF can help you with AI and HPC challenges, please get in touch here.