Simplifying the use of clustered compute resources for AI

By Vibin Vijay | Product Specialist - AI / ML 

The challenge of making AI scale

Current advancements in artificial intelligence (AI) are underpinned by three driving factors:

  • The increased sophistication and size of Neural Networks means that the most complex of tasks can be solved, leading to breakthroughs as the algorithms can recognise hidden patterns and correlations in raw data, cluster and classify it, and – over time – continuously learn and improve.
  • A tsunami of data generated by, for example, the mass adoption of smart devices and wearables, passive sensors and the digitalisation of data produced by a variety of industries drives the need for AI.
  • The availability of powerful compute accelerators, such as GPUs, offers unprecedented compute density and increases performance 2-3 times with each new generation.

However, these factors are both the solution to and the root cause of different challenges in AI. Since AI progress took off in 2012, Neural Networks have grown at a dramatic pace: in less than 10 years, model sizes have increased by several orders of magnitude. This is because there is a relationship between the size of a Neural Network and its performance, whereby bigger models not only perform better but also learn faster.

For use case applications, this implies that building an early proof of concept (PoC) or a demo prototype with a medium-scale dataset, a small model and little compute resource is easy. But going beyond a PoC and increasing model performance by 10% can require the dataset and the model to be an order of magnitude or more larger, which raises the compute requirement from one or two GPUs to a dense GPU server or a cluster of them.

Scaling a project using datacentre-grade resources is not that simple and can bring several challenges, such as:

  • Optimising the utilisation of resources without compromising the freedom of the users to experiment.
  • Sizing the required resources correctly.
  • Sharing resources efficiently and keeping the working environment up to date with the latest versions of frameworks and scientific libraries.
  • Providing fast interconnection between nodes for data-intensive read/write operations.

Despite the investment and commitment from leadership teams, many organisations are still struggling to unleash the full potential of AI. The 2020 State of Enterprise Machine Learning report from Algorithmia found that 55 percent of companies had yet to deploy a model. One of the barriers to using machine learning (ML) models in production is the long process and timeline involved in their development and deployment.

Where AI meets HPC

A vast array of methodologies from AI has been proposed in recent years and is currently being explored in the field of High Performance Computing (HPC). A critical intersection between AI and HPC is coming into focus, showing great promise across a range of applications, including physics, linguistics, weather prediction, genomics sequencing and global climate modelling.

Large and complex volumes of data are pushing HPC practitioners to augment their traditional methods with AI techniques, while data scientists can benefit immensely from HPC systems that scale to a massive degree.

Apart from similar hardware infrastructure, such as GPU-accelerated, networked compute nodes connected to large storage, the domains of HPC and AI are distinctly different in toolsets, management, orchestration and development frameworks.

Traditional HPC stacks, including a workload manager and job scheduler that work with most Linux distributions, are a suitable choice for ML and deep learning (DL) workloads. They enable the rapid dispatch of large numbers of tasks in parallel, allow ML frameworks to scale to tens of thousands of cores, and let traditional HPC and AI workloads coexist and intersect within the same infrastructure.
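To make the idea of dispatching a DL job through a workload manager more concrete, here is a minimal sketch in Python that renders a batch script for Slurm, the scheduler discussed in this article. It is an illustration under stated assumptions, not a recommended configuration: the job name, node and GPU counts, time limit and the `train.py` command are all hypothetical, and the helper `render_sbatch_script` is invented for this example. Only standard `sbatch` directives are used.

```python
def render_sbatch_script(job_name, nodes, gpus_per_node, time_limit, command):
    """Return the text of a Slurm batch script for a multi-node GPU job.

    All argument values are supplied by the caller; nothing here is
    specific to any particular cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # Assumption: one task (process) per GPU on each node.
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        # Generic resource request: GPUs per node.
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun launches the command once per allocated task.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values: 2 nodes with 4 GPUs each, 2-hour limit.
script = render_sbatch_script(
    job_name="dl-train",
    nodes=2,
    gpus_per_node=4,
    time_limit="02:00:00",
    command="python train.py",
)
print(script)
```

The rendered text could be saved to a file and handed to `sbatch`; the scheduler then decides when and where the job runs alongside other HPC and AI workloads.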

Enterprises and research labs have invested hundreds of millions of dollars in building Slurm-based HPC infrastructures and related software, and they are expanding rapidly into using DL to solve business and product problems.

Even if your team has already invested time and effort in building DL capabilities on Slurm, achieving a reliable and efficient ML platform is often a long and cumbersome process, with the user experience prone to becoming disjointed at every turn. Enterprises are therefore looking for solutions that support AI model development on top of their existing HPC infrastructure.

Introducing Lenovo LiCO 

Lenovo's LiCO is a software platform providing a set of templates that aim to make AI training and inference simpler, more accessible and faster to implement on Slurm-based infrastructure. The accelerated AI templates differ from the other templates in LiCO in that they do not require the user to supply a program; instead, users can take pre-installed, state-of-the-art AI models and train them with a labelled dataset.

LiCO also provides three templates based on Intel oneAPI (Intel MPI, Intel MPITune and Intel OpenMP) that are optimised to run on Intel processors. Job templates for specific AI frameworks let customers take advantage of popular frameworks such as Caffe, Intel-Caffe, TensorFlow, MXNet, Neon, Chainer, PyTorch and scikit-learn. LiCO also provides the ability to combine multiple job submissions into a single execution action, called a Workflow.
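The idea of grouping several job submissions into one action can be illustrated with plain Slurm job dependencies. The sketch below is not LiCO's implementation or API; it is a generic Python example that chains batch scripts with `sbatch --parsable` and `--dependency=afterok`, so each step starts only after the previous one succeeds. The function names and the script file names in the usage note are hypothetical.

```python
import subprocess


def sbatch_submit(script_path, depends_on=None):
    """Submit one batch script with sbatch and return its job ID as a string."""
    cmd = ["sbatch", "--parsable"]
    if depends_on is not None:
        # Start this job only if the given job finished successfully.
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script_path)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


def submit_chain(script_paths, submit=sbatch_submit):
    """Submit scripts in order, each depending on the one before it.

    `submit` is injectable so the chaining logic can be exercised
    without a Slurm cluster.
    """
    job_ids = []
    previous = None
    for path in script_paths:
        previous = submit(path, depends_on=previous)
        job_ids.append(previous)
    return job_ids
```

On a cluster, a call such as `submit_chain(["prep.sh", "train.sh", "eval.sh"])` would queue three dependent jobs in a single step, assuming those hypothetical scripts exist.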

By working with a trusted technology partner like OCF, which is focused on technical and research computing, you can support your organisation through its AI journey by integrating the various pieces of the puzzle, from hardware to software and from edge to core. We have a team of dedicated engineers who specialise in AI compute, storage and applications.

We can assist you in AI adoption no matter where you are on this journey. You can email me to find out more.