By Vibin Vijay, AI Product Specialist at OCF
In the first part of this blog, I talked about how HPC is supporting the growth of AI, so here I will continue to expand on the further benefits.
As traditional use cases for HPC applications are so well established, change tends to happen relatively slowly; updates for many HPC applications are only necessary every 6 to 12 months.
AI development, on the other hand, is happening so fast that updates and new applications, tools and libraries are being released roughly daily.
If you employed the same update strategies to manage your AI as you do for your HPC platforms, you would get left behind.
That is why a solution like NVIDIA’s containerised DGX platform allows you to keep up to date quickly and easily with rapid developments via NVIDIA GPU Cloud (NGC), an online catalogue of AI and HPC tools packaged in easy-to-consume containers.
It is becoming standard practice within the HPC community to use a containerised platform to manage software environments, which is particularly beneficial for AI deployment; containerisation has accelerated support for AI workloads on HPC clusters.
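As an illustration, this is roughly what consuming an NGC container on a cluster can look like. It is a minimal sketch only: it assumes Singularity/Apptainer is installed on the system, and the image tag and training script name are illustrative rather than prescriptive.

```python
import subprocess

# Pull a framework container from NVIDIA GPU Cloud (NGC) into a local .sif image.
subprocess.run(
    ["singularity", "pull", "pytorch.sif",
     "docker://nvcr.io/nvidia/pytorch:24.01-py3"],   # illustrative NGC image tag
    check=True,
)

# Run a (hypothetical) training script inside the container with GPU support (--nv).
subprocess.run(
    ["singularity", "exec", "--nv", "pytorch.sif", "python", "train.py"],
    check=True,
)
```

Because the framework and its dependencies live inside the container, keeping pace with NGC’s release cadence becomes a matter of pulling a newer image rather than rebuilding the cluster’s software stack.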
AI models can be used to predict the outcome of a simulation without having to run the full, resource-intensive simulation.
By using an AI model in this way, input variables or design points of interest can be narrowed down to a candidate list quickly and at much lower cost. These candidates can then be run through the full simulation to verify the AI model’s predictions.
Quantum molecular simulation (QMS), chip design and drug discovery are areas in which this technique is increasingly being applied; IBM also recently launched a product that does exactly this, known as IBM Bayesian Optimization Accelerator (BOA).
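To make the surrogate-model idea concrete, here is a minimal sketch using a Gaussian process regressor from scikit-learn. The expensive_simulation function is a hypothetical stand-in for a real solver and the problem is deliberately tiny; it simply shows the train, screen and verify loop described above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_simulation(x):
    # Hypothetical stand-in for the full, resource-intensive simulation.
    return np.sin(3 * x[0]) + 0.5 * x[1] ** 2

rng = np.random.default_rng(0)

# 1. Run the real simulation on a small set of design points.
X_train = rng.uniform(-1, 1, size=(30, 2))
y_train = np.array([expensive_simulation(x) for x in X_train])

# 2. Fit a cheap surrogate (the AI model) to those results.
surrogate = GaussianProcessRegressor().fit(X_train, y_train)

# 3. Screen a large pool of candidate design points at negligible cost.
X_pool = rng.uniform(-1, 1, size=(10_000, 2))
predictions = surrogate.predict(X_pool)

# 4. Keep only the most promising candidates (lowest predicted values here)
#    and verify them with the full simulation.
candidates = X_pool[np.argsort(predictions)[:5]]
verified = [expensive_simulation(x) for x in candidates]
```

Products such as IBM BOA build a Bayesian optimisation loop around the same principle: spend the expensive simulation budget only on the candidates the model considers most promising.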
How can OCF help with your AI infrastructure?
Start with a few simple questions: How big is my problem? How fast do I want my results back? How much data do I have to process? How many users are sharing the resource?
HPC techniques will help you manage an AI project if the existing dataset is substantial, or if multiple users are causing contention on the infrastructure.
If you are at the point where you need to put four GPUs in a workstation and that is becoming a bottleneck, it is worth consulting OCF, as we have experience in scaling up infrastructure for these types of workloads.
If your organisation is running AI workloads on a large machine, or on multiple machines with GPUs, your AI infrastructure might look more like HPC infrastructure than you realise.
There are HPC techniques, software and management practices that can really help to run that infrastructure. The hardware looks much the same, but there are some clever ways of installing and managing it that are specifically geared towards AI modelling.
Storage is very often overlooked when organisations build infrastructure for AI workloads, and you may not be getting the full ROI on your AI infrastructure if your compute sits idle waiting on your storage.
We can provide the best advice for sizing and deploying the right storage solution for your cluster.
Big data doesn’t necessarily need to be that big; it is simply data that has reached the point of becoming unmanageable for an organisation.
When you can no longer get out of it what you need, it has become too big for you. HPC can provide the compute power to deal with the large volumes of data in AI workloads.
It is an exciting time for both HPC and AI, as each technology is steadily adapting to the other.
The challenges are getting bigger every day, with new and distinct problems that need faster solutions: countering cyber-attacks, discovering new vaccines, detecting enemy missiles and so on.
It will be interesting to see what happens next in terms of adopting fully containerised environments on HPC clusters, with technologies such as Singularity and Kubernetes.
Schedulers today initiate jobs and wait until they finish, which may not be an ideal scenario for AI environments.
More recently, newer schedulers monitor real-time performance and execute jobs based on priority and runtime, and they will be able to work alongside containerisation technologies and environments such as Kubernetes to orchestrate the resources needed, as in the sketch below.
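As a rough illustration of that orchestration, here is a minimal sketch that submits a containerised GPU training job to Kubernetes using the official Python client. It assumes a cluster with the NVIDIA device plugin installed; the image tag, job name and training command are illustrative assumptions, not a prescribed setup.

```python
from kubernetes import client, config

def submit_gpu_training_job(namespace="default"):
    # Load cluster credentials from the local kubeconfig.
    config.load_kube_config()

    container = client.V1Container(
        name="train",
        image="nvcr.io/nvidia/pytorch:24.01-py3",   # illustrative NGC image tag
        command=["python", "train.py"],             # hypothetical training script
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}          # ask the scheduler for one GPU
        ),
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="ai-training-job"),
        spec=client.V1JobSpec(template=template, backoff_limit=0),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

submit_gpu_training_job()
```

The scheduler, rather than the user, then decides where the container runs and when the GPU is allocated, which is the kind of resource orchestration described above.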
Storage will become increasingly important to support large deployments, as vast volumes of data need to be stored, classified, labelled, cleansed, and moved around quickly.
Infrastructure such as flash storage and fast networking becomes vital to your project, alongside storage software that can scale with demand.
HPC and AI will continue to have an impact on organisations and on each other, and their symbiotic relationship will only grow stronger as traditional HPC users and AI modellers realise each other’s full potential.
If you’d like to get in touch to speak to someone from OCF about how we can help you with your AI and HPC environments, please click here.