NVIDIA DGX versus HGX platforms

NVIDIA's DGX and HGX platforms both represent cutting-edge artificial intelligence (AI) infrastructure, but each is tailored to distinct requirements.

The DGX series is celebrated for its robust performance and user-friendliness. It aims to facilitate end-to-end AI development and stands out for its integrated approach, combining hardware and software to deliver a comprehensive solution that significantly reduces the time required to gain meaningful insights. 

Conversely, the HGX platform is designed to serve as a foundational component that enables manufacturers to construct bespoke AI systems. Its modular architecture allows remarkable flexibility, permitting vendors to expand or customise their systems to meet specific demands. Companies such as Lenovo, Supermicro, Fujitsu and Dell have all used this adaptability to deliver various solutions tailored to diverse industry requirements. 

NVIDIA HGX/EGX 8 GPU Servers

Presently, OEMs such as Dell (PowerEdge series), Supermicro (X13 & H13) and Gigabyte (G593 series) offer systems equipped with an 8-way NVIDIA H100 GPU configuration within the HGX framework. Lenovo will enter the market soon with its new series of air-cooled servers, such as the SR680a V3 and SR685a V3, and also offers water-cooled servers in the SR780a V3 series. These systems combine NVIDIA GPUs, NVLink interconnect technology, NVIDIA networking, and fully optimised AI and high-performance computing (HPC) software stacks to provide the highest application performance and the fastest time to insights. The advanced networking capabilities of HGX are crucial for ensuring efficient data transfer rates, a key factor in mitigating bottlenecks within HPC settings. This level of performance puts these systems on par with the NVIDIA DGX H100 in terms of computational power.

NVIDIA AI Enterprise (NVAIE)

The DGX series is complemented by a comprehensive support structure and software ecosystem, exemplified by the NVIDIA AI Enterprise (NVAIE) platform. NVAIE is a complete, cloud-native software suite that enhances data science workflows and simplifies the development and deployment of enterprise-grade generative AI applications, including copilots. The features of NVAIE can also be added to HGX systems as an optional package, allowing customisation based on each customer's use cases and needs.

Cost

While the HGX platform provides significant pricing flexibility, the DGX series is positioned at a premium, reflecting its status as the gold standard in AI infrastructure. DGX H100 offers a balance of performance and cost, delivering 32 petaflops of AI performance. In contrast, an HGX system with similar performance would be priced 30% lower than the DGX.
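The comparison above can be worked through as simple arithmetic. The sketch below is illustrative only: the 32 petaflops and 30% figures come from this article, while the function names and the normalised DGX price of 100 are hypothetical placeholders, not real quotes.

```python
# Illustrative arithmetic only. The 32 petaflops and 30% figures are taken
# from the article; the price of 100.0 is a normalised placeholder.

AI_PETAFLOPS = 32.0   # AI performance quoted for DGX H100
HGX_DISCOUNT = 0.30   # "30% lower" price quoted for a similar HGX system


def hgx_price(dgx_price: float, discount: float = HGX_DISCOUNT) -> float:
    """Price of a comparable HGX system, given a DGX price and a discount."""
    return dgx_price * (1.0 - discount)


def price_per_petaflop(price: float, petaflops: float = AI_PETAFLOPS) -> float:
    """Price divided by quoted AI performance (same units as `price`)."""
    return price / petaflops


if __name__ == "__main__":
    dgx = 100.0                      # hypothetical, normalised DGX price
    hgx = hgx_price(dgx)
    print(f"DGX: price={dgx:.1f}, per petaflop={price_per_petaflop(dgx):.4f}")
    print(f"HGX: price={hgx:.1f}, per petaflop={price_per_petaflop(hgx):.4f}")
```

Because the performance figure is assumed to be the same for both systems, the price per petaflop falls by the same 30% as the system price. Real quotes will also include software, support and integration costs that this sketch ignores.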

For the most accurate and up-to-date information and cost, consult OCF directly at info@ocf.co.uk 

Flexibility 

DGX systems offer a robust, out-of-the-box solution with high-end GPUs and a comprehensive software stack, including NVIDIA Base Command and NVAIE. This can be ideal for customers looking for a turnkey solution that minimises setup complexity. HGX, on the other hand, takes a modular approach, allowing customers to tailor their hardware and software configurations to their specific needs. This flexibility can be crucial for those who require a particular set of components for their computational tasks, or who wish to integrate the system into an existing infrastructure with preferred management and scheduling tools. OCF and our partners offer bespoke solutions and tools on HGX systems, such as OCF Steel, Run:ai and Slurm. Customers can pick and choose based on their workload and usage, giving them the flexibility to build their own software stack for AI and HPC workloads.

OCF NVIDIA - AI SUITE

OCF's distinctive approach to AI infrastructure is characterised by a comprehensive suite of partner technologies and in-house expertise, as shown in the figure below. The integration capabilities are designed to be flexible, catering to the specific needs of a bespoke AI infrastructure solution. Storage options from various vendors include Lustre for DDN, Spectrum Scale, WekaIO, and Vast, so clients can choose the storage that best fits their requirements. The infrastructure can run on NVIDIA's networking solutions, whether Ethernet or high-performance InfiniBand, allowing the scalable and efficient data transfer that is crucial for AI workloads.

OCF provides HGX and EGX solutions for compute-intensive tasks. These use the latest NVIDIA GPU architectures, such as Blackwell, Hopper, and Ada Lovelace, to accelerate complex computations. Additionally, NVIDIA's Grace CPU compute board offers a unique proposition for combined CPU/GPU workloads, optimising performance for the most demanding applications.

The management layer is integral to the operation of the AI infrastructure, with OCF's proprietary software, OCF Steel, enabling effective management and orchestration of the cluster. This management tool is already in active use by current customers, demonstrating its reliability and effectiveness. For alternative cluster management solutions, OCF also supports Run:ai and NVIDIA AI Enterprise, giving clients additional options to manage their AI infrastructure efficiently.

Beyond the technical offerings, OCF's service layer is tailored to meet customers' individual needs, encompassing installation, technical support, managed services, and consultancy. This holistic approach ensures that clients receive both state-of-the-art AI infrastructure and the ongoing support and expertise necessary to maximise their investment.

In conclusion

While DGX provides an out-of-the-box AI research and development solution, HGX offers a customisable approach that enables vendors to build specialised systems. DGX offers ease of use and streamlined deployment for research teams, whereas HGX provides flexibility and scalability for complex, enterprise-level AI infrastructure. Both platforms play a pivotal role in advancing AI technology, and each offers unique advantages depending on the application.

With over two decades of experience, OCF's consultative approach bridges the gap between client requirements and technology providers, ensuring that solutions not only meet but exceed performance expectations. Selecting a platform can be tricky at times. OCF Consulting is positioned to assist in selecting a DGX system or designing an optimised HGX platform tailored to specific needs, leveraging its vendor-neutral stance to integrate the best offerings available in the market. For up-to-date information, consult OCF directly at info@ocf.co.uk or message me.