What Does a Big Storage Future Look Like in HPC?

Some of you may remember Samsung announcing its 30.72TB drives last year, positioning them as enterprise SSDs. Alongside their huge capacity, these drives offer roughly four times the read and three times the write performance of the company’s consumer SSDs.

But at a price point of between $10,000 and $20,000, who would actually use them?

Is bigger better in storage?

Clearly these drives are targeted at organisations with significant budgets, so how do you keep taking best advantage of the largest-capacity drives while staying on budget?

Bigger is better in the storage industry – we always want more of it. Many organisations choose larger drives and rely mostly on traditional hard disk drives because of the cost implications; the alternative, SSDs, is both more expensive and more limited in capacity.

Even though it is great to have 30TB of capacity in a single drive, what is often ignored is how that capacity affects performance. If a customer requests a certain amount of storage capacity but also needs performance above a particular rate, we have to bear in mind that most traditional hard drives top out at around 300MB/s.

As you put bigger and bigger drives into a system, you reduce the number of drives required to meet the capacity target. Fewer drives means less aggregate throughput, so you can end up buying more capacity than you actually need just to reach the desired performance figure.
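To make this concrete, here is a back-of-the-envelope sketch in Python. The 300MB/s per-drive figure comes from above; the 1PB capacity and 20GB/s throughput targets are purely hypothetical, and RAID and filesystem overheads are ignored.

```python
import math

# Rough sizing sketch: how many drives are needed to satisfy both a
# capacity target and a throughput target? All figures are illustrative
# assumptions, not measurements.

HDD_CAPACITY_TB = 30        # large-capacity drive
HDD_THROUGHPUT_MBS = 300    # typical peak throughput of a traditional HDD

capacity_target_tb = 1000      # hypothetical requirement: 1PB usable
throughput_target_mbs = 20000  # hypothetical requirement: 20GB/s aggregate

drives_for_capacity = math.ceil(capacity_target_tb / HDD_CAPACITY_TB)
drives_for_throughput = math.ceil(throughput_target_mbs / HDD_THROUGHPUT_MBS)
drives_needed = max(drives_for_capacity, drives_for_throughput)

print(f"Drives for capacity alone:   {drives_for_capacity}")    # 34
print(f"Drives for throughput alone: {drives_for_throughput}")  # 67
print(f"Capacity actually purchased: {drives_needed * HDD_CAPACITY_TB} TB")  # 2010 TB
```

In this made-up case the throughput target, not the capacity target, dictates the drive count, and you end up buying roughly twice the capacity you asked for.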

People often fail to acknowledge that the larger a drive becomes, the more data is at risk should it fail. This is true whether you use SSDs, tape or hard disk drives: a single failure can take out all the data on that drive.

Challenge of recovery time

Traditional RAID technologies haven’t really moved on since the 1980s when they were first developed.

Many industries still use RAID 6, which tolerates two disk failures within a RAID set before any data is lost.

With traditional RAID, failure rates and rebuild times limit how many drives you can sensibly place in a RAID group, and the rebuild itself is limited by the speed at which the missing drive and its data can be reconstructed.

As drive capacities continue to grow rapidly, rebuilds will take much longer still. It already takes days to rebuild drives at the capacities we have today.

With drives of around 30TB, it could take over a week to reconstruct a single failed drive. Such a long recovery window increases the risk of another drive in the RAID group failing before the rebuild completes.
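As a rough illustration (a sketch only; actual rebuild times depend on the controller, the RAID level and how heavily the rebuild is throttled by ongoing user I/O), you can estimate the rebuild window from the drive capacity and an assumed sustained rebuild rate:

```python
# Rough rebuild-time estimate for a traditional RAID group, where the
# rebuild is bottlenecked by the single replacement drive being written.
# Rates are illustrative; real rebuilds are usually slowed further by
# user I/O running alongside them.

def rebuild_hours(capacity_tb: float, rebuild_rate_mbs: float) -> float:
    capacity_mb = capacity_tb * 1_000_000   # TB -> MB (decimal units)
    return capacity_mb / rebuild_rate_mbs / 3600

for capacity_tb in (8, 16, 30.72):
    for rate in (300, 100, 50):             # full speed vs throttled rebuild
        days = rebuild_hours(capacity_tb, rate) / 24
        print(f"{capacity_tb:>6} TB at {rate:>3} MB/s -> {days:5.1f} days")
```

At a heavily throttled 50MB/s, a 30.72TB drive works out at roughly seven days, which is where the week-plus figure comes from.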

These challenges started to be addressed a few years ago in HPC and the cloud. Rather than using traditional RAID, organisations have moved to de-clustered arrays, which place many more drives into the same pool and distribute data more widely across them.

This lessens the impact of a drive failure, so only a proportion of the data is affected rather than its entirety. It also allows part of the missing data to be rebuilt before a drive fails completely, and all drives in the pool participate in the reconstruction when a single drive does fail.
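The benefit is easiest to see with a simplified model. The per-drive rebuild rate and pool sizes below are assumptions, and the model ignores network, spare capacity and placement overheads, so treat the numbers as indicative only:

```python
# Simplified comparison of rebuild time: traditional RAID (one spare drive
# absorbs all rebuild writes) versus a de-clustered array (the failed
# drive's data is spread across the pool, so every surviving drive
# reads and writes a small share in parallel).

DRIVE_TB = 30.72
PER_DRIVE_REBUILD_MBS = 50      # assumed sustained rebuild rate per drive

def hours(tb: float, mbs: float) -> float:
    return tb * 1_000_000 / mbs / 3600

# Traditional RAID: the whole drive is rewritten through one spare.
traditional = hours(DRIVE_TB, PER_DRIVE_REBUILD_MBS)

# De-clustered array: the surviving drives each rebuild a 1/(N-1) share.
for pool_size in (12, 48, 100):
    declustered = hours(DRIVE_TB / (pool_size - 1), PER_DRIVE_REBUILD_MBS)
    print(f"pool of {pool_size:>3}: traditional ~{traditional/24:4.1f} days, "
          f"de-clustered ~{declustered:4.1f} hours")
```

Even in this crude model, spreading the rebuild across a pool of drives turns a rebuild measured in days into one measured in hours.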

Convergence of compute and storage

Another noticeable shift in how storage systems are built and used is the convergence of compute and storage.

With fast network interconnects such as InfiniBand and the advent of 100 Gigabit Ethernet, it has become possible to populate individual compute nodes with large-capacity drives and have them participate in the storage subsystem.

This allows for practically linear scaling of storage and performance each time you scale your compute.
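A minimal sketch of that scaling, assuming some hypothetical per-node figures (four NVMe drives per node and a 100 Gigabit Ethernet link, chosen for illustration rather than taken from any particular system):

```python
# Back-of-the-envelope view of converged scaling: each compute node
# contributes both capacity and bandwidth, so the storage pool grows
# with the cluster. All per-node figures are assumptions.

NVME_PER_NODE = 4
NVME_CAPACITY_TB = 8            # per drive
NVME_THROUGHPUT_GBS = 3.0       # per drive, sequential read
NETWORK_GBS = 12.5              # ~100 Gigabit Ethernet per node

for nodes in (4, 16, 64):
    capacity_tb = nodes * NVME_PER_NODE * NVME_CAPACITY_TB
    # A node cannot serve remote clients faster than its network link.
    per_node_gbs = min(NVME_PER_NODE * NVME_THROUGHPUT_GBS, NETWORK_GBS)
    print(f"{nodes:>3} nodes: ~{capacity_tb} TB raw, "
          f"~{nodes * per_node_gbs:.0f} GB/s aggregate")
```

The min() reflects that a node can only serve remote readers as fast as its network link, which is why the interconnects mentioned above matter as much as the drives themselves.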

In traditional HPC this newer approach hasn’t quite caught on yet, and separate storage and compute systems are still the norm.

Cloud platforms are becoming more converged in the technologies they use, so the failure of an individual component matters less. When running HPC on premises, individual components matter more, particularly on the storage side when traditional RAID is in use.

The move to 30TB drives and the resulting jump in capacity will push the HPC market to look more seriously at de-clustered arrays, which allow faster rebuild times in the event of a drive failure.

We’ve seen this recently with IBM, Lenovo and NetApp all offering their own de-clustered array products. For organisations looking for larger capacity on a budget, this will be a more realistic option.

If you would like to speak to us for further advice on storage for your HPC requirements, please get in touch here.

By Laurence Horrocks-Barlow, Lead Storage Consultant at OCF