The Future of HPC Sustainability – Part I

by Mischa van Kesteren, pre-sales engineer and sustainability officer at OCF

It is an unavoidable fact: HPC is an energy-intensive environment. However, I’d like to discuss what steps we can take to address this and achieve a more sustainable future for our HPC customers in universities, research centres and other organisations.

Primary considerations include getting the most out of the power you use and switching to renewable electricity sources. There are also options for offsetting the power your organisation consumes to make your operations more sustainable and ‘green’.

Organisations like Plan Vivo can help verify that the carbon offset programme you choose to work with really is sustainable. The important thing to remember with offsetting is that it is best used as a tool to incentivise the reduction of resource consumption.

Offsetting helps to internalise the environmental cost of consuming power; however, for it to be effective, the additional cost must be passed on to the people consuming that power. So, if HPC users are currently billed or allocated resources based on core hours or job run time, they won’t see the additional power cost as clearly.

Switching to a resource allocation scheme based on power consumption would be more effective. Tools such as the EAR energy management framework can provide per-job energy consumption accounting through integration with Intel CPUs and SLURM’s accounting functionality.
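
As a rough illustration of what power-based charging could look like, here is a minimal sketch that pulls per-job energy figures out of SLURM’s sacct and converts them into a notional charge. It assumes an energy-gathering plugin (for example, RAPL-based acct_gather_energy) is configured so that sacct actually reports ConsumedEnergyRaw, and the charge rate is a hypothetical placeholder.

```python
#!/usr/bin/env python3
"""Sketch: per-job, energy-based charging from SLURM's sacct output.

Assumes an energy accounting plugin (e.g. acct_gather_energy/rapl) is
configured so sacct reports ConsumedEnergyRaw in joules. The charge
rate is purely illustrative.
"""
import subprocess

PRICE_PER_KWH = 0.30  # hypothetical internal charge rate, GBP per kWh


def jobs_energy(start="2024-01-01"):
    """Return (job_id, user, joules) tuples for jobs since `start`."""
    out = subprocess.run(
        ["sacct", "--allusers", "--starttime", start, "--parsable2",
         "--noheader", "--format=JobID,User,ConsumedEnergyRaw"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        job_id, user, raw = line.split("|")
        if user and raw.isdigit():  # skip job steps and jobs with no energy data
            rows.append((job_id, user, int(raw)))
    return rows


def charge(joules):
    """Convert joules to a notional energy-based charge."""
    kwh = joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * PRICE_PER_KWH


if __name__ == "__main__":
    for job_id, user, joules in jobs_energy():
        print(f"{job_id:>12} {user:<10} {joules / 3.6e6:8.2f} kWh  £{charge(joules):.2f}")
```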

When building a new HPC system, it is important to understand what your workload is going to look like. Energy efficiency ultimately comes down to the level of utilisation within the cluster, but some architectures are certainly more energy-efficient than others.

Generally, higher core count, lower clock speed processors tend to provide greater raw compute performance per watt, but you will need an application that parallelises well and can use all of those hundreds of cores at once.

If your application doesn’t parallelise well, or if it needs higher-frequency processors, then the best thing you can do is pick the right processor and the right number of them, so you are not wasting power on CPU cycles that are not being used. When cycles are idle, the CPUs should be configured to downclock to save power.
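
As a minimal sketch of that idea, the snippet below switches every core on a Linux node to the ‘powersave’ cpufreq governor via the standard sysfs interface; a scheduler prolog could switch them back to ‘performance’ before a job starts. It assumes the nodes expose the cpufreq sysfs files and that it runs as root, and governor names vary by driver, so treat it as a starting point rather than a recipe.

```python
#!/usr/bin/env python3
"""Sketch: drop idle compute nodes to a power-saving CPU governor.

Assumes a Linux node exposing the cpufreq sysfs interface and a kernel
offering the 'powersave' and 'performance' governors; needs root to
write the sysfs files.
"""
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def set_governor(governor: str) -> None:
    """Write the requested governor to every CPU's scaling_governor file."""
    for gov_file in CPU_SYSFS.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov_file.write_text(governor + "\n")


if __name__ == "__main__":
    # Called when the node has no work queued (e.g. from an epilog or node
    # health script); the inverse, set_governor("performance"), could run
    # in the prolog before a job starts.
    set_governor("powersave")
```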

As many of you know already, HPC managers are assessed on how satisfied users are with their service, so many will artificially force all of the processors and nodes to run continually at 100 percent clock speed, ensuring the processor is never put into a dormant state or allowed to reduce its frequency. Ultimately, energy consumption just isn’t treated as a major concern.

When customers come to us wanting to improve the energy efficiency of their current estate, we look at the features in the scheduling software that can power off compute nodes, or at least put them into a dormant state if the processor supports that technology. We check whether these features are enabled and whether the customer is making the most of them.
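
As a quick way of checking this on a SLURM cluster, the sketch below reads the node power-saving settings out of scontrol show config and flags any that are not enabled. It only reports what is configured (parameter names and defaults vary between SLURM versions) and changes nothing.

```python
#!/usr/bin/env python3
"""Sketch: check whether SLURM's node power-saving hooks are configured.

Assumes `scontrol` is on the PATH on a SLURM cluster; read the output
as a hint only, since defaults differ between versions.
"""
import subprocess

POWER_KEYS = ("SuspendTime", "SuspendProgram", "ResumeProgram", "SuspendTimeout")


def power_saving_settings():
    """Return the power-saving related entries from the SLURM config."""
    out = subprocess.run(["scontrol", "show", "config"],
                         capture_output=True, text=True, check=True).stdout
    settings = {}
    for line in out.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            if key in POWER_KEYS:
                settings[key] = value
    return settings


if __name__ == "__main__":
    for key, value in power_saving_settings().items():
        enabled = value not in ("", "NONE", "(null)")
        print(f"{key:>16}: {value or '(unset)'}  {'OK' if enabled else 'not enabled'}")
```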

However, some older clusters do not support these features and generally deliver far less performance per watt than today’s technologies. For those systems, it is often worth replacing a 200-node cluster that is 10 years old with something that is perhaps 10 times smaller and provides just as much in terms of computing resource.

You can make a reasonable total cost of ownership (TCO) argument for ripping out and replacing that entire old system; in some cases that will actually save money (and resources) over the next three to five years.
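
To make that argument concrete, here is a back-of-the-envelope sketch comparing the five-year electricity cost of an old 200-node system with a much smaller replacement. Every figure in it (node power draw, electricity price, PUE, purchase cost) is a hypothetical placeholder, so substitute your own numbers before drawing conclusions.

```python
#!/usr/bin/env python3
"""Sketch: back-of-the-envelope TCO comparison for rip-and-replace.

All figures below are hypothetical placeholders; cooling overhead is
folded in via a PUE multiplier.
"""

HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.30   # GBP per kWh, hypothetical
PUE = 1.5              # power usage effectiveness, hypothetical
YEARS = 5


def energy_cost(nodes: int, watts_per_node: float) -> float:
    """Electricity cost over the comparison period, including cooling."""
    kwh = nodes * watts_per_node / 1000 * HOURS_PER_YEAR * YEARS * PUE
    return kwh * PRICE_PER_KWH


old_cluster = energy_cost(nodes=200, watts_per_node=400)
new_cluster = energy_cost(nodes=20, watts_per_node=600) + 300_000  # purchase cost, hypothetical

print(f"Old cluster, 5-year power bill:        £{old_cluster:,.0f}")
print(f"New cluster, 5-year power + purchase:  £{new_cluster:,.0f}")
print(f"Saving from replacement:               £{old_cluster - new_cluster:,.0f}")
```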

In my next post, I’ll explain more about cloud bursting as another useful approach to sustainable HPC. If you’d like more information on how OCF can help you with HPC sustainability, please get in touch here.