Maximising Available Budgets

Extending the lifespan of a system

As many of my colleagues in the public sector research space eagerly await more detailed news on how the recently announced “ten-year budgets” unveiled on 19th May will affect their organisations, I thought it would be worth sharing how OCF can support customers who are looking to maximise their budgets or existing assets until future funding becomes a bit clearer.

Whilst not as glamorous as supplying a brand new HPC cluster based on the very latest technologies, there is often scope to extend the useful life of existing HPC systems.

Re-platforming Existing Clusters


In what has been referred to as a “New Era of Threats”, the risk of HPC systems being targeted by bad actors has never been higher, exposing organisations to the risk of:

  • Theft of research or proprietary data
  • Unauthorised use for financial gain, such as crypto mining
  • Data loss and service disruption from attacks such as ransomware


Old, unsupported operating systems, software and file systems clearly present a risk.

The OCF team is able to take any existing HPC cluster (OCF supplied or not) and “re-platform” it with the latest OCF SteelStack or NVIDIA Base Command Manager (previously Bright Cluster Manager), along with a supported Red Hat/Rocky 9.x operating system. SteelStack is our in-house toolkit that allows HPC administrators to manage their HPC estate, including hardware, networking, operating system, scheduler/workload manager, applications and programming tools.

This service has been used by a number of our customers to migrate ageing but serviceable clusters away from CentOS Linux 7, which reached its end of life on June 30, 2024 and therefore no longer receives security updates or bug fixes.

Typically, a re-platforming project will also look to:

  • Upgrade “core” components such as head nodes with new, supported, hardware (if necessary)
  • Take a balanced approach to adding vendor or 3rd party support to more critical components
  • Upgrade the software on the different nodes to the latest versions, or repurpose/re-provision them for new tasks
  • Bring the system into a “good” state by reviewing failed hardware, building up a number of production nodes from a pool of failed hardware and/or purchasing replacement components
  • Consolidate multiple clusters into a single environment, upgrading all clusters at the same time and reducing the ongoing management overhead
  • Add an OCF Front-line support or Managed Service contract for peace of mind

 

If you are interested in whether we could help you refresh the software environment on your cluster, please feel free to get in touch with me or your account manager.

Support & Spares


OCF is able to offer support for systems coming to the end of their contracts or out of contract, both directly with the manufacturers and through our trusted network of 3rd party providers. We can also source spare and replacement components for hard-to-find HPC-specific items, such as older InfiniBand switches and cables.

OCF has extensive experience in typical HPC/research computing technologies including cluster management, schedulers, InfiniBand and parallel file systems.

We can draw on this experience and advise how best to move forward with components and software that are no longer supported. We are always happy to discuss how we can best offer support, whether that’s to swap in an alternative technology, utilise a 3rd party support organisation or agree reasonable best endeavours terms.

Professional Services


In addition to larger and more complex “re-platforming” projects, the OCF team can carry out smaller tasks on a project-by-project basis. Need your parallel file system upgrading to the latest version? Have a long list of end user applications to install? Need to move your hardware from one site to another? Get in touch with the team!

Making the most of what you have


When all your hardware appears to be fully utilised, it can feel as though the only option is to add more compute capacity, when in fact there will almost certainly be a percentage of failed and/or poorly optimised jobs leaving potential unused in your existing infrastructure:

  • Understanding your workloads
    • OCF works closely with UCIT “OKA” Software, which together with some consultancy can help our customers identify “atypical” user behaviours. For example, did you spot that novice user submitting bursts of jobs in the last 2 days? Or that user with fewer than 10% of their jobs ending correctly? Do you have a high proportion of failed/cancelled/timeout jobs?

  • Optimisation & Orchestration
    • GPUs are an extremely expensive resource. Technologies like NVIDIA Run:ai can help you maximise their utilisation by enabling GPUs to be sliced up and shared on the fly, and even overprovisioned, which could be extremely useful in teaching and learning scenarios.

  • Storage Management
    • Storage, again, is a valuable resource – are your users really only using your scratch storage for scratch? Is duplicated/low value/old data filling up your expensive “Tier 0” storage?
    • Are you getting a big bill for pushing and pulling data back and forth from the public cloud? Products like Starfish can not only give you insight into the data you are storing, but can also help you manage multiple tiers. Maybe you can save money (and, as a bonus, reduce your carbon footprint) by tiering data to an on-site tape library?
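As a rough illustration of the workload questions above, the sketch below computes, per user, the share of jobs that did not end correctly. It is not how OKA works; it simply assumes you can export (user, state) pairs from your scheduler's accounting data. The state names, the `failure_rates` and `flag_users` helpers, and the thresholds are all assumptions made for the example.

```python
from collections import Counter

# States treated as "did not end correctly". The names mirror common
# scheduler accounting output, but they are an assumption for this sketch.
UNSUCCESSFUL = {"FAILED", "CANCELLED", "TIMEOUT"}


def failure_rates(jobs):
    """Return {user: (total_jobs, fraction_unsuccessful)} from (user, state) pairs."""
    totals = Counter()
    bad = Counter()
    for user, state in jobs:
        totals[user] += 1
        if state.upper() in UNSUCCESSFUL:
            bad[user] += 1
    return {user: (n, bad[user] / n) for user, n in totals.items()}


def flag_users(jobs, threshold=0.9, min_jobs=10):
    """List users with at least min_jobs jobs whose unsuccessful share is >= threshold."""
    rates = failure_rates(jobs)
    return sorted(
        user for user, (n, frac) in rates.items()
        if n >= min_jobs and frac >= threshold
    )


if __name__ == "__main__":
    # Hypothetical records; real ones would come from your accounting data.
    sample = [("alice", "COMPLETED")] * 9 + [("alice", "FAILED")]
    sample += [("bob", "TIMEOUT")] * 19 + [("bob", "COMPLETED")]
    for user, (n, frac) in sorted(failure_rates(sample).items()):
        print(f"{user}: {n} jobs, {frac:.0%} unsuccessful")
    print("Worth a conversation:", flag_users(sample))
```

A threshold of 0.9 corresponds to the "fewer than 10% of jobs ending correctly" case. This is only a first pass over raw job states; the dedicated tools listed above go considerably further.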


Whilst the products mentioned above are commercial pieces of software, the vendors offer generous public sector discounts. I am confident that they will very quickly pay for themselves (we can even model it with you), allowing you to release unused capacity, whether that’s CPU compute, GPUs or flash storage, potentially without having to purchase additional hardware requiring space, power and cooling.

As an added bonus, the insight OKA and Starfish give you will almost certainly help you make an informed purchasing decision when you are looking for your next HPC cluster or storage solution, maximising your budget next time round.
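To get a first feel for the storage question raised earlier (how much old data is sitting on expensive storage), a minimal sketch like the one below can be run against a single directory tree. It is not how Starfish works, and walking a large parallel file system this way will be slow. The `cold_data_summary` helper, the use of modification time as the measure of age, and the 90-day cut-off are all assumptions chosen for illustration.

```python
import os
import stat
import sys
import time


def cold_data_summary(root, max_age_days=90, now=None):
    """Return (total_bytes, cold_bytes) for regular files under root.

    A file counts as "cold" if it was last modified more than max_age_days
    ago. The 90-day default is an arbitrary illustrative choice.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    total = cold = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat so symlinks are not followed
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets and other special files
            total += st.st_size
            if st.st_mtime < cutoff:
                cold += st.st_size
    return total, cold


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total, cold = cold_data_summary(root)
    share = cold / total if total else 0.0
    print(f"{cold} of {total} bytes ({share:.0%}) not modified in 90 days")
```

Pointing this at a scratch area gives a rough indication of whether it really is being used as scratch, which is a useful number to bring to a tiering conversation.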

Financing

Not something we generally talk to our public sector clients about, but it’s probably worth mentioning that OCF is able to offer a number of different financing tools from our partners:

  • Financial Services:
    • Leasing
    • Hire Purchase
    • Deferred Payment
    • Infrastructure-as-a-Service, e.g. Lenovo TruScale, a “cloud-like” purchasing model for on-prem hardware


In the commercial space, we are certainly seeing interest in Leasing and Infrastructure-as-a-Service as our customers look to compare TCO, together with the cashflow implications, of buying and deploying on-prem versus using the public cloud. And you never know, a deferred payment model might help bring a desperately needed upgrade forward a few months…

In Conclusion


The above is only a snapshot of the solutions and services we can offer for midlife and extended-life clusters. Of course we love deploying shiny brand-new environments with the very latest technologies, but this is hopefully a reminder that we are also still here if you just want us to spend a couple of days helping update a handful of user applications on an environment you’re hoping to replace in the near future, whether we initially supplied that cluster or not.