The explosion of AI over the last couple of years has been nothing short of phenomenal. As a software engineer at a High-Performance Computing and AI infrastructure company, I’ve had a front-row seat to the evolution of both the hardware and software powering this new wave of intelligence.
One of the biggest advantages of my role has been access to performant infrastructure, including NVIDIA A100 GPUs, which are well suited to running large language models (LLMs) locally. With the right tools and some engineering, this access allowed us to build our own internal AI assistant tailored specifically to our workflows and security needs.
Security was the driving factor. We work with sensitive infrastructure code and architectural decisions, and by deploying everything on our own hardware we can ensure that data and code never leave the company. It also gives us greater control over how our models are deployed and used, so we can support new and interesting use cases beyond simply answering questions. Finally, an internal model gives us the freedom to tune performance, latency, and cost without worrying about API rate limits or vendor lock-in.
At the core of our internal assistant is a 7 billion parameter open-source model, deployed locally on an A100 GPU server. Whilst this GPU is a few years old now, it still offers a good balance between size, performance, and resource demands (and I'm fortunate enough to work for a company that has this kind of kit lying around!). An interesting follow-up project would be to test how a more recent GPU compares: whether it delivers more performance, or the same results at lower power. The choice of LLM itself is also still being evaluated and tested as the assistant's use cases expand.
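To make that concrete, here is a minimal sketch of how a 7B open-source model can be loaded for inference on a single A100 with `transformers`. The model name and prompt are illustrative only, not the exact ones we run:

```python
# Minimal sketch: loading a 7B open-source model on a single A100 with
# Hugging Face transformers. The model name below is illustrative, not
# necessarily the one we deploy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # half precision fits comfortably in A100 memory
    device_map="auto",          # place the weights on the available GPU
)

prompt = "Summarise the purpose of a RAG pipeline in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At this scale a 7B model in half precision uses roughly 14 GB of weights, which leaves plenty of A100 memory for longer contexts and batching.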
To make our AI system more useful, we built a Retrieval-Augmented Generation (RAG) pipeline. Instead of relying solely on the LLM’s training data, RAG lets us inject up-to-date, company-specific context into each response. This dramatically improves accuracy and relevance for engineering queries.
Behind the scenes, we index our internal engineering documentation using a vector store. When a query is made, relevant documents are retrieved and passed to the model alongside the question.
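The sketch below shows roughly what that indexing and retrieval step looks like, assuming `sentence-transformers` for embeddings and FAISS as the vector store; other embedding models and stores would work equally well, and the documents here are toy examples rather than our real documentation.

```python
# Rough sketch of the retrieval side of a RAG pipeline, assuming
# sentence-transformers for embeddings and FAISS as the vector store.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

# Index chunks of internal documentation (toy examples here).
docs = [
    "To rebuild the cluster image, run the internal build pipeline ...",
    "Onboarding checklist: request VPN access, clone the infra repo ...",
]
doc_vectors = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product on unit vectors = cosine similarity
index.add(doc_vectors)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documentation chunks most relevant to the query."""
    query_vec = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(query_vec, k)
    return [docs[i] for i in idx[0]]

context = retrieve("How do I get access when onboarding?")
```

The retrieved chunks are then placed into the prompt ahead of the user's question, so the model answers from current internal documentation rather than from its training data alone.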
In terms of software, the backend is written in Python, leveraging the Hugging Face `transformers` library and custom RAG utilities. We tuned response formatting and context length to match our use cases, while optimising for inference speed on the A100 GPU. For the frontend, we kept it simple and fast with vanilla JavaScript, HTML, and CSS: no framework overhead. Both the frontend and backend will be containerised using Docker.
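As a sketch of how those pieces fit together, the snippet below (reusing the `tokenizer`, `model`, and retrieved `context` from the earlier examples) shows one simple way to enforce a context-length budget when assembling the prompt; the budget and prompt template are illustrative rather than our exact production values.

```python
# Illustrative: combine retrieved context with the question under a token
# budget, then generate. Reuses tokenizer, model, and context from the
# earlier sketches; MAX_CONTEXT_TOKENS is an assumed budget.
MAX_CONTEXT_TOKENS = 2048

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Concatenate retrieved chunks until the token budget is reached."""
    kept, used = [], 0
    for chunk in context_chunks:
        n_tokens = len(tokenizer(chunk)["input_ids"])
        if used + n_tokens > MAX_CONTEXT_TOKENS:
            break
        kept.append(chunk)
        used += n_tokens
    return (
        "Answer the question using the internal documentation below.\n\n"
        "Documentation:\n" + "\n---\n".join(kept) + "\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("How do I get access when onboarding?", context)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```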
Though still a prototype, we have identified several planned use cases for our internal assistant: helping engineers navigate code, accelerating onboarding, and authoring documentation, which is a common request from customers. By integrating the assistant into these workflows, we aim to boost engineering efficiency and keep internal knowledge consistent and up to date. Ultimately, this translates to faster feature delivery, better reliability, and clearer documentation for the systems our customers depend on.
Building this internal model has given us valuable insight into how large language models behave in practice. As our engineers begin using the assistant, we will continue to refine the model and the underlying infrastructure based on real-world feedback and ever-evolving use cases.