Generative AI On-premises
Given that AI is compute-intensive, it is often assumed that its use can only be handled in the cloud, that it is dependent on big data centres, and that it is provided by just a handful of companies. This does not always have to be the case. In fact, in some use cases, fairly little is needed to achieve satisfying results.
As the discussion around the use of AI in education continues to grow, certain topics which are not primarily didactic in nature are beginning to overlap and gain urgency. Hardware and software security, independence, and data privacy have become more relevant than ever since the emergence of cloud computing.
Here at LLInC, we’re piloting our FET (Formative Evaluation of Teaching) project and the results so far look promising. Without going into detail (feel welcome to read more here), this project aims to process written text into a summarised report using LLMs. For this, we currently use GPT-4.5 in an enclosed Microsoft Azure environment.
Image credits to ilgmyzin.
Although our process has been deemed safe and controlled, it remains undesirable to send data over the internet and involve third-party companies abroad when this is not strictly necessary. The primary reason for using GPT-4.5 in our case is that it is one of the most powerful LLMs available. It seemed unlikely that similar results could be achieved outside of the cloud.
Since we have computers in our LLInC studio with powerful GPUs (Graphics Processing Units) intended for work with 3D environments, we elected to use one of them, containing an Nvidia RTX 5090 GPU, to run LLMs and observe the results on FET source data (a .csv file exported by Qualtrics). This GPU runs GPT-OSS 20b¹, an open-source model resembling ChatGPT which has been made freely available by OpenAI.
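As a sketch of what this looks like in practice, the snippet below queries a locally hosted GPT-OSS 20b through an OpenAI-compatible chat endpoint, as exposed by runtimes such as Ollama. The endpoint URL, default model name, and system prompt here are illustrative assumptions, not a description of our actual pipeline.

```python
import json
import urllib.request

# Hypothetical local endpoint; Ollama, llama.cpp and similar runtimes
# expose an OpenAI-compatible chat API on a local port like this one.
LOCAL_ENDPOINT = "http://localhost:11434/v1/chat/completions"


def build_payload(model: str, system_prompt: str, user_text: str) -> dict:
    """Assemble an OpenAI-style chat request for a locally hosted model."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


def summarise_locally(text: str, model: str = "gpt-oss:20b") -> str:
    """Send one open-text answer to the local model and return its summary.

    Nothing here leaves the machine: the request goes to localhost only.
    """
    payload = build_payload(
        model, "Summarise the following course feedback.", text
    )
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the server runs on the same machine as the GPU, the survey text never crosses the network boundary.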

Image credits to Mitchell de Jong.
When handling (open text) data in this way, the output must be of high quality, free of hallucinations, and consistent. It is worth noting that achieving this consistency requires very detailed prompting as well as tweaking certain settings to suppress randomness. Moreover, if we want to extract meaning from input data, the output should be factually correct and reproducible.
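For illustration, the randomness-suppressing settings we mean look roughly like this in an OpenAI-style request. The parameter names follow the common chat-completions convention; other runtimes may spell them differently, and the specific values are an assumption rather than our production configuration.

```python
# Decoding settings that suppress sampling randomness, so that repeated
# runs over the same input yield the same report.
DETERMINISTIC_SETTINGS = {
    "temperature": 0.0,  # always pick the most likely next token
    "top_p": 1.0,        # no nucleus truncation needed at temperature 0
    "seed": 42,          # fixed seed, where the runtime supports one
}


def make_request(model: str, messages: list) -> dict:
    """Merge the fixed decoding settings into a chat request body."""
    return {"model": model, "messages": messages, **DETERMINISTIC_SETTINGS}
```

Detailed prompting handles factual grounding; these settings handle reproducibility, so that the same .csv file always produces the same summary.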
Although the FET project as a whole is still to be evaluated (hence no definitive verdict can yet be offered concerning the semantic performance), the results produced so far by GPT-4.5 appear highly promising. Interestingly and significantly, however, when we ran the same data through our locally hosted GPT-OSS, the results were similar, as was the speed with which they were produced.
Could this mean that this particular use case for generative AI does not require any special infrastructure at all? Potentially. A computer with some (albeit powerful and high-end) consumer grade hardware could, if needed, perform the task, perhaps even fully offline. In such a case, there would be no need for “the cloud”, data centres, privacy concerns, or storing data in unnecessary places. Control would also be increased since open-source models can more easily and securely be fine-tuned and fitted for specific tasks.
Although this is just a small example of what may be possible, it goes to show that generative AI use cases don’t always require dependency on big tech. It should be emphasised, however, that a solution such as this is mostly suitable for tasks which can be planned ahead and which do not require the constant availability of a model – tasks for which computing power can be shared and anticipated. Running a powerful LLM on premises that is ready for anyone to use at any time is a different story. Although not impossible, this would require considerable infrastructural investment.
All in all, running LLMs on premises might be especially interesting for staff, teachers, and researchers who want (or need) to work with very sensitive data, or who have other reasons not to use commercially available solutions. While smaller models have their limitations, a model’s capabilities can be extended through access to additional data, as well as through function calls that produce (predictable) results which inference on training data alone cannot provide.
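As a sketch of that last point: a deterministic helper can be exposed to the model as a tool, so that exact figures come from code rather than from the model’s inference. The function and tool schema below are illustrative assumptions, not part of the FET pipeline.

```python
import csv
import io


def count_responses(csv_text: str, column: str) -> int:
    """A deterministic 'tool': count non-empty answers in a survey column.

    The model never has to guess this number; the code computes it exactly.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader if (row.get(column) or "").strip())


# Tool description in the JSON-schema style used by most function-calling
# APIs: the model asks to call it, our code executes it, and the exact
# count is fed back into the conversation.
COUNT_TOOL = {
    "type": "function",
    "function": {
        "name": "count_responses",
        "description": "Count non-empty answers for a survey question.",
        "parameters": {
            "type": "object",
            "properties": {"column": {"type": "string"}},
            "required": ["column"],
        },
    },
}
```

This pattern lets a small local model report verifiable figures (counts, dates, lookups) that no amount of training data could make it produce reliably on its own.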
¹ 20b stands for 20 billion parameters, meaning the model is significantly smaller than ChatGPT itself, or than the bigger versions of GPT-OSS (a 120b version is also available, which would require much stronger, but still publicly available, hardware).


