
What I learned setting up LLM infrastructure for a company — alone

Rizki Nugraha · 7 min read
LLM infrastructure

My boss asked me during a business trip whether we could run AI inference internally — have our own server instead of depending on external APIs. I said yes.

I had never done this before. What I knew: AI inference needs a GPU, and GPUs meant Nvidia. That was the full extent of my relevant knowledge at that moment. Somehow it was enough to commit to the task, because thirty seconds later I was responsible for specifying the hardware we'd buy.

The hardware decision

Budget was the first real constraint. We weren't buying a research cluster or a dedicated AI workstation from an enterprise catalogue. We were buying whatever made sense for a company that had never run inference before and wasn't sure yet how much use it would get.

After some research — which at this point meant reading whatever I could find quickly — we landed on an RTX A4000 with 16GB of VRAM. It fit the budget. It was a professional-grade card. 16GB seemed substantial compared to consumer options. I wrote up the spec, got approval, and we ordered it.

My confidence at this stage was based on a data point that turned out to be almost entirely irrelevant: I had run Ollama on my ThinkPad before, with no dedicated GPU at all, just the CPU. It was slow — uncomfortably slow — but it worked. A 7B model could answer questions. It didn't crash. If that was possible with no GPU, then 16GB of dedicated VRAM felt like headroom we'd never fully use.

This is the wrong conclusion to draw from that data. I just didn't know it yet.

What actually happened

The server arrived. I installed the Nvidia drivers, set up the software stack, and tried to run a 12B parameter model. It struggled. Not "a bit slow to start" — it genuinely struggled. Inference was painful, and under any kind of real load the system wasn't stable. This was not what I had implied I could deliver.

The problem wasn't raw GPU speed. The problem was memory. A 12B model loaded at full FP16 precision takes roughly 24GB just for the weights — before you account for the KV cache that grows with every token generated. I had 16GB. The math doesn't work. The model was being forced into CPU offloading, which meant most of the inference was happening on the same CPU path I'd already proved was too slow on my ThinkPad.

I had been thinking of VRAM like system RAM — more is better, 16GB is plenty for most things. I hadn't understood that model size in memory is determined by the number of parameters multiplied by the bytes per parameter, and that "12 billion parameters" has a direct, calculable relationship to how much VRAM you actually need.
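That relationship is simple enough to sketch. A back-of-the-envelope sizing helper (my own illustration, not from any library):

```python
# Rough VRAM needed just for model weights: parameter count times
# bytes per parameter. KV cache and activations come on top of this.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB; close enough for sizing

# 12B parameters at FP16 (16 bits each):
print(f"{weight_gb(12, 16):.0f} GB")  # 24 GB, over a 16 GB card before the KV cache
```

Running this for the A4000's 16GB makes the mismatch obvious before any hardware arrives.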

The learning, done properly this time

Quantization was the first thing I actually studied. The concept is straightforward once you understand the problem it solves: instead of storing each weight as a 16-bit float, you reduce the precision — 8-bit, 4-bit — and trade a small amount of accuracy for a large reduction in memory footprint. A 12B model that doesn't fit in 16GB at FP16 can be made to fit at 4-bit with quality loss that's acceptable for most practical tasks.

There are several quantization formats, and they're not equivalent. I ended up using AWQ — Activation-aware Weight Quantization — after reading comparisons. The core idea is that not all weights matter equally during inference. Some activations are consistently large and have an outsized effect on the output; AWQ identifies these during calibration and protects them from the worst rounding errors during compression. In practice this means meaningfully better output quality at the same 4-bit size compared to simpler quantization approaches. For a production use case where the results actually matter, this distinction is worth caring about.
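A toy, single-neuron illustration of that intuition (this is not the actual AWQ algorithm, and all the numbers are invented): when a small weight pairs with a consistently large activation, naive 4-bit rounding can erase its entire contribution, while rescaling that channel before rounding preserves it.

```python
# Toy intuition behind activation-aware quantization (not real AWQ):
# protect the weight that pairs with a large activation by rescaling it
# before rounding, and compensate by scaling its activation down.
def quantize_4bit(ws, scale):
    """Symmetric round-to-nearest onto 16 levels (-8..7), then dequantize."""
    return [max(-8, min(7, round(w / scale))) * scale for w in ws]

weights = [0.9, -0.8, 0.02, 0.7]   # channel 2: tiny weight...
acts    = [1.0,  1.0, 50.0, 1.0]   # ...but a consistently huge activation

exact = sum(w * a for w, a in zip(weights, acts))   # 1.8

scale = max(abs(w) for w in weights) / 7

# Naive rounding: 0.02 rounds to zero, its whole 1.0 contribution vanishes.
naive = sum(q * a for q, a in zip(quantize_4bit(weights, scale), acts))

# "Activation-aware": multiply the salient weight by s, divide its
# activation by s -- mathematically a no-op, but it survives rounding.
s = 10.0
w2, a2 = weights[:], acts[:]
w2[2] *= s
a2[2] /= s
aware = sum(q * a for q, a in zip(quantize_4bit(w2, scale), a2))

print(exact, naive, aware)  # the activation-aware estimate lands much closer
```

The real algorithm finds those salient channels from calibration data and searches for good scales per channel group, but the mechanism is this rescaling trick.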

The second gap in my knowledge was throughput. My initial setup was simple: load the model, expose an endpoint, handle requests one at a time. This works fine for a single user testing the system. Under any real concurrent load it falls apart, because each request waits for all the ones ahead of it, and LLM inference is inherently slow per token. I needed batching — the ability to interleave multiple requests through the model simultaneously.
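The arithmetic of why one-at-a-time serving collapses is worth making explicit. The decode speed and batching gain below are invented but plausible assumptions, not measurements:

```python
# Serial vs. batched serving, back of the envelope.
tokens_per_s = 30.0        # assumed single-stream decode speed
response_tokens = 300      # assumed average response length
per_request_s = response_tokens / tokens_per_s      # 10 s per request

# Serial queue: the 10th concurrent user waits for all 10 requests.
serial_10th = 10 * per_request_s                    # 100 s

# Batched (idealized): decoding is memory-bandwidth-bound, so a
# 10-wide batch might decode at ~6x the single-stream rate in
# aggregate (assumed factor). All 10 users finish together at:
batched_rate = tokens_per_s * 6                     # 180 tokens/s aggregate
batched_10th = 10 * response_tokens / batched_rate  # ~16.7 s

print(serial_10th, batched_10th)
```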

vLLM solved this. Its key contribution is PagedAttention, a memory management technique that treats the KV cache like virtual memory pages rather than pre-allocating fixed blocks. This makes batching efficient even when requests have different lengths, because you're not reserving the maximum possible memory for every slot upfront. For a single GPU with limited VRAM, the difference in how many concurrent requests you can handle is substantial.
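To make the KV cache pressure concrete, here's a sketch with illustrative 12B-class dimensions (the layer and head counts are assumptions, not any specific model's config):

```python
# Per generated token, every layer stores one key and one value vector.
layers, kv_heads, head_dim, fp16_bytes = 40, 8, 128, 2  # assumed shape
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V
print(per_token // 1024, "KiB per token")                  # 160 KiB

# One request at a 4096-token context:
per_request_mib = per_token * 4096 / 2**20
print(f"{per_request_mib:.0f} MiB per full-length request")  # 640 MiB
# Pre-allocating that maximum for every batch slot wastes most of it;
# paging the cache in small blocks is what lets wide batches fit.
```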

I want to be accurate about how I learned this: I didn't read about PagedAttention and then implement vLLM. I hit a throughput problem, searched for solutions, found vLLM, integrated it, and then understood why it worked. The theory made sense after the fact. That's how most of this went — encounter the problem, find the tool, understand the mechanism while using it.

What runs in production now

The A4000 currently serves a document OCR pipeline. PaddleOCR handles the visual extraction — pulling text regions from images — and a Qwen 4B model quantized with AWQ handles the language understanding layer on top of that. The 4B model was the deliberate choice here: the task is structured and well-defined enough that you don't need a large model to do it accurately, and a smaller footprint means PaddleOCR and the LLM can coexist on the same GPU without memory pressure forcing the system into degraded operation.
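As a rough sketch of what that serving configuration looks like (the model name and parameter values are illustrative assumptions, not our exact production settings), vLLM's `LLM` entry point exposes the knobs that matter here:

```python
# Illustrative vLLM setup: an AWQ-quantized model capped to a fraction
# of VRAM so another workload (here, OCR) can share the same GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-4B-AWQ",     # assumed checkpoint name
    quantization="awq",
    gpu_memory_utilization=0.6,    # leave VRAM for PaddleOCR (assumed value)
    max_model_len=4096,            # bound the KV cache per request
)
outputs = llm.generate(
    ["Extract the invoice number from the following text: ..."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

`gpu_memory_utilization` is the key parameter for coexistence: by default vLLM claims most of the card for its paged KV cache, so lowering it is what makes room for the OCR model.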

vLLM handles the serving. The system batches requests correctly. Under production load — real documents, real throughput — it doesn't slow down, and it hasn't gone down unexpectedly since we deployed it. The whole thing, from that conversation on a business trip to a stable production system, took under three months.

I won't pretend it was smooth. There were nights of failed inference runs, driver issues, model configuration I got wrong three times before getting right. But the end state is something I can stand behind: a system I built alone, on budget hardware, that works reliably for the use case it was designed for.

What I'd tell someone starting this

Your first hardware decision is going to be made with incomplete information. That's unavoidable — you can't know what you don't know yet. The RTX A4000 with 16GB was the correct purchase within our budget. It just required quantization to actually be usable for inference workloads, something I didn't understand when I wrote the spec. Had I known about AWQ from the start, I'd have made the exact same purchase with more confidence and less anxiety when the first naive approach failed.

The tools that mattered — vLLM for efficient serving, AWQ for fitting models into real hardware — both have documentation, but the documentation assumes you already understand the problems they're solving. Read them after you've encountered those problems. They make considerably more sense when you arrive at them with the context of something that's already broken.

The hardest part wasn't the technical knowledge — that's learnable. It was making a concrete hardware purchase decision before I had the knowledge to make it confidently, and then being responsible for making it work with whatever arrived. That sequence, decision then learning, is just how it goes when there's no one to ask and a real deadline.

If you're working on something similar — setting up local inference, choosing hardware, or integrating LLMs into a product — and want someone who has done it in production under real constraints, I'm available for freelance work.

Rizki Nugraha

Software engineer in Bandung, Indonesia. Backend systems in Laravel and Go, frontend in Vue.js. Currently maintaining LLM inference infrastructure (vLLM, SGLang) in production. Available for freelance work.
