Tech Stack
Type: AI Infrastructure

Self-hosted AI inference infrastructure serving OCR and document processing for internal company use. Built from zero knowledge, starting from a conversation on a business trip, and running stably in production ever since.
The company needed internal AI capabilities without cloud costs or data privacy concerns. Budget was limited. Nobody on the team had done this before. I was asked to list the hardware to buy before I fully understood what was needed.
Procured an RTX A4000 (16 GB VRAM) within the budget. After finding that a 12B model was too large to run reliably, learned quantization and switched to a 4B AWQ-quantized Qwen model, combined with PaddleOCR for document processing. Deployed on vLLM for fast, high-throughput inference. The system has never gone down or slowed under production load.
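The sizing decision can be sketched with back-of-envelope arithmetic (approximate figures, counting weights only and ignoring KV cache, activations, and CUDA overhead): fp16 weights cost ~2 bytes per parameter, so a 12B model needs ~24 GB before anything else, already past the A4000's 16 GB, while 4-bit AWQ weights cost ~0.5 bytes per parameter, putting a 4B model around 2 GB.

```python
# Back-of-envelope VRAM check for model weights (assumed figures;
# real usage also includes KV cache, activations, and runtime overhead).

def model_weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: (params_billions * 1e9 * bytes) / 1e9."""
    return params_billions * bytes_per_param

VRAM_GB = 16  # RTX A4000

fp16_12b = model_weight_gb(12, 2.0)  # fp16: 2 bytes per parameter
awq_4b = model_weight_gb(4, 0.5)     # AWQ 4-bit: ~0.5 bytes per parameter

print(f"12B fp16 weights: ~{fp16_12b:.0f} GB (fits {VRAM_GB} GB? {fp16_12b < VRAM_GB})")
print(f"4B AWQ weights:  ~{awq_4b:.0f} GB (fits {VRAM_GB} GB? {awq_4b < VRAM_GB})")
# → 12B fp16 weights: ~24 GB (fits 16 GB? False)
# → 4B AWQ weights:  ~2 GB (fits 16 GB? True)
```

The comfortable headroom left by the 4B quantized model is what lets vLLM dedicate the remaining VRAM to its KV cache, which is where its throughput comes from.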