Moving Large Language Models from quick demos to real production is tough. Traffic is unpredictable, context windows grow, and costs can explode if your serving setup isn’t optimized.
Here’s a better way: use **llm-d**, a Kubernetes-native open-source framework, on Oracle Cloud Infrastructure (OCI) to separate prompt processing from token generation. This “disaggregated” approach delivers more consistent latency, higher GPU efficiency, and lower overall cost without throwing more hardware at the problem.
Why Traditional LLM Serving Falls Short in Production
- Prompt ingestion (prefill) is compute-heavy
- Token generation (decode) is memory-bandwidth sensitive
- Running both on the same GPU replicas leads to poor utilization and inconsistent latency
- Scaling out identical replicas wastes resources under real user loads
llm-d solves this by letting you run specialized workers for the prefill and decode phases independently, so each pool can be sized and tuned for its own bottleneck.
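To make the prefill/decode contrast concrete, here is a rough back-of-the-envelope sketch. It assumes a dense transformer that spends about 2 FLOPs per parameter per token and streams every weight from GPU memory once per forward pass; attention and KV-cache reads are ignored, and the prompt length (4096) and decode batch size (16) are illustrative values, not measurements.

```python
def arithmetic_intensity(tokens_per_weight_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Roughly 2 FLOPs per parameter per token, and each parameter is read once
    per pass, so the parameter count cancels out of the ratio.
    """
    flops_per_param = 2 * tokens_per_weight_pass
    return flops_per_param / bytes_per_param


# Prefill: the whole prompt (assumed 4096 tokens) goes through in one pass.
prefill_intensity = arithmetic_intensity(tokens_per_weight_pass=4096)

# Decode: one new token per sequence per pass (assumed batch of 16 sequences).
decode_intensity = arithmetic_intensity(tokens_per_weight_pass=16)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte")
print(f"decode:  {decode_intensity:.0f} FLOPs/byte")
```

Under these assumptions prefill does far more math per byte moved than decode, which is why the first tends to be limited by compute and the second by memory bandwidth.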
Key Benefits You’ll See in Production
- 10–30% better GPU efficiency
- Much more stable latency even as user count grows
- Lower infrastructure cost for the same performance
- Easy scaling using familiar Kubernetes tools
Architecture Overview
llm-d on OCI runs on OCI Kubernetes Engine (OKE) with bare metal AMD MI300X GPU nodes. Prefill workers handle the compute-heavy prompt processing and hand the resulting KV cache to decode workers, which focus on fast token streaming. RDMA networking keeps that node-to-node transfer fast.
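Why the interconnect matters can be estimated from the size of the KV cache that a prefill worker hands to a decode worker for each request. The sketch below uses the published Llama-3.3-70B shape (80 layers, 8 KV heads, head dimension 128) with 16-bit values. The 8192-token prompt and the 400 Gb/s link speed are assumptions for illustration only, and the estimate ignores sharding across GPUs and protocol overhead.

```python
def kv_cache_bytes(prompt_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache produced by prefill for one prompt."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * prompt_tokens


# Model shape taken from the public Llama-3.3-70B configuration.
LLAMA_70B = {"num_layers": 80, "num_kv_heads": 8, "head_dim": 128}

ASSUMED_PROMPT_TOKENS = 8192   # hypothetical long prompt
ASSUMED_LINK_GBPS = 400        # hypothetical single-link bandwidth, in Gb/s

size = kv_cache_bytes(ASSUMED_PROMPT_TOKENS, **LLAMA_70B)
transfer_ms = size * 8 / (ASSUMED_LINK_GBPS * 1e9) * 1e3

print(f"KV cache: {size / 2**30:.2f} GiB")
print(f"Ideal transfer time: {transfer_ms:.1f} ms")
```

Even in this idealized case the handoff is a multi-gigabyte transfer per long prompt, so a slow network path would add directly to time-to-first-token.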
Quick Start: Deploy llm-d on OKE
1. Prepare Your OKE Cluster
# Create OKE cluster with AMD MI300X bare metal nodes
# Use shapes like BM.GPU.MI300X.8 for high-memory GPUs
# Install kubectl and configure access
oci ce cluster create-kubeconfig --cluster-id <your-cluster-id>
2. Deploy llm-d with Disaggregated Mode
# Clone the llm-d repo with OCI/AMD examples
git clone https://github.com/llm-d/llm-d.git
cd llm-d
# Apply the disaggregated deployment (prefill + decode)
kubectl apply -f examples/oci-amd/disaggregated/
# Check pods
kubectl get pods -n llm-serving
kubectl get services -n llm-serving
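Beyond eyeballing the pod list, readiness can be checked per role. The following is a minimal sketch that counts Ready pods grouped by a role label. The label key `llm-d.ai/role` and the sample pod names are assumptions; substitute whatever label your manifests actually set. The input has the same shape as the output of `kubectl get pods -n llm-serving -o json`.

```python
from collections import Counter

ROLE_LABEL = "llm-d.ai/role"  # assumption: adjust to the label your manifests use


def ready_pods_by_role(pod_list: dict, role_label: str = ROLE_LABEL) -> Counter:
    """Count pods whose Ready condition is True, grouped by role label."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        conditions = pod.get("status", {}).get("conditions", [])
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if is_ready:
            counts[labels.get(role_label, "unlabeled")] += 1
    return counts


# Hand-written sample in the shape of a Kubernetes PodList (hypothetical pods).
SAMPLE = {
    "items": [
        {"metadata": {"name": "prefill-0", "labels": {ROLE_LABEL: "prefill"}},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "decode-0", "labels": {ROLE_LABEL: "decode"}},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "decode-1", "labels": {ROLE_LABEL: "decode"}},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}

print(dict(ready_pods_by_role(SAMPLE)))
```

To run it against a live cluster, replace `SAMPLE` with `json.load(sys.stdin)` and pipe the `kubectl ... -o json` output into the script.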
3. Configure Your Model
# Example values for Llama-3.3-70B or similar
model:
  name: meta-llama/Llama-3.3-70B-Instruct
  tensor-parallel: 8
  pipeline-parallel: 2
serving:
  prefill:
    replicas: 4
    gpu-memory-utilization: 0.85
  decode:
    replicas: 8
    gpu-memory-utilization: 0.75
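Before applying values like these, it helps to check how many GPUs they imply. The sketch below assumes each replica occupies tensor-parallel × pipeline-parallel GPUs and that nodes are BM.GPU.MI300X.8 shapes with 8 GPUs each. The dictionary mirrors the example values above; how the real chart interprets these keys is an assumption.

```python
import math

GPUS_PER_NODE = 8  # BM.GPU.MI300X.8 provides 8 GPUs per node

# Mirrors the example values; the key layout is assumed, not taken from a schema.
config = {
    "model": {"tensor-parallel": 8, "pipeline-parallel": 2},
    "serving": {
        "prefill": {"replicas": 4},
        "decode": {"replicas": 8},
    },
}


def gpu_budget(cfg: dict, gpus_per_node: int = GPUS_PER_NODE) -> dict:
    """Return GPUs and whole nodes needed per role, plus a total."""
    gpus_per_replica = cfg["model"]["tensor-parallel"] * cfg["model"]["pipeline-parallel"]
    budget = {}
    for role, spec in cfg["serving"].items():
        gpus = spec["replicas"] * gpus_per_replica
        budget[role] = {"gpus": gpus, "nodes": math.ceil(gpus / gpus_per_node)}
    budget["total"] = {
        "gpus": sum(b["gpus"] for b in budget.values()),
        "nodes": sum(b["nodes"] for b in budget.values()),
    }
    return budget


budget = gpu_budget(config)
for role, need in budget.items():
    print(f"{role}: {need['gpus']} GPUs, {need['nodes']} nodes")
```

Read this way, the example values add up to 192 GPUs across 24 nodes, so a smaller starting cluster would need fewer replicas or a lower parallelism setting.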
Real-World Performance Gains
Teams running disaggregated inference on OCI typically see:
- Flatter, more predictable latency curves under load
- Better throughput per GPU compared to traditional serving
- Ability to serve more concurrent users on the same hardware
Best Practices for Production
- Start with 2–4 node clusters and scale horizontally
- Monitor GPU utilization separately for prefill and decode pods
- Use OCI Monitoring + Prometheus for custom dashboards
- Implement request routing based on prompt length when possible
- Enable auto-scaling based on queue depth and latency SLAs
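Two of the practices above, routing by prompt length and scaling on queue depth, can be sketched in a few lines. This is an illustration only: the 512-token threshold, the `PoolState` type, and the replica bounds are hypothetical, and the scaling rule is the same proportional formula the Kubernetes Horizontal Pod Autoscaler uses, applied to queue depth per replica. Latency SLAs are not modeled here.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolState:
    replicas: int      # replicas currently running in the pool
    queue_depth: int   # requests waiting across the whole pool


def route(prompt_tokens: int, threshold: int = 512) -> str:
    """Send long prompts to prefill workers; short ones go straight to decode."""
    # Assumed policy: below the threshold, a separate prefill hop is not worth it.
    return "prefill" if prompt_tokens >= threshold else "decode"


def desired_replicas(pool: PoolState, target_queue_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Proportional scaling on queue depth, clamped to the allowed range."""
    # Equivalent to ceil(replicas * (queue_per_replica / target)), the HPA rule
    # with "queue depth per replica" as the metric.
    desired = math.ceil(pool.queue_depth / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, desired))


print(route(4000), route(100))
print(desired_replicas(PoolState(replicas=4, queue_depth=40), target_queue_per_replica=5))
```

In a real deployment the threshold and targets would come from load tests on your own traffic, and prefill and decode pools would each get their own scaling target.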
Conclusion
llm-d on OCI gives you a modern, production-ready way to serve large language models efficiently. By separating prefill and decode phases on powerful AMD MI300X GPUs with OKE, you get better performance, lower costs, and much more predictable behavior under real traffic.
Whether you’re building copilots, RAG systems, or agentic workflows, this approach helps you move from “it works in the demo” to “it works reliably at scale.”
Ready to try it? Start with the official llm-d OCI examples and scale from there.