Self-Hosting LLMs: Llama 3, Mistral on Your Server
Self-hosting open-source LLMs gives you three things API providers don't: data never leaves your infrastructure, no rate limits, and predictable cost at scale. In 2026, Llama 3 70B and Mistral 7B are genuinely capable for most production tasks. Here's the practical setup.
When self-hosting makes sense
Self-hosting is the right choice when:
- Data privacy is non-negotiable (healthcare, legal, internal tools with sensitive data)
- Volume is high enough that API costs exceed hosting costs (~$500+/month in API bills)
- Rate limits block you (enterprise-scale batch processing)
- Latency must be predictable (no shared infrastructure throttling)
For most startups under $200/month in API costs: use hosted APIs. The operational overhead isn't worth it.
Model choices in 2026
| Model | Size | VRAM (float16) | Best for |
|-------|------|------|---------|
| Llama 3.1 8B | 8B | 16GB | Fast inference, low cost |
| Llama 3.1 70B | 70B | 140GB (A100×2) | Near-GPT-4 quality |
| Mistral 7B | 7B | 14GB | Fast, good at instruction following |
| Mixtral 8x7B | ~47B total (~13B active) | 90GB | MoE: only ~13B parameters are active per token, so it is fast for its size |
| Qwen 2.5 72B | 72B | 144GB | Strong at code, multilingual |
For most production workloads: Llama 3.1 8B (fast, cheap) for classification/summarization; Llama 3.1 70B (expensive but capable) for complex reasoning.
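The VRAM column in the model table follows from a simple rule of thumb: parameter count times bytes per parameter. The sketch below reproduces those figures. The function name `weight_memory_gb` is illustrative, and the estimate covers model weights only; the KV cache and activations need additional headroom on top.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


if __name__ == "__main__":
    models = [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Mistral 7B", 7), ("Qwen 2.5 72B", 72)]
    for name, size in models:
        print(f"{name}: ~{weight_memory_gb(size):.0f} GB at float16")
    # The same rule with 4 bits per parameter gives the quantized footprint.
    print(f"Llama 3.1 70B at 4-bit: ~{weight_memory_gb(70, bits_per_param=4):.0f} GB")
```

Treat the result as a lower bound when picking a GPU: a model whose weights just fit leaves no room for long contexts or concurrent requests.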
Hardware requirements
Llama 3.1 8B (float16):
GPU: 1× A10G (24GB VRAM) — $0.76/hr on AWS
Throughput: ~60 tokens/sec
Llama 3.1 70B (float16):
GPU: 2× A100 80GB — $6.14/hr on AWS
Throughput: ~20 tokens/sec
Mistral 7B (4-bit quantized):
GPU: 1× T4 (16GB VRAM) — $0.53/hr on AWS
Throughput: ~40 tokens/sec (quantized)
Quantization (4-bit via bitsandbytes or GGUF) cuts weight memory to roughly a quarter of float16 (about half of 8-bit), with ~5-10% quality loss. That is acceptable for most tasks.
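To compare the hardware options above on equal terms, the hourly price and throughput can be turned into a cost per million generated tokens. This is a rough sketch that assumes the GPU is busy 100% of the time and that the quoted tokens/sec is the total the instance produces; with request batching the aggregate throughput, and therefore the result, can look quite different. The function name is illustrative.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens for a fully utilized instance."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # (label, $/hr, tokens/sec) taken from the hardware list above
    options = [
        ("Llama 3.1 8B on 1x A10G", 0.76, 60),
        ("Llama 3.1 70B on 2x A100 80GB", 6.14, 20),
        ("Mistral 7B 4-bit on 1x T4", 0.53, 40),
    ]
    for label, hourly, tps in options:
        print(f"{label}: ${cost_per_million_tokens(hourly, tps):.2f} per 1M tokens")
```

At these figures the 70B setup costs over twenty times more per token than the 8B one, which is why routing only the hard requests to the larger model matters.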
Setup with Ollama (simplest)
Ollama is the fastest way to run LLMs locally or on a server:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run Llama 3.1 8B
ollama pull llama3.1:8b
ollama run llama3.1:8b
# Serve as OpenAI-compatible API (port 11434)
OLLAMA_HOST=0.0.0.0 ollama serve
Ollama exposes an OpenAI-compatible API — any client using the OpenAI SDK works with a base URL change:
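from openai import OpenAI

# Point the OpenAI SDK at the Ollama server instead of api.openai.com
client = OpenAI(
    base_url="http://your-server:11434/v1",
    api_key="ollama",  # required by the SDK, but Ollama ignores the value
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain gradient descent"}],
)
print(response.choices[0].message.content)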
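If you would rather not depend on the OpenAI SDK, the same call is a plain HTTP POST to the `/chat/completions` route under the base URL. Here is a minimal standard-library sketch; `build_chat_request` and `chat` are hypothetical helper names, and `your-server` is the same placeholder host as above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request an OpenAI-compatible chat endpoint expects."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("http://your-server:11434/v1", "llama3.1:8b", "Explain gradient descent")` should return the same text as the SDK call, provided the server is reachable.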
Setup with vLLM (production)
vLLM is the production-grade serving engine — higher throughput via continuous batching:
pip install vllm
# Serve Llama 3.1 8B
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --port 8000
vLLM's continuous batching allows multiple concurrent requests to share GPU computation — crucial for production where dozens of requests arrive per second.
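To see why continuous batching helps, here is a toy simulation that counts decode steps for a queue of requests with different output lengths. It is not vLLM's actual scheduler: it ignores prefill, KV-cache memory limits, and arrival times, and both function names are made up for illustration. Static batching holds each batch until its longest request finishes; continuous batching refills a slot as soon as a request completes.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request is done."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step produces one token for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    output_lengths = [10, 2, 2, 2]  # one long request followed by three short ones
    print("static:", static_batching_steps(output_lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these inputs the static schedule needs 12 steps and the continuous one needs 10, because the short requests no longer wait behind the long one. The gap widens as output lengths become more uneven.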
Throughput comparison:
| Engine | 8B model, 1× A10G | Notes |
|--------|-------------------|-------|
| Ollama | ~60 tokens/sec | Good for development |
| vLLM | ~180 tokens/sec | Production-grade |
| llama.cpp | ~40 tokens/sec (CPU) | No GPU needed |
Deploying on AWS
# ECS task definition for vLLM on G5 instance
containerDefinitions:
  - name: vllm
    image: vllm/vllm-openai:latest
    command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "8192"
    resourceRequirements:
      - type: GPU
        value: "1"
    environment:
      - name: HUGGING_FACE_HUB_TOKEN
        value: !Ref HFToken
    portMappings:
      - containerPort: 8000
Use G5 instances (NVIDIA A10G) on AWS for the best price/performance ratio. G5.2xlarge ($1.21/hr): 1× A10G 24GB, sufficient for 8B models at moderate load.
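After the task starts, the container still has to download and load the model before it can answer requests, so it helps to poll the server before sending traffic. The sketch below polls the OpenAI-compatible `/v1/models` route on port 8000. The helper names, retry count, and delay are assumptions to adjust for your deployment, and the host in the usage note is a placeholder.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs the server reports on /v1/models."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/v1/models", timeout=timeout) as resp:
        body = json.load(resp)
    return [model["id"] for model in body["data"]]


def wait_until_ready(base_url: str, fetch=fetch_model_ids, attempts: int = 30,
                     delay_sec: float = 10.0) -> list[str]:
    """Poll until the server responds, or raise after `attempts` failed tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except OSError as err:  # URLError and refused connections are OSError subclasses
            last_error = err
            time.sleep(delay_sec)
    raise TimeoutError(f"server at {base_url} not ready after {attempts} attempts") from last_error
```

Calling `wait_until_ready("http://your-server:8000")` should return a list containing `meta-llama/Meta-Llama-3.1-8B-Instruct` once the model has loaded.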
Cost comparison at scale
At 10M tokens/day:
| Option | Monthly cost |
|--------|-------------|
| Claude Haiku ($0.00025/1k) | $75 |
| GPT-4o mini ($0.00015/1k) | $45 |
| Llama 3.1 8B on G5.2xlarge | $880 (always-on) |
| Llama 3.1 8B on spot G5.2xlarge | $290 (spot) |
| OpenRouter free (rate limited) | $0 |
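Self-hosting an 8B model pays off above roughly 50M tokens/day on spot pricing: at the prices above, the spot instance breaks even at about 39M tokens/day against Claude Haiku and about 64M tokens/day against GPT-4o mini. An always-on instance needs roughly 117M to 196M tokens/day to break even. Below that, hosted APIs (especially via OpenRouter free tier) are cheaper.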
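The break-even volume can be derived directly from the cost table: divide the fixed monthly hosting cost by the API price per token. This sketch assumes a 30-day month and that one instance can actually serve the resulting volume, which depends on your throughput. The function name is illustrative.

```python
def break_even_tokens_per_day(monthly_hosting_usd: float, api_usd_per_1k_tokens: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which the API bill equals the hosting bill."""
    tokens_per_month = monthly_hosting_usd / api_usd_per_1k_tokens * 1000
    return tokens_per_month / days_per_month


if __name__ == "__main__":
    hosting = {"spot G5.2xlarge": 290, "always-on G5.2xlarge": 880}
    apis = {"Claude Haiku": 0.00025, "GPT-4o mini": 0.00015}
    for host_name, monthly in hosting.items():
        for api_name, price in apis.items():
            millions = break_even_tokens_per_day(monthly, price) / 1e6
            print(f"{host_name} vs {api_name}: ~{millions:.1f}M tokens/day")
```

Rerun it with your own instance price and the current API rates, since both change often.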
FAQ
What is the best open-source LLM for self-hosting in 2026?
Llama 3.1 8B for most tasks: it is fast, capable, and fits on a single A10G GPU. Llama 3.1 70B for tasks requiring near-GPT-4 quality where the cost/speed tradeoff is acceptable.
What's the difference between Ollama and vLLM?
Ollama is designed for ease of use (local development, prototyping). vLLM is production-grade, with continuous batching for significantly higher throughput under concurrent load.
How much VRAM do I need for Llama 3 70B?
~140GB for float16. With 4-bit quantization: ~35-40GB (fits on two A100 40GB GPUs). Quantization reduces quality slightly but is acceptable for most tasks.
Is self-hosting LLMs secure?
Self-hosting means your data never leaves your infrastructure, which is more secure for sensitive data than sending it to API providers. However, you're responsible for securing the inference server, managing access control, and keeping the model software updated.
Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: LLM Cost Optimization: Cut AI API Bills 10x · Deploy Always-On AI Agents on AWS for ~$17/mo.