Self-Hosting LLMs: Llama 3, Mistral on Your Server

Self-hosting open-source LLMs gives you three things API providers don't: data never leaves your infrastructure, no rate limits, and predictable cost at scale. In 2026, Llama 3 70B and Mistral 7B are genuinely capable for most production tasks. Here's the practical setup.

When self-hosting makes sense

Self-hosting is the right choice when:

Data privacy is non-negotiable (healthcare, legal, internal tools with sensitive data)
Volume is high enough that API costs exceed hosting costs (~$500+/month in API bills)
Rate limits block you (enterprise-scale batch processing)
Latency must be predictable (no shared infrastructure throttling)

For most startups under $200/month in API costs: use hosted APIs. The operational overhead isn't worth it.

Model choices in 2026

| Model | Size | VRAM (float16) | Best for |
|-------|------|----------------|----------|
| Llama 3.1 8B | 8B | 16GB | Fast inference, low cost |
| Llama 3.1 70B | 70B | 140GB (2× A100 80GB) | Near-GPT-4 quality |
| Mistral 7B | 7B | 14GB | Fast, good at instruction following |
| Mixtral 8x7B | ~47B total (~13B active per token) | 90GB | MoE: all experts sit in VRAM, but only ~13B parameters run per token |
| Qwen 2.5 72B | 72B | 144GB | Strong at code, multilingual |

For most production workloads: Llama 3.1 8B (fast, cheap) for classification/summarization; Llama 3.1 70B (expensive but capable) for complex reasoning.
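
One way to apply this split in an application is a small routing table that maps a task type to a model. The sketch below is illustrative only: the task labels are hypothetical, and the `llama3.1:70b` tag is assumed to follow the same naming as the `llama3.1:8b` tag used later in this article.

```python
# Hypothetical task labels mapped to model tags (the 70B tag name is an assumption).
MODEL_BY_TASK = {
    "classification": "llama3.1:8b",
    "summarization": "llama3.1:8b",
    "complex_reasoning": "llama3.1:70b",
}

# Fall back to the cheap model when the task type is not recognized.
DEFAULT_MODEL = "llama3.1:8b"


def pick_model(task: str) -> str:
    """Return the model tag to use for a given task label."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

Defaulting to the 8B model keeps unknown traffic on the cheaper GPU; whether that is the right default depends on your workload.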

Hardware requirements

Llama 3.1 8B (float16):
  GPU: 1× A10G (24GB VRAM) — $0.76/hr on AWS
  Throughput: ~60 tokens/sec
  
Llama 3.1 70B (float16):
  GPU: 2× A100 80GB — $6.14/hr on AWS
  Throughput: ~20 tokens/sec

Mistral 7B (4-bit quantized):
  GPU: 1× T4 (16GB VRAM) — $0.53/hr on AWS
  Throughput: ~40 tokens/sec (quantized)

Quantization (4-bit via bitsandbytes or GGUF) cuts the VRAM needed for the weights to roughly a quarter of the float16 figure, with ~5-10% quality loss. For most tasks, that is acceptable.
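
The VRAM figures in this article follow from simple arithmetic: parameter count times bytes per weight. The helper below is a rough sketch of that estimate. It covers the weights only and assumes 1 GB = 1e9 bytes; the KV cache, activations, and framework overhead need extra headroom on top, so real-world savings from quantization are smaller than the weights-only ratio suggests.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters x bytes per parameter = gigabytes.
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for name, params in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Mistral 7B", 7)]:
        fp16 = weight_vram_gb(params, 16)
        int4 = weight_vram_gb(params, 4)
        print(f"{name}: {fp16:.1f} GB in float16, {int4:.1f} GB at 4-bit")
```

For Llama 3.1 70B this gives 140 GB in float16 and 35 GB at 4-bit, matching the numbers quoted elsewhere in this article.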

Setup with Ollama (simplest)

Ollama is the fastest way to run LLMs locally or on a server:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run Llama 3.1 8B
ollama pull llama3.1:8b
ollama run llama3.1:8b

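# Serve the API on all interfaces (port 11434), including the OpenAI-compatible /v1 routes.
# Ollama has no built-in authentication, so restrict access with a firewall or reverse proxy.
OLLAMA_HOST=0.0.0.0 ollama serve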

Ollama exposes an OpenAI-compatible API — any client using the OpenAI SDK works with a base URL change:

from openai import OpenAI

client = OpenAI(
    base_url="http://your-server:11434/v1",
    api_key="ollama"  # the SDK requires a value; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain gradient descent"}]
)

# Print the assistant's reply
print(response.choices[0].message.content)
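
If you would rather not add the OpenAI SDK as a dependency, the same endpoint can be called with the Python standard library. This is a minimal sketch, assuming the server exposes the OpenAI-style `/v1/chat/completions` route shown above; `build_chat_request` and `chat` are hypothetical helper names, and `your-server` is the same placeholder host.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server):
# print(chat("http://your-server:11434/v1", "llama3.1:8b", "Explain gradient descent"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU server.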

Setup with vLLM (production)

vLLM is the production-grade serving engine — higher throughput via continuous batching:

pip install vllm

# Serve Llama 3.1 8B
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --port 8000

vLLM's continuous batching allows multiple concurrent requests to share GPU computation — crucial for production where dozens of requests arrive per second.
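
To see why continuous batching helps, here is a toy model that counts decode steps. It is not vLLM's scheduler, only a simplified sketch under stated assumptions: every request produces a known, positive number of output tokens, each step generates one token for every in-flight request, and prompt processing is ignored. With static batching a batch runs until its longest request finishes; with continuous batching a freed slot is refilled immediately.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each fixed batch runs until its longest request is done."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        total += max(batch)
    return total


def continuous_batching_steps(output_lengths, batch_size):
    """Freed slots are refilled from the queue before every step."""
    waiting = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [10, 2, 2, 2]  # one long request and three short ones
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this example static batching needs 12 steps and continuous batching needs 10, because the short requests fill the slot next to the long one instead of waiting for it.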

Throughput comparison:

| Engine | 8B model, 1× A10G | Notes |
|--------|-------------------|-------|
| Ollama | ~60 tokens/sec | Good for development |
| vLLM | ~180 tokens/sec | Production-grade |
| llama.cpp | ~40 tokens/sec (CPU) | No GPU needed |
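
Throughput translates directly into cost per token. The sketch below converts an hourly instance price and a sustained tokens/sec figure into dollars per million generated tokens. It assumes the GPU is busy 100% of the time and uses the $1.21/hr G5.2xlarge price quoted in the AWS section of this article; real utilization is lower, so treat the result as a best case.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at full, sustained utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Throughput figures taken from the comparison table above.
    for engine, tps in [("Ollama", 60), ("vLLM", 180)]:
        print(f"{engine}: ${cost_per_million_tokens(1.21, tps):.2f} per 1M tokens")
```

At these numbers the same instance costs roughly $5.60 per million tokens with Ollama and roughly $1.87 with vLLM, which is why the serving engine matters as much as the GPU.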

Deploying on AWS

# CloudFormation fragment: ContainerDefinitions of an AWS::ECS::TaskDefinition
# running vLLM on a G5 instance (CloudFormation property names are PascalCase)
ContainerDefinitions:
  - Name: vllm
    Image: vllm/vllm-openai:latest
    Command:
      - --model
      - meta-llama/Meta-Llama-3.1-8B-Instruct
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "8192"
    ResourceRequirements:
      - Type: GPU
        Value: "1"
    Environment:
      - Name: HUGGING_FACE_HUB_TOKEN
        Value: !Ref HFToken  # template parameter holding your Hugging Face token
    PortMappings:
      - ContainerPort: 8000

Use G5 instances (NVIDIA A10G) on AWS for the best price/performance ratio. A g5.2xlarge ($1.21/hr) has 1× A10G with 24GB of VRAM, which is sufficient for 8B models at moderate load.

Cost comparison at scale

At 10M tokens/day:

| Option | Monthly cost |
|--------|--------------|
| Claude Haiku ($0.00025/1k tokens) | $75 |
| GPT-4o mini ($0.00015/1k tokens) | $45 |
| Llama 3.1 8B on G5.2xlarge (always-on) | $880 |
| Llama 3.1 8B on G5.2xlarge (spot) | $290 |
| OpenRouter free (rate limited) | $0 |

Self-hosting pays off above ~50M tokens/day for 8B models. Below that, hosted APIs (especially via OpenRouter free tier) are cheaper.
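
The table's arithmetic, and the break-even point, can be reproduced with a few lines. This is a back-of-the-envelope sketch that assumes a 30-day month, a single blended price per 1k tokens, and a fixed monthly hosting bill; it does not check whether one GPU can actually serve the break-even volume.

```python
DAYS_PER_MONTH = 30  # assumption used throughout this sketch


def monthly_api_cost(tokens_per_day: float, usd_per_1k_tokens: float) -> float:
    """Monthly API bill for a given daily token volume."""
    return tokens_per_day * DAYS_PER_MONTH / 1000 * usd_per_1k_tokens


def break_even_tokens_per_day(hosting_usd_per_month: float, usd_per_1k_tokens: float) -> float:
    """Daily volume at which the API bill equals the hosting bill."""
    return hosting_usd_per_month / usd_per_1k_tokens * 1000 / DAYS_PER_MONTH


if __name__ == "__main__":
    print(f"Haiku at 10M tokens/day: ${monthly_api_cost(10_000_000, 0.00025):.0f}/month")
    for label, hosting in [("spot", 290), ("always-on", 880)]:
        volume = break_even_tokens_per_day(hosting, 0.00025)
        print(f"{label} break-even vs Haiku: {volume / 1e6:.0f}M tokens/day")
```

Against Claude Haiku pricing, the table's figures put break-even at about 39M tokens/day for the spot instance and about 117M tokens/day for the always-on instance, so the ~50M rule of thumb sits closest to the spot scenario.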

FAQ

What is the best open-source LLM for self-hosting in 2026? Llama 3.1 8B for most tasks — fast, capable, and fits on a single A10G GPU. Llama 3.1 70B for tasks requiring near-GPT-4 quality where cost/speed tradeoff is acceptable.

What's the difference between Ollama and vLLM? Ollama is designed for ease of use (local development, prototyping). vLLM is production-grade with continuous batching for significantly higher throughput under concurrent load.

How much VRAM do I need for Llama 3 70B? About 140GB in float16. With 4-bit quantization the weights take ~35-40GB, which fits on a single A100 80GB or across two A100 40GB GPUs with room left for the KV cache. Quantization reduces quality slightly, but that is acceptable for most tasks.

Is self-hosting LLMs secure? Self-hosting means your data never leaves your infrastructure — more secure for sensitive data than API providers. However, you're responsible for securing the inference server, managing access control, and keeping the model software updated.

Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: LLM Cost Optimization: Cut AI API Bills 10x · Deploy Always-On AI Agents on AWS for ~$17/mo.