_SH Log's
Back to Root
EST: 5 min read

Docker for AI Workloads: Isolation & GPU Access

AI workloads need proper Docker isolation — untrusted code execution, GPU access for inference, and resource limits. Here's the production configuration I use.

#docker#ai#security#devops

AI workloads have unusual Docker requirements: some need GPU access for inference, others need strict isolation for running untrusted LLM-generated code, and all need careful resource limits to prevent runaway processes. Here's how I configure Docker for each pattern.

Pattern 1: Isolated code execution (no GPU)

QuantumSketch runs LLM-generated Manim code in Docker. The code is untrusted — it could contain import subprocess or file system operations that shouldn't run on the host.

Security requirements:

  • No network access (prevent data exfiltration, no outbound calls)
  • Read-only filesystem (prevent host modification)
  • Limited resources (prevent DoS from infinite loops)
  • No privilege escalation
docker run --rm \
  --network none \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=512m \
  --memory 2g \
  --memory-swap 2g \  # same as memory = no swap
  --cpus 2.0 \
  --pids-limit 128 \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  -v /host/output:/output \  # write-only output directory
  my-sandbox-image:latest \
  python /workspace/scene.py

Flag breakdown:

  • --network none: completely disables networking
  • --read-only: root filesystem is read-only
  • --tmpfs /tmp: writable tmpfs in /tmp only, noexec prevents executing binaries from it
  • --memory-swap = --memory: disables swap (prevents swap exhaustion on host)
  • --cap-drop ALL: removes all Linux capabilities
  • --security-opt no-new-privileges: prevents setuid escalation

Pattern 2: GPU inference container

For GPU-accelerated inference (vLLM, Whisper, Stable Diffusion):

docker run --rm \
  --gpus all \                    # pass all GPUs
  --ipc=host \                    # shared memory for PyTorch (required)
  --ulimit memlock=-1 \           # required for CUDA pinned memory
  --shm-size=8g \                 # shared memory for GPU operations
  -e CUDA_VISIBLE_DEVICES=0 \    # limit to specific GPU
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct

Key flags for GPU containers:

  • --gpus all: exposes GPU via NVIDIA Container Toolkit
  • --ipc=host: PyTorch multiprocessing requires shared memory between processes; IPC namespace isolation breaks this
  • --ulimit memlock=-1: CUDA pinned memory requires no lock limit
  • --shm-size=8g: shared memory for GPU tensor operations

Never run GPU inference containers with --network none — they need to download model weights on first run. Use network-enabled for GPU, network-disabled for untrusted code.

Pattern 3: Multi-stage build for lean images

Production AI service images should be small — faster pulls, faster cold starts:

# Multi-stage: build stage
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Production stage
FROM python:3.11-slim

# Don't run as root
RUN useradd -m -u 1000 appuser

WORKDIR /app
COPY --from=builder /install /usr/local
COPY --chown=appuser:appuser . .

USER appuser

# Explicit CMD, not shell form
CMD ["/usr/local/bin/python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0"]

Key practices:

  • Multi-stage: build dependencies in one stage, copy only what's needed to production
  • Non-root user: USER appuser — never run production as root
  • CMD in exec form (array), not shell form — proper signal handling for graceful shutdown
  • --no-cache-dir on pip: reduces image size

Pattern 4: Docker Compose for local development

# docker-compose.yml for QuantumSketch local dev
version: "3.9"

services:
  api:
    build: ./services/quantumsketch-api
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://dev:dev@postgres:5432/qs
      REDIS_URL: redis://redis:6379
      TEMPORAL_HOST: temporal:7233
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started

  worker:
    build: ./services/quantumsketch-worker
    environment:
      TEMPORAL_HOST: temporal:7233
    depends_on:
      - temporal

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: qs
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "dev"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine

  temporal:
    image: temporalio/auto-setup:latest
    environment:
      DB: postgresql
      DB_PORT: 5432
      POSTGRES_USER: dev
      POSTGRES_PWD: dev
      POSTGRES_SEEDS: postgres
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  pgdata:

pgvector/pgvector:pg16 — official PostgreSQL image with pgvector pre-installed. No manual extension setup.

Resource limits for production (ECS Fargate)

ECS Fargate enforces CPU and memory hard limits:

{
  "containerDefinitions": [{
    "name": "manim-worker",
    "image": "123456789.dkr.ecr.ap-south-1.amazonaws.com/manim-worker:latest",
    "cpu": 2048,
    "memory": 4096,
    "memoryReservation": 2048,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/manim-worker",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "ulimits": [
      {"name": "nofile", "softLimit": 65536, "hardLimit": 65536}
    ]
  }]
}

memoryReservation < memory: the container is guaranteed memoryReservation (soft limit) but can burst to memory (hard limit). This allows overcommitting when other containers aren't using their full allocation.

FAQ

How do you isolate LLM-generated code in Docker? Use --network none, --read-only, --cap-drop ALL, --security-opt no-new-privileges, and --pids-limit. Together, these prevent network access, filesystem writes, privilege escalation, and process proliferation.

What is --ipc=host and when do I need it? --ipc=host shares the host's IPC namespace with the container, enabling shared memory between processes. Required for PyTorch multi-GPU inference and some multiprocessing workloads. Do not use for untrusted code containers — it reduces isolation.

Why does --gpus all require NVIDIA Container Toolkit? Docker doesn't have native GPU passthrough. NVIDIA Container Toolkit (nvidia-container-runtime) intercepts --gpus flags and sets up device bindings between the GPU driver on the host and the container. Install it on the host before using --gpus.

How do I use pgvector in Docker for local development? Use the official pgvector/pgvector:pg16 image — it's PostgreSQL 16 with pgvector pre-installed. Just docker run pgvector/pgvector:pg16 and run CREATE EXTENSION vector;.


Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: Building QuantumSketch: AI + Manim for STEM Video · Microservices as One Engineer.