oLLM — Dumb‑It‑Down Guide

Step 2 — Hardware Reality Check

What can your machine *actually* run? This page gives you three sane paths, a reality check on VRAM vs. model size, and copy‑paste probes to discover what you’ve got. Zero fluff.

Reality check: dense 150B models do not fit in 12 GB of VRAM. With 4‑bit quantization you’re realistically in the 7B–13B range on a single 12 GB card (CPU offload okay). Mixture‑of‑Experts (MoE) models may advertise “150B total,” but only a fraction of those parameters is active per token; the weights still have to live somewhere, so assume heavy memory needs anyway.
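
A quick back‑of‑the‑envelope: quantized weight size ≈ parameter count × bits ÷ 8, plus overhead for runtime buffers and context. Here’s a minimal shell sketch; the ~20% overhead factor is an assumption, not a measurement:

# rough weight-memory estimate: params (in billions) * bits / 8, plus ~20% assumed overhead
estimate_vram() {
  awk -v p="$1" -v bits="$2" \
    'BEGIN { printf "~%.1f GB for a %sB model at %s-bit\n", p * bits / 8 * 1.2, p, bits }'
}
estimate_vram 7 4     # ~4.2 GB  -> fits a 12 GB card with room for context
estimate_vram 13 4    # ~7.8 GB  -> tight but workable on 12 GB
estimate_vram 150 4   # ~90.0 GB -> nowhere near 12 GB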

Pick Your Path

Path A — 12 GB VRAM Works Today

Example GPUs: RTX 3060 12 GB, RTX 2060 12 GB, desktop RTX 4070.

  • System RAM: 32–64 GB recommended
  • Models: 7B–13B (Q4/Q5). Great chat + light tools.
  • Fine‑tuning: LoRA/QLoRA on 7B (modest batch/seq).
  • Storage: 1 TB NVMe is comfy; 4 TB is future‑friendly.

Path B — 24 GB VRAM Sweet Spot

Example GPUs: RTX 3090/4090; some pro cards.

  • Models: 13B–20B silky; 7B/13B fine‑tunes very comfy.
  • Agents: RAG + tool‑calling + TTS/STT all local is smooth.
  • Thermals/PSU: watch power draw + airflow (quick probe below).
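
If you’re on NVIDIA, you can watch temperature and power draw under load with a stock nvidia-smi query:

# live temperature and power readout (NVIDIA only)
nvidia-smi -q -d TEMPERATURE,POWER

Run it while a model is generating to see how close you are to the card’s power limit.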

Path C — 48 GB+ VRAM Ambitious

Example GPUs: dual 3090s, 4090 + 3090, A6000/RTX 6000 Ada.

  • Models: 70B with heavy offload becomes practical.
  • Multi‑GPU: check NVLink/PCIe limits & runtime support (topology probe below).
  • Noise/heat: treat it like a small server.
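
On multi‑GPU boxes, nvidia-smi can print the interconnect topology so you can see whether the cards talk over NVLink or plain PCIe:

# show GPU-to-GPU link topology (NV# = NVLink, PHB/PIX = PCIe paths)
nvidia-smi topo -m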

What Runs on My Hardware?

| VRAM | System RAM | Dense models (Q4/Q5) | MoE (headline params) | Notes |
|---|---|---|---|---|
| 8–12 GB | 32–64 GB | 7B–13B | Labels up to ~50–150B | Good chat/agents; offload to CPU is OK; expect slower long contexts. |
| 24 GB | 48–96 GB | 13B–20B (snappy) | High‑headline MoE possible | Great for RAG + tools; comfy QLoRA fine‑tunes. |
| 48–80 GB | 64–128 GB | 33B–70B (hybrid/offload) | Very high headline MoE | Think workstation/server; power & cooling matter. |

Rule of thumb: bigger context + bigger model ⇒ more VRAM + RAM. Quantization helps, but it isn’t magic.
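
Context length is the other memory eater: the KV cache grows linearly with tokens. The sketch below assumes an FP16 cache and the layer/head/dim counts typical of a 7B dense model without grouped‑query attention (32 layers, 32 KV heads, head dim 128); real models vary, and GQA shrinks this a lot:

# rough KV-cache estimate: 2 (K+V) * layers * kv_heads * head_dim * 2 bytes (fp16) * tokens
kv_cache_gb() {
  awk -v layers=32 -v heads=32 -v dim=128 -v tokens="$1" \
    'BEGIN { printf "~%.2f GB of KV cache for %s tokens\n", 2*layers*heads*dim*2*tokens/1e9, tokens }'
}
kv_cache_gb 4096     # ~2.15 GB
kv_cache_gb 32768    # ~17.18 GB -> long contexts eat memory fast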


Probe Your Machine (Copy & Paste)

GPU + Driver

Check your GPU and driver on Linux (live USB from Step 1 works):

lspci | grep -Ei 'vga|3d|display'

nvidia-smi || echo "No NVIDIA driver loaded"

/opt/rocm/bin/rocminfo 2>/dev/null | head -n 40 || echo "No ROCm (AMD)"
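
If nvidia-smi works, you can pull just the headline numbers:

# GPU name, total VRAM, and driver version in one line
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader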

CPU, RAM, Disk

lscpu | sed -n '1,12p'

free -h

lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT | grep -E 'nvme|sd'

Quick VRAM Reality

Try to load a modest model first. If it constantly offloads to CPU, drop to a smaller model or a lower‑bit quant.

# (Will be used in Step 3 with Ollama)

ollama run mistral:instruct
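
While it’s answering, keep an eye on VRAM from a second terminal:

# refresh GPU memory usage every second (Ctrl+C to stop)
watch -n 1 nvidia-smi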


Checklist — Before Step 3

Minimums

  • GPU: 12 GB VRAM (works) / 24 GB (ideal)
  • RAM: 32–64 GB
  • Disk: 1 TB NVMe (okay) / 4 TB (best)
  • Stable power + ventilation

Decisions

  • Pick Path A/B/C above
  • Decide: chat + agents only, or fine‑tunes too?
  • Plan your context length needs (short chat vs. long docs)

Nice‑to‑Haves

  • UPS (battery) for safe shutdowns
  • External backup drive
  • Second NVMe slot for model zoo

Darren’s Outline Notes

Paste the key bullets from anykeycafe.com/little-ougway here when you have them. We’ll map each line to your Step 3 software choices (Ollama, Open WebUI/AnythingLLM, RAG, tools, TTS/STT), and flag anything that needs a reality tweak.

Next: Step 3 — Software Setup

We’ll install Ollama and a friendly UI (Open WebUI or AnythingLLM), then wire in agents/tools. If you’ve finished the checklist above, you’re ready.