oLLM — Dumb‑It‑Down Guide

Step 2 — Hardware Reality Check

What can your machine *actually* run? This page gives you three sane paths, a reality check on VRAM vs. model size, and copy‑paste probes to discover what you’ve got. Zero fluff.

Reality check: dense 150B models do not fit in 12 GB of VRAM. With 4‑bit quantization you’re realistically in the 7B–13B range on a single 12 GB card (CPU offload okay). Mixture‑of‑Experts (MoE) models may advertise “150B total,” but only a fraction of those parameters is active per token; the weights still have to live somewhere, so assume heavy memory needs anyway.
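
A quick back‑of‑the‑envelope: quantized weight size ≈ parameter count × bits ÷ 8, plus overhead for runtime buffers and context. Here’s a minimal shell sketch; the ~20% overhead factor is an assumption, not a measurement:

# rough weight-memory estimate: params (in billions) * bits / 8, plus ~20% assumed overhead
estimate_vram() {
  awk -v p="$1" -v bits="$2" \
    'BEGIN { printf "~%.1f GB for a %sB model at %s-bit\n", p * bits / 8 * 1.2, p, bits }'
}
estimate_vram 7 4     # ~4.2 GB  -> fits a 12 GB card with room for context
estimate_vram 13 4    # ~7.8 GB  -> tight but workable on 12 GB
estimate_vram 150 4   # ~90.0 GB -> nowhere near 12 GB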

Pick Your Path

Path A — 12 GB VRAM Works Today

Example GPUs: RTX 3060 12 GB, RTX 2060 12 GB, desktop RTX 4070.

  • System RAM: 32–64 GB recommended
  • Models: 7B–13B (Q4/Q5). Great chat + light tools.
  • Fine‑tuning: LoRA/QLoRA on 7B (modest batch/seq).
  • Storage: 1 TB NVMe is comfy; 4 TB is future‑friendly.

Path B — 24 GB VRAM Sweet Spot

Example GPUs: RTX 3090/4090; some pro cards.

  • Models: 13B–20B silky; 7B/13B fine‑tunes very comfy.
  • Agents: RAG + tool‑calling + TTS/STT all local is smooth.
  • Thermals/PSU: watch power draw + airflow (quick probe below).
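
If you’re on NVIDIA, you can watch temperature and power draw under load with a stock nvidia-smi query:

# live temperature and power readout (NVIDIA only)
nvidia-smi -q -d TEMPERATURE,POWER

Run it while a model is generating to see how close you are to the card’s power limit.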

Path C — 48 GB+ VRAM Ambitious

Example GPUs: dual 3090s, 4090 + 3090, A6000/RTX 6000 Ada.

  • Models: 70B with heavy offload becomes practical.
  • Multi‑GPU: check NVLink/PCIe limits & runtime support (topology probe below).
  • Noise/heat: treat it like a small server.
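
On multi‑GPU boxes, nvidia-smi can print the interconnect topology so you can see whether the cards talk over NVLink or plain PCIe:

# show GPU-to-GPU link topology (NV# = NVLink, PHB/PIX = PCIe paths)
nvidia-smi topo -m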

What Runs on My Hardware?

| VRAM | System RAM | Dense models (Q4/Q5) | MoE (headline params) | Notes |
|---|---|---|---|---|
| 8–12 GB | 32–64 GB | 7B–13B | Labels up to ~50–150B | Good chat/agents; offload to CPU is OK; expect slower long contexts. |
| 24 GB | 48–96 GB | 13B–20B (snappy) | High‑headline MoE possible | Great for RAG + tools; comfy QLoRA fine‑tunes. |
| 48–80 GB | 64–128 GB | 33B–70B (hybrid/offload) | Very high headline MoE | Think workstation/server; power & cooling matter. |

Rule of thumb: bigger context + bigger model ⇒ more VRAM + RAM. Quantization helps, but it isn’t magic.
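
Context length is the other memory eater: the KV cache grows linearly with tokens. The sketch below assumes an FP16 cache and the layer/head/dim counts typical of a 7B dense model without grouped‑query attention (32 layers, 32 KV heads, head dim 128); real models vary, and GQA shrinks this a lot:

# rough KV-cache estimate: 2 (K+V) * layers * kv_heads * head_dim * 2 bytes (fp16) * tokens
kv_cache_gb() {
  awk -v layers=32 -v heads=32 -v dim=128 -v tokens="$1" \
    'BEGIN { printf "~%.2f GB of KV cache for %s tokens\n", 2*layers*heads*dim*2*tokens/1e9, tokens }'
}
kv_cache_gb 4096     # ~2.15 GB
kv_cache_gb 32768    # ~17.18 GB -> long contexts eat memory fast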


Probe Your Machine (Copy & Paste)

GPU + Driver

Check your GPU and driver on Linux (live USB from Step 1 works):

lspci | grep -Ei 'vga|3d|display'

nvidia-smi || echo "No NVIDIA driver loaded"

/opt/rocm/bin/rocminfo 2>/dev/null | head -n 40 || echo "No ROCm (AMD)"
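
If nvidia-smi works, you can pull just the headline numbers:

# GPU name, total VRAM, and driver version in one line
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader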

CPU, RAM, Disk

lscpu | sed -n '1,12p'

free -h

lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT | grep -E 'nvme|sd'

Quick VRAM Reality

Try to load a modest model first. If it constantly offloads to CPU, drop to a smaller model or a lower‑bit quant.

# (Will be used in Step 3 with Ollama)

ollama run mistral:instruct
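
While it’s answering, keep an eye on VRAM from a second terminal:

# refresh GPU memory usage every second (Ctrl+C to stop)
watch -n 1 nvidia-smi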


Checklist — Before Step 3

Minimums

  • GPU: 12 GB VRAM (works) / 24 GB (ideal)
  • RAM: 32–64 GB
  • Disk: 1 TB NVMe (okay) / 4 TB (best)
  • Stable power + ventilation

Decisions

  • Pick Path A/B/C above
  • Decide: chat + agents only, or fine‑tunes too?
  • Plan your context length needs (short chat vs. long docs)

Nice‑to‑Haves

  • UPS (battery) for safe shutdowns
  • External backup drive
  • Second NVMe slot for model zoo

Darren’s Outline Notes

Paste the key bullets from anykeycafe.com/little-ougway here when you have them. We’ll map each line to your Step 3 software choices (Ollama, Open WebUI/AnythingLLM, RAG, tools, TTS/STT), and flag anything that needs a reality tweak.

Next: Step 3 — Software Setup

We’ll install Ollama and a friendly UI (Open WebUI or AnythingLLM), then wire in agents/tools. If you’ve finished the checklist above, you’re ready.