What can your machine *actually* run? This page gives you three sane paths, a reality‑check on VRAM vs. model size, and copy‑paste probes to discover what you’ve got. Zero fluff.
- **8–12 GB VRAM:** RTX 3060 12GB, RTX 2060 12GB, some laptop 4070s.
- **24 GB VRAM:** RTX 3090/4090; some pro cards.
- **48 GB+ VRAM (often multi-GPU):** dual 3090s, 4090 + 3090, A6000/RTX 6000 Ada.
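Not sure which tier you're in? On NVIDIA hardware this one-liner reports the card name and total VRAM (AMD folks: grep the `rocminfo` output from the probe block further down):

```bash
# Card name and total VRAM, one row per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```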
| VRAM | System RAM | Dense models (~4–5-bit quant) | MoE (headline params) | Notes |
|---|---|---|---|---|
| 8–12 GB | 32–64 GB | 7B–13B | Up to ~50–150B "headline" labels | Good chat/agents; offloading to CPU is OK; expect slower big contexts. |
| 24 GB | 48–96 GB | 13B–20B (snappy) | High-headline MoE possible | Great for RAG + tools; comfy QLoRA fine-tunes. |
| 48–80 GB | 64–128 GB | 33B–70B (hybrid/offload) | Very high-headline MoE | Think workstation/server; power & cooling matter. |
Rule of thumb: bigger context + bigger model ⇒ more VRAM + RAM. Quantization helps a lot; it isn't magic.
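If you'd rather compute a rough number than read it off the table: weight memory is roughly `params × quant-bits ÷ 8` bytes, plus an fp16 KV cache that grows with context length. The shell function below is a hypothetical back-of-envelope helper, not part of any tool, and the layer/head dimensions in it are assumptions for a typical 7B-class architecture:

```bash
# Hypothetical helper: ballpark VRAM (GB) for a dense model.
# Usage: vram_gb <params_in_billions> [quant_bits] [context_tokens]
vram_gb() {
  awk -v p="$1" -v b="${2:-4.5}" -v c="${3:-8192}" 'BEGIN {
    weights = p * b / 8                    # weights: params (billions) x bytes per weight
    kv = 2 * 32 * 8 * 128 * 2 * c / 1e9    # fp16 KV cache; assumes 32 layers, 8 KV heads, head dim 128
    printf "~%.1f GB\n", weights + kv + 1  # +1 GB fudge for runtime overhead
  }'
}
vram_gb 7            # 7B at ~4.5-bit, 8k context -> roughly 6 GB
vram_gb 13 4.5 4096  # 13B at ~4.5-bit, shorter context
```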
Check your GPU and driver on Linux (live USB from Step 1 works):
```bash
# GPU and driver
lspci | grep -Ei 'vga|3d|display'
nvidia-smi || echo "No NVIDIA driver loaded"
/opt/rocm/bin/rocminfo 2>/dev/null | head -n 40 || echo "No ROCm (AMD)"

# CPU, RAM, and fast storage
lscpu | sed -n '1,12p'
free -h
lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT | grep -E 'nvme|sd'
```
Try to load a modest model first. If it constantly spills over to CPU, step down to a smaller model or a lower quant.
```bash
# (Will be used in Step 3 with Ollama)
ollama run mistral:instruct
```
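While the model is loaded, `ollama ps` shows what's running and how it's split between GPU and CPU; a big CPU share means it's time to step down:

```bash
# Run in a second terminal while the model is loaded
ollama ps
```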
Paste the key bullets from anykeycafe.com/little-ougway here when you have them. We’ll map each line to your Step 3 software choices (Ollama, Open WebUI/AnythingLLM, RAG, tools, TTS/STT), and flag anything that needs a reality tweak.
We’ll install Ollama and a friendly UI (Open WebUI or AnythingLLM), then wire in agents/tools. If you’ve finished the checklist above, you’re ready.
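If you want a head start on that step, Ollama's official Linux installer is a one-liner (read the script first if piping curl into a shell makes you twitchy):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```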