Local AI

Can I run this AI model?

Estimate whether a laptop, desktop, or workstation has enough VRAM and RAM for a specific local model, quantization, and context length.

When to use this

Use it before downloading a large model or upgrading a GPU

This checker is for Ollama, LM Studio, llama.cpp, Open WebUI, and similar local AI setups where the practical question is whether model weights plus KV cache fit in memory.

Default result

Llama 3.1 8B at Q4_K_M with 8,192 tokens needs about 7.7 GB after reserve and safety margin, so it fits on an 8 GB GPU with only a thin margin.

Worked example: Llama 3.1 8B on an 8 GB GPU

Using Q4_K_M at about 4.5 effective bits per weight, an 8B model needs about 8 billion parameters x 4.5 bits / 8 bits per byte = 4.5 GB for weights before runtime overhead. The Ollama Llama 3.1 8B row lists a 4.9 GB model file, slightly above the formula result, which is a useful real-world check against the formula.

With the Llama 3.1 8B architecture, an 8,192-token FP16 KV cache is 2 x 32 layers x 8 KV heads x 128 head dimension x 8,192 tokens x 2 bytes / 2^30 = 1.0 GB. Add a 1.5 GB runtime reserve and a 10% safety margin: (4.5 + 1.0 + 1.5) x 1.10 = 7.7 GB.

That default scenario fits into 8 GB VRAM, but the margin is only about 0.3 GB. If you raise context to 32,768 tokens, the KV cache becomes 4.0 GB and the estimate rises to (4.5 + 4.0 + 1.5) x 1.10 = 11.0 GB, so the same setup no longer fits on an 8 GB GPU.
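The arithmetic in this worked example can be reproduced with a short script. This is a minimal sketch, not the checker's actual code: the function names are made up for illustration, and the defaults are simply the Llama 3.1 8B values, 1.5 GB reserve, and 10% margin quoted above.

```python
GIB = 2**30  # the KV cache formula in the text divides bytes by 2^30


def weights_gb(params_billions, bits_per_weight):
    """Weight memory in GB: billions of parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(context_tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache in GB; defaults are the Llama 3.1 8B values quoted in the text."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / GIB


def total_gb(context_tokens, params_billions=8, bits_per_weight=4.5,
             reserve_gb=1.5, safety_margin=0.10):
    """Weights + KV cache + runtime reserve, scaled up by the safety margin."""
    subtotal = (weights_gb(params_billions, bits_per_weight)
                + kv_cache_gb(context_tokens)
                + reserve_gb)
    return subtotal * (1 + safety_margin)


default_total = total_gb(8_192)
long_total = total_gb(32_768)
print(f"8,192 tokens:  {default_total:.1f} GB")   # 7.7 GB
print(f"32,768 tokens: {long_total:.1f} GB")      # 11.0 GB
```

Changing `reserve_gb` or `safety_margin` shows how sensitive the 8 GB verdict is to those planning assumptions.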

How we calculate local model memory

The estimate separates model weights from the KV cache. Weight memory uses the model parameter count and the selected effective bits per weight. KV cache memory uses the Llama-style formula: 2 x layers x KV heads x head dimension x context tokens x bytes per KV value, divided by 2^30 to convert bytes to GB. The leading 2 accounts for storing both keys and values.
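One way to read the KV cache formula is as a fixed cost per token that is then multiplied by the context length. The sketch below assumes the Llama 3.1 8B architecture values used elsewhere on this page; the 1-byte case is a hypothetical 8-bit KV cache that ignores any per-block scaling overhead a real runtime may add.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    # Leading 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens, bytes_per_token):
    # Divide by 2^30, as in the formula above.
    return context_tokens * bytes_per_token / 2**30


fp16_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_value=2)
fp16_cache = kv_cache_gb(8_192, fp16_per_token)

# Hypothetical 8-bit KV values (1 byte each), for comparison only.
int8_cache = kv_cache_gb(8_192, kv_bytes_per_token(32, 8, 128, bytes_per_value=1))

print(fp16_per_token)  # 131072 bytes per token
print(fp16_cache)      # 1.0
print(int8_cache)      # 0.5
```

Under these assumptions each token of context costs 128 KiB of FP16 cache, so halving the bytes per KV value halves the cache.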

Can I run Llama 3.1 8B on 8 GB VRAM?

Often yes at Q4 with moderate context, but the margin is small. The default example fits because the estimated total is 7.7 GB. A longer context, multiple sessions, or a heavier runtime can push the same setup into spill territory.

How context length changes local AI memory

Context length affects KV cache memory linearly. If you raise context from 8,192 to 32,768 tokens, the KV cache is about four times larger for the same architecture and KV precision.
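The linear relationship is easy to see by holding the architecture fixed and varying only the token count. A small sketch, assuming the Llama 3.1 8B values and FP16 KV precision from the worked example (the 16,384 row is added only as an intermediate point):

```python
# Per-token KV bytes: 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def kv_cache_gb(context_tokens):
    return context_tokens * BYTES_PER_TOKEN / 2**30


sizes = {tokens: kv_cache_gb(tokens) for tokens in (8_192, 16_384, 32_768)}
for tokens, gb in sizes.items():
    print(f"{tokens:>6} tokens -> {gb:.1f} GB KV cache")

ratio = sizes[32_768] / sizes[8_192]
print(f"32,768 vs 8,192 tokens: {ratio:.0f}x")  # 4x
```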

What to change when the model spills out of VRAM

First reduce context length or active sessions. Then test Q4 if you were using Q8 or FP16. If the model weights alone are larger than VRAM, expect CPU/RAM offload or a smaller model to be the more realistic path.
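The order of those steps can be written as a small triage function. This is an illustrative sketch only: `next_step` and `max_context_tokens` are hypothetical helpers, and the reserve, margin, and per-token KV size are the planning assumptions from the worked example, not measured values.

```python
def next_step(vram_gb, weights_gb, kv_gb, reserve_gb=1.5, safety_margin=0.10):
    """Return a coarse recommendation following the order described above."""
    total = (weights_gb + kv_gb + reserve_gb) * (1 + safety_margin)
    if weights_gb > vram_gb:
        return "offload-or-smaller-model"    # weights alone exceed VRAM
    if total > vram_gb:
        return "reduce-context-or-quantize"  # spill: trim context/sessions, then test Q4
    return "fits"


def max_context_tokens(vram_gb, weights_gb, bytes_per_token=131_072,
                       reserve_gb=1.5, safety_margin=0.10):
    """Largest context whose KV cache still fits after reserve and margin."""
    kv_budget_gb = vram_gb / (1 + safety_margin) - weights_gb - reserve_gb
    if kv_budget_gb <= 0:
        return 0
    return int(kv_budget_gb * 2**30 // bytes_per_token)


print(next_step(8, weights_gb=4.5, kv_gb=1.0))   # fits
print(next_step(8, weights_gb=4.5, kv_gb=4.0))   # reduce-context-or-quantize
print(next_step(8, weights_gb=16.0, kv_gb=1.0))  # offload-or-smaller-model
print(max_context_tokens(8, weights_gb=4.5))
```

With these assumptions the 8 GB card tops out at roughly 10,400 tokens of FP16 context for the Q4_K_M 8B model, which is why 32,768 tokens spills.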

Benchmark anchor vs real tokens per second

The speed estimate is an anchor from a cited llama.cpp CPU benchmark, scaled only as a planning hint. Real throughput depends on memory bandwidth, GPU backend, CPU generation, drivers, batch size, prompt shape, and runtime settings.

Source notes and assumptions

Reference values are stamped as of June 2026. Verify Ollama model sizes, context windows, and provider documentation before using the result for a purchase decision. Runtime reserve, safety margin, and size-adjusted speed are editable planning assumptions.

Model	Listed size	Listed context	Source	Notes
Llama 3.1 8B	4.9 GB	128K	Ollama model library	Ollama lists Llama 3.1 8B, 70B, and 405B sizes. Verify the current tag before downloading.
Llama 3.1 70B	43 GB	128K	Ollama model library	A 70B Q4-class file is far beyond most single 24 GB GPUs once KV cache and reserve are included.
Qwen2.5 7B	4.7 GB	32K (table row)	Ollama model library	The Ollama table row shows 32K for the listed tag; the readme also mentions support up to 128K.
Qwen2.5 32B	20 GB	32K (table row)	Ollama model library	Use the listed file size as a model-file check, then add KV cache and runtime reserve.
Mistral 7B	4.4 GB	32K	Ollama model library	The Ollama Mistral row is a practical 7B comparison point for local testing.

Format	Effective bits	Benchmark anchor	Notes
Q4_K_M	about 4.5 bits per weight	5.12 tokens/sec	Llama 3.1 8B llama.cpp CPU benchmark, tg128 decode, dual Xeon Platinum 8488C.
Q8_0	8 bits per weight	5.03 tokens/sec	Same benchmark environment; use as an anchor, not a hardware promise.
FP16	16 bits per weight	2.83 tokens/sec	Unquantized 16-bit baseline in the same benchmark.
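To see what the effective-bits column means for memory, the table values can be plugged into the weight formula for an 8B model. A minimal sketch using the nominal 8 billion parameter count and the effective bits listed above; real files will differ somewhat.

```python
# Effective bits per weight, taken from the format table above.
EFFECTIVE_BITS = {"Q4_K_M": 4.5, "Q8_0": 8, "FP16": 16}


def weights_gb(params_billions, bits_per_weight):
    # billions of parameters x bits per weight / 8 bits per byte
    return params_billions * bits_per_weight / 8


weights_8b = {fmt: weights_gb(8, bits) for fmt, bits in EFFECTIVE_BITS.items()}
for fmt, gb in weights_8b.items():
    print(f"{fmt:<7} {gb:>5.1f} GB of weights")
```

By this estimate the Q8_0 weights alone already equal 8 GB, before any KV cache or runtime reserve is added.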

Primary sources

Model rows come from the Ollama model library pages for Llama 3.1, Qwen2.5, and Mistral. KV cache behavior is cross-checked against the Hugging Face Transformers KV cache documentation. Quantization notes use the Hugging Face bitsandbytes documentation and the 2026 llama.cpp quantization benchmark. Llama 3.1 architecture values are from Meta's Llama 3 Herd of Models paper.

FAQ

Why can a small model fail when the context window is large?

Model weights are not the only memory load. The KV cache stores key and value states for previous tokens, so memory use grows with context length, layer count, KV heads, and active sessions.
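The effect of active sessions can be sketched by multiplying the KV cache by the session count. This assumes each session holds its own full-length FP16 cache, which is a simplification; actual runtimes may share or allocate cache differently.

```python
def kv_cache_gb(context_tokens, sessions=1):
    # Llama 3.1 8B values from the worked example; one full cache per session (assumption).
    return sessions * 2 * 32 * 8 * 128 * context_tokens * 2 / 2**30


def total_gb(weights_gb, kv_gb, reserve_gb=1.5, safety_margin=0.10):
    return (weights_gb + kv_gb + reserve_gb) * (1 + safety_margin)


one_session = total_gb(4.5, kv_cache_gb(8_192, sessions=1))
two_sessions = total_gb(4.5, kv_cache_gb(8_192, sessions=2))
print(f"1 session:  {one_session:.1f} GB")   # 7.7 GB
print(f"2 sessions: {two_sessions:.1f} GB")  # 8.8 GB
```

Under that assumption, a second 8,192-token session is enough to push the default example past 8 GB.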

Does Q4 mean the whole model uses exactly 4 bits per parameter?

Not exactly. GGUF formats include metadata and some tensors may stay at higher precision. This checker uses about 4.5 effective bits per weight for Q4_K_M as a planning estimate, then shows the published Ollama file sizes as a reality check.
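The reality check can also be run in reverse, by asking how many bits per parameter a listed file size implies. The sketch below uses the nominal parameter counts from the model names (8B, 70B), which is an approximation, and assumes the listed Ollama files are Q4-class quantizations.

```python
def implied_bits_per_weight(file_size_gb, params_billions):
    # GB x 8 bits per byte / billions of parameters
    return file_size_gb * 8 / params_billions


llama_8b_bits = implied_bits_per_weight(4.9, 8)
llama_70b_bits = implied_bits_per_weight(43, 70)
print(f"Llama 3.1 8B:  {llama_8b_bits:.2f} bits per parameter")   # 4.90
print(f"Llama 3.1 70B: {llama_70b_bits:.2f} bits per parameter")  # 4.91
```

Both come out a little above the 4.5-bit planning figure, which is consistent with metadata and higher-precision tensors adding to the file.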

Can I run a model if it does not fully fit in VRAM?

Sometimes. A runtime may offload layers or KV cache to system RAM, but that usually lowers throughput and increases latency. Treat a spill result as a test-first path, not a comfortable setup.

Are the tokens/sec numbers guaranteed?

No. The speed number is anchored to a cited llama.cpp CPU benchmark for Llama 3.1 8B on a specific dual-Xeon system. Your GPU, CPU, memory bandwidth, runtime, drivers, and prompt shape can change real speed.

Should I choose Q4, Q8, or FP16?

Q4 is usually the first local test when memory is limited. Q8 uses more memory and can stay closer to the original model's output quality. FP16 is much larger and is mostly for hardware with generous memory or validation work.

Should I buy hardware from this result alone?

No. Use this as a screening tool. Before buying, test the exact model file, runtime, quantization, context length, and workload you plan to use.
