On 30 March 2026, Ollama announced a preview build that runs on Apple Silicon using MLX, Apple’s machine learning framework, instead of the runtime it previously used on those Macs. The company frames it as the fastest way to run Ollama on Apple Silicon to date, with concrete prefill and decode figures on a named model setup, new NVFP4 quantization support intended to keep local inference closer to production pipelines, and a cache overhaul aimed at coding agents and branching chats. This article summarizes what shipped in the announcement, what hardware Ollama asks for in the preview, and the exact ollama launch / ollama run commands from the post.

Direct answer: The preview moves Ollama’s Apple Silicon runtime onto MLX to make better use of unified memory and, on M5 / M5 Pro / M5 Max, the new GPU Neural Accelerators, improving time to first token and tokens per second. Ollama published before/after numbers for 0.18 (Q4_K_M) vs 0.19 (NVFP4) on Qwen3.5-35B-A3B in its described harness, documents NVFP4 and cache upgrades, and tells users to run Ollama 0.19 on a Mac with more than 32 GB of unified memory for the highlighted coding model workflow.

What changed in the MLX preview

According to Ollama’s official blog post, the Apple Silicon build is now built on MLX so the runtime can lean on Apple’s unified memory architecture. Ollama does not present this as a cosmetic version bump: it claims a large speedup across all Apple Silicon devices, with additional uplift on M5, M5 Pro, and M5 Max, where the stack can use GPU Neural Accelerators for both prefill (prompt processing) and decode (token generation).

The post explicitly calls out workload classes that benefit:

  • Personal assistants such as OpenClaw.
  • Coding agents such as Claude Code, OpenCode, or Codex-class flows (the page also mentions Pi in the same breath as Claude Code for agent acceleration).

If you are already experimenting with mlx-lm outside Ollama, our earlier walkthrough on Apple Silicon RAM reality remains a useful complement: Run DeepSeek-V4 on Apple Silicon with mlx-lm.

Official demo videos (coding agent and OpenClaw)

Ollama hosts two screen recordings on files.ollama.com that illustrate the MLX preview in real workflows. The clips are .mov files served from Ollama’s CDN; playback works best in Safari and recent Chrome on macOS, and other browsers may need the file opened locally.

Coding agent — Ollama’s demo of a coding-agent style flow accelerated under the MLX-backed Apple Silicon preview (see the MLX blog post for context).
OpenClaw — Ollama’s demo of the assistant-style workload the post calls out alongside coding agents.

Published performance claims (prefill and decode)

Ollama states tests were run on 29 March 2026 using Alibaba’s Qwen3.5-35B-A3B model, with the new path using NVFP4 quantization and the previous implementation on Ollama 0.18 using Q4_K_M. The blog charts compare Ollama 0.19 against 0.18 for that scenario.

Metric (blog chart)    Ollama 0.18    Ollama 0.19 (preview)
Prefill (tokens/s)     1154           1810
Decode (tokens/s)      58             112

Figure: side-by-side bar charts of prefill and decode tokens per second for Ollama 0.19 vs 0.18 in Ollama’s published Qwen3.5-35B-A3B harness (see ollama.com/blog/mlx).

The same post adds that Ollama 0.19 can reach even higher numbers in a cited configuration: 1851 tokens/s prefill and 134 tokens/s decode when running with int4 quantization. Treat these as vendor-reported, model-specific, and harness-specific figures—your machine, OS build, and workload mix will differ.
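To put the charted figures in perspective, the short sketch below turns them into speedup ratios and into rough prefill times. The throughput values are the ones from the blog chart; the 8,000-token prompt is a hypothetical workload chosen only for illustration, and the calculation assumes a steady rate, which real runs will not hold exactly.

```python
# Tokens per second, transcribed from the blog chart.
PUBLISHED = {
    "prefill": {"0.18": 1154, "0.19": 1810},
    "decode": {"0.18": 58, "0.19": 112},
}


def speedup(old_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new_tps / old_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time to process `tokens` at a steady rate of `tps` tokens per second."""
    return tokens / tps


if __name__ == "__main__":
    for phase, tps in PUBLISHED.items():
        print(f"{phase}: {speedup(tps['0.18'], tps['0.19']):.2f}x")

    # Hypothetical workload (not from the post): an 8,000-token prompt.
    prompt_tokens = 8_000
    for version, tps in PUBLISHED["prefill"].items():
        print(f"Ollama {version}: {seconds_for(prompt_tokens, tps):.1f} s of prefill")
```

On the published numbers this works out to roughly 1.57x for prefill and 1.93x for decode, so the larger gain in Ollama’s own harness is on the token generation side.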

NVFP4 support and why it matters

Ollama writes that it now leverages NVIDIA’s NVFP4 format to keep model accuracy while cutting memory bandwidth and storage needs for inference. The positioning is explicit: as more cloud inference providers standardize on NVFP4, local Ollama runs can stay closer to production parity in numeric behavior. The post also notes other precisions will follow based on research and hardware partner intent.
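For readers who want an intuition for what a 4-bit floating-point format does to weights, here is a deliberately simplified sketch. It assumes the NVFP4 layout as NVIDIA describes it publicly: 4-bit E2M1 values grouped in blocks of 16, each block sharing a scale. The real format stores that scale in FP8 and adds a per-tensor scale; this toy version keeps the scale as an ordinary Python float, and it is not the code Ollama or MLX runs.

```python
# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # values that share one scale


def quantize_block(values):
    """Return (scale, codes) where each code is a signed E2M1 magnitude."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate original values from the scale and codes."""
    return [scale * c for c in codes]
```

The design point to notice is that only 15 distinct signed values exist per element, so accuracy depends on the per-block scale tracking the local range of the weights.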

Cache upgrades for agents and long sessions

Beyond raw throughput, Ollama highlights three cache behaviors:

  • Lower memory utilization: cache reuse across conversations, which should raise cache hits when you branch with a shared system prompt (the post names tool-heavy flows like Claude Code).
  • Intelligent checkpoints: snapshots of cache state at “intelligent” points in the prompt to cut repeated prompt processing.
  • Smarter eviction: shared prefixes survive longer when older branches are dropped.

For day-to-day agent ergonomics, latency often lives in prefix recompute and context churn—these changes target that class of friction rather than only peak tok/s marketing.
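The idea behind prefix reuse can be shown with a toy model. The class below is a hypothetical illustration, not Ollama’s cache: it remembers token sequences that have already been processed and reports how many tokens of a new request would still need prefill when a shared prefix (such as a system prompt) is already cached.

```python
class PrefixCache:
    """Toy prefix cache: remembers token sequences that were already processed."""

    def __init__(self):
        self._sequences = []

    def add(self, tokens):
        """Record a processed token sequence."""
        self._sequences.append(tuple(tokens))

    def reusable_prefix_len(self, tokens):
        """Length of the longest prefix shared with any cached sequence."""
        best = 0
        for cached in self._sequences:
            shared = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                shared += 1
            best = max(best, shared)
        return best

    def tokens_to_prefill(self, tokens):
        """Tokens that still need prompt processing after prefix reuse."""
        return len(tokens) - self.reusable_prefix_len(tokens)


if __name__ == "__main__":
    system_prompt = [1, 2, 3, 4]  # stand-in token IDs
    cache = PrefixCache()
    cache.add(system_prompt + [10, 11])  # first conversation branch
    # A second branch shares the system prompt and differs afterwards.
    print(cache.tokens_to_prefill(system_prompt + [20]))
```

A production cache also has to decide what to evict and where to checkpoint, which is where the post’s “intelligent checkpoints” and “smarter eviction” claims come in; this sketch covers only the reuse lookup.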

How to try it: download, RAM bar, and commands

The blog’s call to action is straightforward: install Ollama 0.19 from the official download path, and use the coding-tuned Qwen3.5-35B-A3B variant whose tag includes nvfp4.

Hardware note from Ollama: use a Mac with more than 32 GB of unified memory for this preview workflow with the highlighted model.

Claude Code:

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

OpenClaw:

ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4

Interactive chat:

ollama run qwen3.5:35b-a3b-coding-nvfp4

Preview releases can change defaults, model tags, and CLI flags. If a command fails after an update, re-check the same blog post and the release notes for your installed build.
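If you want to compare your own machine against the published figures, one option is to read the timing fields that Ollama’s local REST API returns. The sketch below assumes a server on the default port 11434 and a non-streaming /api/generate response that includes prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration (durations in nanoseconds); some responses may omit the prompt fields, for example when the prompt was served from cache, in which case throughput would raise a KeyError.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def throughput(response: dict) -> dict:
    """Convert Ollama timing fields (nanosecond durations) to tokens per second."""
    ns_per_second = 1e9
    prefill_seconds = response["prompt_eval_duration"] / ns_per_second
    decode_seconds = response["eval_duration"] / ns_per_second
    return {
        "prefill_tps": response["prompt_eval_count"] / prefill_seconds,
        "decode_tps": response["eval_count"] / decode_seconds,
    }


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running and the model pulled, calling throughput(generate("qwen3.5:35b-a3b-coding-nvfp4", "Explain KV caching in two sentences.")) would return both rates for that single request. A short prompt understates prefill throughput, so use a prompt of realistic length before drawing conclusions.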

Future models and imports

Ollama states it is actively working to support future models under the new path, and plans an easier import path for user fine-tuned models on supported architectures. Until then, expect the supported architecture list to expand iteratively rather than overnight.

Sources

Ollama’s MLX announcement: ollama.com/blog/mlx. Figures and model names above are transcribed from that post as of its publication date; always verify against the live article before citing in downstream specs or RFPs.

Frequently asked questions

Does MLX replace Ollama everywhere, or only on Mac?

The announcement is scoped to Ollama on Apple Silicon in preview. It does not state that Linux or Windows builds now route through MLX.

Will my M1 or M2 Mac see the same 1810 tok/s prefill?

Ollama claims a broad Apple Silicon speedup, but the charted numbers in the post come from a specific model and quantization setup compared across 0.18 and 0.19. Expect different absolute numbers on other chips and memory tiers.

Why does Ollama ask for more than 32 GB unified memory?

The blog sets that bar for the preview workflow with Qwen3.5-35B-A3B at the highlighted precision—roughly the same class of constraint you would expect for a 35B-class model with headroom for KV cache and desktop multitasking.
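A back-of-the-envelope estimate shows why that bar is plausible. The sketch assumes 35 billion parameters (read off the model name), every weight stored in NVFP4, and the publicly described NVFP4 layout of 4-bit values plus one 8-bit scale per 16-value block. Real checkpoints usually keep some tensors at higher precision, so treat the result as a lower bound rather than a measured footprint.

```python
# Assumed NVFP4 layout: 4-bit values plus one 8-bit scale shared by 16 values.
NVFP4_BITS_PER_WEIGHT = 4 + 8 / 16  # 4.5 bits


def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    weights_gb = weight_gigabytes(35, NVFP4_BITS_PER_WEIGHT)
    print(f"weights: about {weights_gb:.1f} GB")
    # Whatever remains on a 32 GB machine must cover KV cache, macOS, and apps.
    print(f"left on a 32 GB Mac: about {32 - weights_gb:.1f} GB")
```

Under these assumptions the weights alone take about 19.7 GB, leaving roughly 12 GB on a 32 GB machine for everything else, which is tight once a long agent context and a normal desktop session are added.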

Is NVFP4 only relevant if I use NVIDIA hardware locally?

Here NVFP4 is framed as a quantization format for inference that aligns with wider provider-side adoption—not as a requirement that your Mac contain an NVIDIA GPU. The post positions it around accuracy and bandwidth tradeoffs on Apple Silicon in this preview.