Salvatore Sanfilippo (antirez, creator of Redis) released ds4, a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization, applied only to the MoE experts, fits the 280B-parameter model into 128 GB of RAM on Apple Silicon, with 26–36 tokens/s generation, a 1M-token context, and a disk-persisted KV cache.
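The ds4 summary does not say which 2-bit scheme is used. One common meaning of "asymmetric" in quantization is affine quantization, where each group of weights stores a minimum and a scale instead of assuming values are centered on zero. The sketch below illustrates that generic idea only; the function names and the group of sample weights are hypothetical, and ds4's actual format may differ.

```python
def quantize_2bit(weights):
    """Affine (asymmetric) 2-bit quantization of one group of floats.

    Generic textbook scheme, NOT ds4's actual format (assumption).
    Returns integer codes in 0..3 plus the scale and minimum needed to decode.
    """
    lo, hi = min(weights), max(weights)
    levels = 3  # 2 bits give codes 0, 1, 2, 3
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_2bit(codes, scale, lo):
    """Map 2-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def pack_codes(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)
        out.append(byte)
    return bytes(out)


# Hypothetical group of eight weights, skewed toward positive values.
group = [-0.9, -0.1, 0.0, 0.2, 0.35, 0.6, 1.1, 1.5]
codes, scale, lo = quantize_2bit(group)
packed = pack_codes(codes)
restored = dequantize_2bit(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Eight weights pack into two bytes, and the reconstruction error of each weight is bounded by half the scale. Storing a per-group minimum is what lets the four available levels cover a skewed range such as the one above.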
A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.
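The difference between the copy approach and the shared-buffer approach can be illustrated with a loose Python analogy. This is not the Wasmtime or Metal API: a `bytearray` stands in for Wasm linear memory, `bytes(...)` plays the role of the copy, and a `memoryview` plays the role of a buffer that aliases the same storage. The buffer size and all names are assumptions chosen for illustration.

```python
import sys

# Stand-in for Wasm linear memory (16 MiB, an arbitrary illustrative size).
linear_memory = bytearray(16 * 1024 * 1024)

# Copy approach: the "GPU side" gets its own duplicate of every byte.
copied = bytes(linear_memory)

# Shared approach: the view aliases the same storage, so no bytes are duplicated.
shared = memoryview(linear_memory)

# A write on the "Wasm side" after both were created...
linear_memory[0] = 42

# ...is visible through the shared view but not through the copy.
visible_in_shared = shared[0] == 42
visible_in_copy = copied[0] == 42

# The view object itself is tiny compared with the duplicated buffer.
view_overhead_bytes = sys.getsizeof(shared)
copy_overhead_bytes = len(copied)
```

The view costs a small fixed amount of memory regardless of buffer size, while the copy costs as much as the buffer itself and goes stale as soon as the original changes.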
Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.
Ghost Pepper v2.0.1 is a macOS hold-to-talk tool that quietly chains WhisperKit and a local Qwen 3.5 model to transcribe and clean up speech without any cloud call. It's a small app, but a clear signal of where on-device AI composition is heading.
Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.
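The summary does not describe how Ollama implements prefix-aware eviction. A minimal sketch of the general idea, under the assumption that an entry is protected while some other cached entry extends it, might look like the following. The class name `PrefixCache`, its methods, and the token values are all hypothetical.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy KV-cache index keyed by token prefixes (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Keys are token tuples, ordered from least to most recently used.
        self.entries = OrderedDict()

    def _is_shared_prefix(self, key):
        """True if another cached entry starts with `key` and is longer."""
        n = len(key)
        return any(len(k) > n and k[:n] == key for k in self.entries)

    def longest_prefix(self, tokens):
        """Return the longest cached key that prefixes `tokens`, or None."""
        tokens = tuple(tokens)
        matches = [k for k in self.entries if tokens[:len(k)] == k]
        if not matches:
            return None
        best = max(matches, key=len)
        self.entries.move_to_end(best)  # mark as recently used
        return best

    def insert(self, tokens, state=None):
        key = tuple(tokens)
        self.entries[key] = state
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        """Evict the least recently used entry that nothing else extends."""
        victim = next(
            (k for k in self.entries if not self._is_shared_prefix(k)),
            None,
        )
        if victim is None:  # fallback: plain LRU
            victim = next(iter(self.entries))
        del self.entries[victim]


system = (1, 2, 3)                 # shared system prompt tokens
cache = PrefixCache(capacity=3)
cache.insert(system)
cache.insert(system + (10,))       # conversation A
cache.insert(system + (20,))       # conversation B
cache.insert(system + (30,))       # conversation C, forces one eviction
reused = cache.longest_prefix(system + (40, 41))
```

Plain LRU would have evicted the system prompt here because it is the oldest entry. This policy evicts conversation A instead, so a new conversation can still reuse the cached system prompt.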