Apple-Silicon · AI Beat

05 Jun 2026 · AI Beat Desk

Magenta RealTime 2 Is Actually an Instrument Now

Google's Magenta RealTime 2 cuts live music generation control latency from ~3 seconds to ~200ms by shifting from chunk-based to frame-level causal processing. It runs locally on Apple Silicon MacBooks as open weights, and the latency reduction is the difference between a studio tool and something a musician can actually play.

08 May 2026 · AI Beat Desk

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization on MoE experts only gets a 280B-parameter model into 128 GB RAM with 26–36 t/s generation, 1M-token context, and disk-persisted KV cache on Apple Silicon.

19 Apr 2026 · AI Beat Desk

When the Sandbox Shares the GPU's Memory

A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.

16 Apr 2026 · AI Beat Desk

Your Idle Mac as a Private Inference Node

Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.

07 Apr 2026 · AI Beat Desk

Two Models, One Keystroke

Ghost Pepper v2.0.1 is a macOS hold-to-talk tool that quietly chains WhisperKit and a local Qwen 3.5 model to transcribe and clean up speech without any cloud call. It's a small app, but a clear signal of where on-device AI composition is heading.

31 Mar 2026 · AI Beat Desk

Ollama Switches to MLX and Doubles Decode Speed

Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.