An article from Skylar Payne made the rounds on Hacker News today: “If DSPy is So Great, Why Isn’t Anyone Using It?” The headline is deliberately provocative, but the observation underneath it is real. DSPy has 4.7 million monthly downloads. LangChain has 222 million. The gap is hard to explain by technical merit alone — most engineers who actually use DSPy report it solves the problems they were fighting — and Payne’s argument for why the gap exists is worth sitting with.

His thesis: teams don’t consciously decide not to use DSPy. They start by writing a few API calls directly. Then they move prompts out of code. Then they add structured outputs. Then retry logic, then RAG, then evaluation infrastructure, then model abstraction. By the time they’ve reinvented most of this wheel, they’ve got a bespoke system that does roughly what DSPy does, only worse: accumulated as technical debt rather than intentional design. The pattern is consistent enough that Payne diagrams it as seven stages of evolution, capped with a riff on Greenspun’s tenth rule: “any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.”
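The stages are recognizable because each one is a small, locally reasonable patch. A stdlib-only sketch of what the middle stages tend to look like — the model call is stubbed, and all names here are illustrative, not taken from Payne’s article:

```python
import json
import time

# Stage "prompts out of code": the template lives in a constant now.
PROMPT = "Extract the product name and price as JSON: {text}"

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return '{"name": "widget", "price": 9.99}'

def extract(text: str, retries: int = 3) -> dict:
    """Hand-rolled structured output + retry logic -- two more of the stages."""
    last_err = None
    for attempt in range(retries):
        raw = call_model(PROMPT.format(text=text))
        try:
            return json.loads(raw)  # hope the model emitted valid JSON
        except json.JSONDecodeError as err:
            last_err = err
            time.sleep(2 ** attempt)  # ad hoc exponential backoff
    raise RuntimeError(f"model never produced valid JSON: {last_err}")
```

Every piece is defensible on its own; the debt is the accumulation, because nothing here composes with the RAG layer or the eval harness that arrive in later stages.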

Whether you adopt DSPy itself is almost beside the point of the article. The more interesting observation is about the adoption curve for frameworks that require you to think before you build. LangChain dominates partly because it matches the instinct to grab a tool and start: it has wrappers for everything, and you can be productive fast. DSPy’s approach requires designing typed I/O signatures and composable modules upfront. That’s better engineering, but “better engineering” is often a harder sell than “works immediately.” LangChain’s 222M downloads include a lot of prototypes; DSPy’s 4.7M probably skew toward teams who have already hit the wall and decided to clean up.
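The “design upfront” step is small in code terms. DSPy expresses it with `Signature` classes whose input and output fields are declared explicitly; the sketch below is a rough stdlib analogue of that idea using dataclasses, not DSPy’s actual API, with a placeholder implementation where a real module would call a model:

```python
from dataclasses import dataclass

# A typed contract declared before any prompt is written: inputs and
# outputs are explicit fields, so composition and evaluation can target
# the fields rather than raw strings.
@dataclass
class SummarizeInput:
    document: str

@dataclass
class SummarizeOutput:
    summary: str
    bullet_points: list[str]

def summarize(inp: SummarizeInput) -> SummarizeOutput:
    """Placeholder: a real module would call a model and parse the
    response into SummarizeOutput, failing loudly if fields are missing."""
    first_sentence = inp.document.split(".")[0].strip()
    return SummarizeOutput(summary=first_sentence, bullet_points=[first_sentence])
```

The point is not the dataclasses; it is that downstream code depends on `SummarizeOutput`, not on the wording of a prompt, which is exactly the coupling the ad hoc path never breaks.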

The download numbers made me wonder something else: how do these systems actually perform once they meet real users at scale? One answer came from a separate story this week. Walmart ran OpenAI’s Instant Checkout feature across roughly 200,000 products starting in November 2025, and found that purchases completed inside ChatGPT converted at one-third the rate of purchases completed on Walmart.com. Their EVP of Product, Daniel Danker, called the in-chat experience “unsatisfying.” They’re discontinuing it.

This isn’t surprising in retrospect. Checkout is one of the most carefully optimized UX flows in e-commerce — every friction point has been identified and minimized over years of testing. Replacing that with a chat interface introduces cognitive load even if the AI works correctly. The AI doesn’t need to fail to hurt conversion; it just needs to be unfamiliar enough to create hesitation. Walmart is now building Sparky, their own chatbot, into the ChatGPT integration instead — letting users complete purchases within Walmart’s own ecosystem where the UX investment is preserved.

OpenAI is also broadly phasing out Instant Checkout in favor of merchant-controlled checkout flows. That pivot is telling: the bet that AI could own the entire purchase experience didn’t survive contact with conversion data.

Meanwhile, on the technical edge of what’s now possible: @anemll demonstrated Flash-MoE running on an iPhone 17 Pro today — the same SSD-streaming approach used in the MacBook Flash-MoE implementation that was on HN yesterday, but on a phone with 12GB of RAM. It works. It runs at 0.6 tokens per second. That’s roughly one word every two seconds, which is less “on-device AI assistant” and more “on-device AI experiment.” But a 397B-parameter model loading from an iPhone’s NVMe in 2026 is a data point worth having: the hardware floor for this technique keeps dropping.
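The “one word every two seconds” conversion is easy to check, assuming the common rule of thumb of roughly 0.75 English words per token (an approximation, not a figure from the demo):

```python
tokens_per_second = 0.6   # reported Flash-MoE throughput on the iPhone 17 Pro
words_per_token = 0.75    # rough rule of thumb for English text

words_per_second = tokens_per_second * words_per_token  # 0.45
seconds_per_word = 1 / words_per_second                 # ~2.2

print(f"{seconds_per_word:.1f} seconds per word")  # → 2.2 seconds per word
```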

The thread connecting all of this is the gap between technical feasibility and practical value — the same gap that makes DSPy-style discipline hard to justify early in a project, that made Walmart’s in-chat checkout seem worth trying, and that makes 0.6 tok/s on a phone impressive in one sense and useless in another. Demos optimize for the first sense. Production systems have to deal with the second.