Close Menu
geekfence.comgeekfence.com
    What's Hot

    The Download: the future of chipmaking and Anthropic’s government clash

    June 23, 2026

    Comarch User Group 2026: Navigating the 2% Growth Trap with Agentic AI and Composable Architecture

    June 23, 2026

    Clustering Unstructured Text with LLM Embeddings and HDBSCAN

    June 23, 2026
    Facebook X (Twitter) Instagram
    • About Us
    • Contact Us
    Facebook Instagram
    geekfence.comgeekfence.com
    • Home
    • UK Tech News
    • AI
    • Big Data
    • Cyber Security
      • Cloud Computing
      • iOS Development
    • IoT
    • Mobile
    • Software
      • Software Development
      • Software Engineering
    • Technology
      • Green Technology
      • Nanotechnology
    • Telecom
    geekfence.comgeekfence.com
    Home»iOS Development»Responses Bug in LM Studio
    iOS Development

    Responses Bug in LM Studio

    AdminBy AdminJune 23, 2026No Comments9 Mins Read2 Views
    Facebook Twitter Pinterest LinkedIn Telegram Tumblr Email
    Responses Bug in LM Studio
    Share
    Facebook Twitter LinkedIn Pinterest Email


    It started, as these things do, with a shortcut I was certain would work.

    I’ve been building SwiftAgents, my Swift framework for talking to language models, and one of the local providers it supports is LM Studio — the app a lot of us reach for to run models on our own Macs. LM Studio recently grew support for the newer “Responses” API, the OpenAI-style endpoint that can remember a conversation for you. Instead of re-sending the whole chat history on every turn, you send only the new message plus a little breadcrumb — previous_response_id — that tells the server “you already remember the rest.” Less data over the wire, less bookkeeping on the client. An obvious win, and I wanted it in SwiftAgents.

    Before wiring it in for good, I asked Claude Code to benchmark it. Ten turns of the same little conversation, run two ways: once with the new chaining trick, and once the old-fashioned way where you resend the entire history every single time. I just wanted to confirm the clever path was faster before committing to it.

    The numbers came back backwards.

    When the shortcut is the long way

    Here is what the benchmark found, running a small Qwen3 model inside LM Studio. The left column is the “optimization” — chaining with previous_response_id, sending only the new message each turn. The right column is the brute-force approach — resending the entire conversation, every time, like a caveman.

    The number shown is how many input tokens the server actually had to process on that turn:

    Turn Chaining (only the new message sent) Full resend (whole history every time)
    1 26 26
    2 48 48
    3 98 69
    4 206 95
    5 415 120
    6 829 141
    7 1,669 169
    8 3,338 191
    9 6,677 211
    10 13,364 238

    Read it twice, because I had to. The wasteful approach — resending everything — keeps the workload flat, around 240 tokens by turn ten. The clever approach, where I send almost nothing, somehow makes the server grind through thirteen thousand.

    And look at the shape of that left column: 26, 48, 98, 206, 415, 829… it doubles every turn. A textbook geometric balloon. Whatever the server does internally when it “remembers” the conversation for you, it rebuilds the whole thing roughly twice as large each time. Since the model has to read all of those tokens before it can say a word, the wait balloons right along with the token count. By turn ten a single reply took 28 seconds with chaining, against 3 seconds without.

    The optimization was, comfortably, the slowest possible way to hold the conversation.

    Making sure it wasn’t just me

    A result that silly deserves suspicion, so the next step was to check whether I’d misconfigured something or stumbled onto one bad model. The first idea was to run the benchmark against official GPT 5.5 – and there the caching behaved exactly as you’d expect. Then I asked Claude Code to run the same probe across a number of LLMs I had previously downloaded.

    The balloon showed up every single time — small models and large, old architectures and brand-new ones, the plain ones and the fancy “reasoning” ones, and even a mixture-of-experts model. Same fingerprint each time: the chained path doubles every turn, the full-resend path stays flat.

    A few of the more memorable data points:

    • gpt-oss (a 20-billion-parameter mixture-of-experts model): ballooned to 16,833 tokens by turn ten — for a conversation that was genuinely 283 tokens long. That’s a 59× tax. The lovely irony here is that this model barely “thinks” out loud at all, yet it scored the worst blowup of the lot, which told us the bug has nothing to do with how much the model generates and everything to do with how the server rebuilds the history.
    • A 12-billion Gemma model: by turn ten, a single reply took 37.6 seconds instead of the ~2.6 seconds the same conversation needed over the plain chat endpoint.

    Importantly, this isn’t the Responses API being a bad idea, and it isn’t LM Studio being bad software — its ordinary chat endpoint is quick and caches beautifully. It’s one specific feature, the server-side conversation reconstruction behind previous_response_id, that misbehaves. I know it’s specific to LM Studio because the obvious points of comparison don’t do it: OpenAI’s own servers keep the token count equal to the real conversation, and Ollama — which simply declines to be stateful — keeps it flat too. Only LM Studio’s reconstruction inflates.

    So rather than ship a feature that makes things slower, I did the boring, correct thing in SwiftAgents: on LM Studio it resends the full history and skips the chaining entirely. And I wrote the whole thing up, with a runnable reproduction script, as a bug report on LM Studio’s tracker. Sometimes the deliverable is a paper trail.

    A side quest: the app I loved versus the one I didn’t

    Somewhere in the middle of all this benchmarking, a different question crept in.

    I’ve always preferred LM Studio. It’s the better-looking app, it feels more modern, and — the reason that actually mattered to me — it supported MLX, Apple’s on-device machine-learning framework, long before Ollama did. On Apple Silicon, MLX is the fast path, so for a good while LM Studio was simply the quicker way to run a model on a Mac. Ollama was the command-line workhorse I respected but didn’t reach for.

    While poking at Gemma 4, I noticed Ollama had quietly closed that gap — it now runs the same modern, accelerated model formats I’d switched to LM Studio for in the first place. Which meant, for the first time, I could put the two of them on a truly level playing field: the same model, in the same quantization, and just race them.

    So I did. Here’s Gemma-4-E4B, identical nvfp4 build on both:

    Ollama LM Studio
    Reading your prompt (prompt processing) 910 tok/s 445 tok/s
    Writing the answer (generation) 62.7 tok/s 51.7 tok/s
    Time until the first word appears 72 ms 121 ms
    Re-reading a 1,780-token prompt it just saw (warm cache) 65 ms 657 ms

    Ollama wins every row. It reads prompts twice as fast, generates noticeably quicker, starts answering sooner, and — the one that surprised me most — reuses its cache about ten times more cheaply. Ask it to re-read a prompt it just processed and it’s done in 65 milliseconds; LM Studio takes the better part of a second to do the same thing.

    I want to be fair, because there’s an honest caveat buried in here. The first time I raced them I had LM Studio on MLX and Ollama on the older format, and in that mismatched setup LM Studio’s generation looked faster. It was a trap — I was comparing the fast format against the slow one. The moment I matched them quant-for-quant, the apparent win evaporated and Ollama pulled ahead on everything. So I won’t claim Ollama is universally faster at everything for everyone; I’ll claim the thing my data actually supports, which is that on the same model in the same format, Ollama came out ahead everywhere I looked.

    That’s a slightly uncomfortable conclusion for me, given how much I liked the other app. But the stopwatch doesn’t care what’s prettier.

    The part I keep thinking about

    Here’s the bit that genuinely tickles me, and it’s not really about tokens at all.

    I didn’t write any of these benchmarks. I described what I wanted to know — “load a model, run ten turns each way, track the response time” — and Claude Code wrote the Python, ran it and computed all the statistics. When it needed a model that wasn’t loaded, it drove LM Studio’s command-line tool to load it, checked the API to confirm it was really resident, and benchmarked it.

    At one point it quoted a generation speed that looked too good, paused, decided the measurement window had been too short to trust, rewrote the benchmark to generate a longer sample, and re-ran it to get an honest number. It even filed the bug report on my behalf. You can see how additional info was added as comments as I was discovering more data.

    At the same time my agentic CI loop was ticking as well on the SwiftAgents PR. When the pull request’s continuous-integration build went red on Linux — because a type I’d used lives in a different module off the Mac — it diagnosed the failure, reached for my own SwiftCross shim to fix it, pushed, watched the build, found a second spot with the same problem, fixed that too, and waited with me until all six platforms went green. I mostly watched.

    A few months ago, writing a benchmark harness by hand would have been too much work for me. So I wouldn’t have done this research, but I would have just complained on Twitter about another problem in somebody else’s code. And I would have been frustrated that I couldn’t do anything about it. In this new reality agents do the research, the write-up and the filing of the issue. The ball is now in LM Studio’s court. This new reality still feels faintly like cheating.

    I put the benchmarking scripts in gist for reference.

    What I changed

    Two things came out of an afternoon that was only ever meant to confirm a one-line optimization.

    SwiftAgents now does the sensible thing on LM Studio: it resends the full conversation and leaves previous_response_id chaining well alone until the underlying balloon is fixed. The “optimization” stays on the shelf.

    And on my own machine, my default has quietly shifted from the app I liked to the one that’s faster. I still think LM Studio is the nicer thing to look at. But I’ve been doing this long enough to know that when the numbers are that consistent, you go where the numbers point — even when they point somewhere you didn’t expect, and even when an AI is the one holding the stopwatch.

    Do you use any local inferencing? If so, which do you prefer?

    Like this:

    Like Loading…

    Related


    Categories: Bug Reports



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

    Related Posts

    ios – How to make headerView with overlapping content design in SwiftUI

    June 22, 2026

    Kits All the Way Down

    June 18, 2026

    ios – UITabBarController on iPadOS 18 swallows all touches even with mode = .tabBar (via Python/rubicon-objc)

    June 17, 2026

    ios – Centered ScrollView content doesn’t return to position after pull-to-refresh with a large navigation title in SwiftUI

    June 12, 2026

    Introducing SwiftBash | Cocoanetics

    June 8, 2026

    ios – SwiftUI Map View freezes when there is no network

    June 7, 2026
    Top Posts

    Understanding U-Net Architecture in Deep Learning

    November 25, 202555 Views

    Hard-braking events as indicators of road segment crash risk

    January 14, 202630 Views

    Redefining AI efficiency with extreme compression

    March 25, 202627 Views
    Don't Miss

    The Download: the future of chipmaking and Anthropic’s government clash

    June 23, 2026

    This story is from The Algorithm, our weekly newsletter giving you the inside track on…

    Comarch User Group 2026: Navigating the 2% Growth Trap with Agentic AI and Composable Architecture

    June 23, 2026

    Clustering Unstructured Text with LLM Embeddings and HDBSCAN

    June 23, 2026

    New Data Analytics Breakthroughs Give Ecommerce Startups a Fighting Chance

    June 23, 2026
    Stay In Touch
    • Facebook
    • Instagram
    About Us

    At GeekFence, we are a team of tech-enthusiasts, industry watchers and content creators who believe that technology isn’t just about gadgets—it’s about how innovation transforms our lives, work and society. We’ve come together to build a place where readers, thinkers and industry insiders can converge to explore what’s next in tech.

    Our Picks

    The Download: the future of chipmaking and Anthropic’s government clash

    June 23, 2026

    Comarch User Group 2026: Navigating the 2% Growth Trap with Agentic AI and Composable Architecture

    June 23, 2026

    Subscribe to Updates

    Please enable JavaScript in your browser to complete this form.
    Loading
    • About Us
    • Contact Us
    • Disclaimer
    • Privacy Policy
    • Terms and Conditions
    © 2026 Geekfence.All Rigt Reserved.

    Type above and press Enter to search. Press Esc to cancel.