When Politeness Beats Accuracy (and Why That’s a Problem)
Recently I ran into an interesting limitation while experimenting with Google Gemini. The task itself was trivial: count the number of items in a JSON collection returned from an API call. Normally I would write a quick script, but I wanted to take the opportunity to explore how well Gemini could handle simple structured data.
At first, Gemini returned a number that seemed lower than expected. I asked if the count was correct. Instead of confirming or explaining uncertainty, Gemini returned a larger number. That immediately raised a red flag. I asked it to check again. The number increased once more.
Now things were clearly off.
To tighten the request, I asked Gemini to return both the count and the list of identifiers for the items. Each time I asked it to verify the result, the number continued to increase. That at least gave me something concrete to examine. I looked at the last item from the original JSON list and noticed that its identifier did not appear in Gemini’s supposedly "complete" list.
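A spot check like that is easy to automate. Here is a minimal sketch of the comparison I did by eye, assuming each item carries an `id` field (a guess on my part; the real schema may differ):

```python
def missing_ids(source_items, reported_ids):
    """Return identifiers present in the source data but absent from
    the model's supposedly 'complete' list."""
    source_ids = {item["id"] for item in source_items}
    return source_ids - set(reported_ids)

# Tiny stand-in data, not the real API response.
source = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
reported = ["a1", "b2"]
print(missing_ids(source, reported))  # {'c3'}
```

One missing identifier is enough to prove the reported list is incomplete, whatever the count says.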
I pointed this out. Gemini apologized. It recalculated. The answer changed again.
We repeated this process several times. Each response was polite and apologetic, but also confidently incorrect, accompanied by reassurance that the results were now accurate.
Eventually I asked Gemini how it was actually processing the data. The answer confirmed my suspicion: it was only handling the input in small batches and struggling to maintain state across them. I suggested a more reliable approach: iterate through the batches with a cursor and maintain a running total.
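The approach I suggested can be sketched in a few lines. This is a hypothetical illustration of cursor-based batching with an explicit running total, not the actual mechanism Gemini described internally:

```python
def count_with_cursor(items, batch_size=20):
    """Walk the collection in fixed-size batches, carrying an explicit
    running total instead of trying to remember state between batches."""
    total = 0
    cursor = 0
    while cursor < len(items):
        batch = items[cursor:cursor + batch_size]
        total += len(batch)
        cursor += batch_size
    return total

print(count_with_cursor(list(range(61))))  # 61
```

The point is that the state (the cursor and the total) lives outside each batch, so no batch has to trust a remembered summary of the previous ones.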
Gemini produced a final count of 127.
Out of curiosity, I wrote a quick Python snippet to parse the JSON locally. The correct count was 61.
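My snippet amounted to loading the JSON and taking the length. Something along these lines, where the payload is a stand-in for the real API response:

```python
import json

def count_items(raw_json: str) -> int:
    """Count the items in a top-level JSON array."""
    data = json.loads(raw_json)
    return len(data)

# Stand-in payload; the actual response held 61 items.
payload = '[{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]'
print(count_items(payload))  # 3
```

Three lines of real parsing settled what several rounds of polite conversation could not.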
When I pointed that out, Gemini apologized again and accepted the correction.
What stood out most in this interaction was the role of sycophancy. The model was extremely polite, but its tone created the illusion of confidence even while the results kept changing. Apologies don’t increase trust when the answers remain wrong.
The real lesson for me was that interacting with LLMs on tasks like this calls for a different posture. Instead of conversational niceties, I want raw accuracy and explicit limitations. A “fail-fast” mode, where uncertainty, assumptions, and processing constraints are surfaced immediately, would be far more valuable.
As LLMs continue evolving, it will be interesting to see whether they develop different conversational modes: one optimized for friendly interaction, and another optimized for precise, transparent computation.
Because sometimes, politeness is the opposite of helpfulness.