The model everyone is praising is worse at Shai-Hulud themed malware triage than people admit

The model everyone is praising is worse at Shai-Hulud themed malware triage than people admit, especially when the task stops being generic summarization and starts looking like messy incident work. I ran a small, boringly practical test to check that claim, and the results were less glamorous than the hype. This wasn’t about vibes. It was about pinned versions, retry counts, and whether a model could keep its story straight across 50 prompts.

Day 1: The first alert looked like plant damage, then turned into malware

On March 18, 2026, I was looking at a peace lily, Spathiphyllum wallisii, that had brown tips and a weirdly uneven droop. It looked like the usual overwatering story. The pot was 18 cm wide, the room sat at 22°C, and the leaves had that tired, matte look you get after a bad week on a windowsill.

That’s the opening move of a lot of bad comparisons, too. They look like one thing and turn out to be another. In this case, the “shai-hulud themed malware found in the” chatter, truncated label and all, started surfacing in the same incident notes I was reviewing, and the naming was the least interesting part. The real question was whether a model could separate signal from costume.

Side note: I’ve learned not to trust anything that sounds cinematic on first pass. Malware names can be memorable without being helpful.

What I saw in the first pass

The first run used Claude Sonnet 4.6, pinned to that exact version rather than a “latest” alias, and I fed it the same incident brief I gave GPT-5 (Nov 2025 release). The brief was 1,240 words, with log snippets, a file tree, and one deliberately ambiguous reference to “shai-hulud themed malware found in the” to see whether the model would overfit the label.

Sonnet 4.6 stayed calmer. GPT-5 was faster at the first draft, but it confidently collapsed a few details that should’ve stayed separate. That’s not a small flaw in incident work. It’s the whole job.

Day 2: Pinning versions and setting the test

I didn’t want a vibes contest, so I set up three sessions across two days. Session one: 20 prompts on March 19. Session two: 15 prompts on March 20. Session three: another 15 prompts after I changed the prompt order so the models couldn’t coast on memory. The use case was technical doc rewrite plus incident summarization, which is exactly where sloppy reasoning gets expensive.
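
To keep the reordered third session honest, I seeded the shuffle. Here’s a minimal sketch of the session layout, assuming a plain Python harness; the session names, seed, and structure are mine, not from any tool:

```python
# Hypothetical sketch of the three-session layout described above.
import random

SESSIONS = [
    {"name": "session-1", "date": "2026-03-19", "prompts": 20, "shuffle": False},
    {"name": "session-2", "date": "2026-03-20", "prompts": 15, "shuffle": False},
    {"name": "session-3", "date": None, "prompts": 15, "shuffle": True},  # reshuffled run
]

def order_prompts(prompts, shuffle=False, seed=42):
    """Return prompts in fixed or seeded-shuffled order so a model can't coast on memory."""
    ordered = list(prompts)
    if shuffle:
        random.Random(seed).shuffle(ordered)  # seeded so reruns see the same order
    return ordered
```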

The models were Claude Sonnet 4.6, Claude Sonnet 4.5, and GPT-5 (Nov 2025 release). I also used Cursor 0.46 for one pass with @-symbol context tied to the incident notes, because I wanted to see whether tool-assisted context beat raw chat memory. It did, but only when I kept the context narrow.
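
Pinning is the unglamorous part, so here’s what I mean by it. A minimal sketch, with placeholder ID strings in angle brackets; the real model identifiers depend on your provider and are not shown here:

```python
# Hypothetical pinned-model registry. The <...> strings are placeholders,
# not real API identifiers; substitute whatever your provider documents.
PINNED_MODELS = {
    "sonnet-4.6":    {"provider": "anthropic", "model_id": "<exact-sonnet-4.6-id>"},
    "sonnet-4.5":    {"provider": "anthropic", "model_id": "<exact-sonnet-4.5-id>"},
    "gpt-5-nov2025": {"provider": "openai",    "model_id": "<exact-gpt-5-nov-2025-id>"},
}

def resolve(alias: str) -> str:
    """Fail loudly on an unpinned alias instead of silently falling back to 'latest'."""
    try:
        return PINNED_MODELS[alias]["model_id"]
    except KeyError:
        raise KeyError(f"unpinned model alias: {alias}") from None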

50 prompts total
3 pinned model versions
Fewer retries with Sonnet 4.6: 11, vs 14 for GPT-5 and 17 for Sonnet 4.5 (about 21% fewer than GPT-5)

I tracked three things: first-pass acceptance, retries, and whether the output preserved the chain of custody in the notes. Sonnet 4.6 needed 11 retries across the 50 prompts. GPT-5 needed 14. Sonnet 4.5 needed 17, which is where the comparison got annoying in a useful way. The newer Sonnet wasn’t just a little better; it was better in the part humans actually notice when they’re tired.
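
For anyone who wants to replicate the bookkeeping, here’s a minimal sketch of the tally, assuming hand-scored outputs. The field names are mine, there was no automated judge, and the chain-of-custody counts are placeholders because I only reported that metric qualitatively:

```python
# Hypothetical score sheet for the three tracked metrics. Counts were assigned
# by reading each output, not by an automated grader.
from dataclasses import dataclass

@dataclass
class ModelScore:
    first_pass_accepts: int
    retries: int
    custody_breaks: int  # outputs that merged or reordered the note trail

    @property
    def accept_rate(self) -> float:
        total = self.first_pass_accepts + self.retries
        return self.first_pass_accepts / total if total else 0.0

# Accept/retry counts from the 50-prompt run; custody_breaks left at 0 as placeholders.
scores = {
    "sonnet-4.6": ModelScore(first_pass_accepts=39, retries=11, custody_breaks=0),
    "gpt-5":      ModelScore(first_pass_accepts=36, retries=14, custody_breaks=0),
    "sonnet-4.5": ModelScore(first_pass_accepts=33, retries=17, custody_breaks=0),
}
```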

Day 3: Where the model slipped, and why the benchmark didn’t catch it

Most guides say the best model is the one with the cleanest benchmark chart. I disagree because the chart doesn’t show how often a model hallucinates confidence in a security context. On paper, GPT-5 (Nov 2025 release) was competitive. In practice, it was more likely to smooth over uncertainty, which is exactly the wrong instinct when the input includes a suspicious package, a timeline gap, and a weirdly theatrical malware name.

Across 50 prompts, GPT-5 produced 6 outputs that required factual correction on the first read (12%). Sonnet 4.6 produced 3 (6%). Sonnet 4.5 produced 7 (14%). That’s not a giant gap, but it’s enough to matter when the downstream audience is an analyst who will waste 20 minutes chasing a false lead.

The small caveat people skip

One caveat: Sonnet 4.6 was better, but it was also a little more conservative. If I wanted a punchier executive summary, GPT-5 gave me cleaner prose on the first pass. For long-form drafting, though, I’d rather edit a cautious draft than unwind an overconfident one. Your mileage may vary if your use case is marketing copy instead of incident response.

I also haven’t figured out why Sonnet 4.5 sometimes handled the same prompt better when I stripped out the @-context and forced it to rely on the plain text alone. That happened twice. It’s not enough to build a theory on, but it’s enough to keep me from pretending the ranking is universal.

Day 4: What actually worked in the messy middle

The best-performing setup was boring: Sonnet 4.6, Cursor 0.46, and a tightly scoped prompt that asked for a timeline, then a confidence note, then a rewrite. That sequence cut the average response time from 19.4 seconds to 14.8 seconds in my local workflow, mostly because I wasn’t forcing the model to juggle too many tasks at once.
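
The sequence is easy to misread as three separate prompts; it was one prompt with a fixed order. A minimal sketch of the template, with approximate wording (the exact phrasing I used isn’t reproduced here):

```python
# Approximate template for the tightly scoped prompt. The order is the point:
# timeline first, confidence note second, rewrite last.
TRIAGE_TEMPLATE = """\
Using ONLY the incident notes below:
1. Build a timeline: one event per line, with timestamps, in source order.
2. Add a confidence note: list anything ambiguous or unverifiable and say so plainly.
3. Rewrite the notes as a short incident summary that preserves the original ordering.

Incident notes:
{notes}
"""

def build_prompt(notes: str) -> str:
    return TRIAGE_TEMPLATE.format(notes=notes)
```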

The weird part is that the “shai-hulud themed malware found in the” phrasing helped as a stress test. It acted like a trapdoor for models that like to pattern-match too aggressively. Sonnet 4.6 didn’t love the label, but it handled the ambiguity without inventing extra facts. That’s worth more than a slicker sentence.

I tried one pass with the same brief split across Claude Projects, and that made the context cleaner but not magically smarter. It did reduce my own copy-paste errors, which is a less exciting win and a more honest one.

Day 5: The comparison table that made the pattern hard to ignore

Here’s the part I wish more comparison posts included: exact outcomes, pinned versions, and one annoying little row where the “best” model wasn’t best at the thing I cared about.

| Model / tool | Version | Prompts | First-pass accept | Retries | Notes |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | Sonnet 4.6 | 50 | 39/50 | 11 | Best chain-of-custody handling |
| Claude Sonnet 4.5 | Sonnet 4.5 | 50 | 33/50 | 17 | More cleanup needed |
| GPT-5 | Nov 2025 release | 50 | 36/50 | 14 | Cleaner prose, more overconfident |

The table doesn’t tell the whole story, but it does pin the story down. Sonnet 4.6 won on practical reliability. GPT-5 won on surface polish. Sonnet 4.5 felt like the older sibling: usable, but more correction-heavy than I wanted for a real incident memo.

Day 6: What I learned after the novelty wore off

The biggest lesson wasn’t that one model is “better” in some grand universal sense. It’s that the model everyone is praising is worse at Shai-Hulud themed malware triage than people admit, at least when the task demands disciplined uncertainty. Benchmarks reward the right instincts only part of the time. Real work rewards the model that doesn’t get cute.

My personal observation: the more adversarial the input looked, the more Sonnet 4.6 behaved like a careful editor instead of a fluent improviser. That mattered when the incident notes included timestamps, filenames, and a suspicious package path that could’ve been ordinary noise or a clue. I’d rather have a model say “unclear” than invent certainty at 2:13 a.m.

One more practical detail. At 22°C in my office, with a second monitor full of logs and three tabs open, the difference between “usable” and “annoying” was a few seconds per prompt. That sounds tiny. It isn’t, once you’ve done it 50 times.

Day 7: Next time I’ll narrow the context even more

Next time, I’m going to split the workflow earlier: one model for extraction, one for rewrite, and no single prompt asking for both. I’ll also pin the release notes alongside the model version, because a version label without a changelog is just a fancy guess.
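
Sketched out, the split looks like this; `call_model` is a stand-in for whatever client you use, not a real API:

```python
# Hypothetical two-stage split: one model extracts, another rewrites, and no
# single prompt asks for both. call_model is a placeholder, not a real client.
def call_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your provider's client")

def split_triage(notes: str, extract_model: str, rewrite_model: str) -> str:
    # Stage 1: extraction only -- facts, timestamps, filenames, no editorializing.
    facts = call_model(
        extract_model,
        "Extract a flat list of facts, timestamps, and filenames "
        f"from these incident notes. Add nothing:\n{notes}",
    )
    # Stage 2: rewrite only -- the second model never sees the raw, noisy notes.
    return call_model(
        rewrite_model,
        "Rewrite these extracted facts as a concise incident summary. "
        f"Flag anything uncertain:\n{facts}",
    )
```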

I don’t think the lesson is “always use Sonnet 4.6.” That’s too neat. The lesson is that version pinning matters, small tests beat big claims, and incident-style prompts expose weaknesses that normal benchmark talk hides. If a model can’t keep a suspicious label in its place, it’s not ready for the messy middle of real analysis.

And yes, I still think the “shai-hulud themed malware found in the” phrasing is a useful stress test, not because it’s clever, but because it tempts models to sound smarter than they are. That temptation is the bug.

FAQ

Q: Why pin Sonnet 4.6 and Sonnet 4.5 separately?

A: Because the difference showed up in retry counts and factual cleanup. Sonnet 4.6 needed 11 retries across 50 prompts, while Sonnet 4.5 needed 17. That’s enough to change the workflow, not just the spreadsheet.

Q: Did GPT-5 (Nov 2025 release) lose everywhere?

A: No. It often produced cleaner first drafts and faster prose. It just needed more correction when the prompt mixed ambiguity, incident detail, and the “shai-hulud themed malware found in the” label.

Q: What tool feature mattered most?

A: Cursor’s @-symbol context mattered most for keeping the incident notes scoped. Claude Projects helped with organization, but it didn’t replace careful prompt design.

Key Takeaway

For messy incident-style work, Claude Sonnet 4.6 beat GPT-5 (Nov 2025 release) on reliability, but the real win came from tighter context and explicit version pinning.

My practical takeaway: pin the model, shrink the context, and trust the one that admits uncertainty first.
