The Human Creativity Benchmark: What My Real Test With Claude Sonnet 4.6 and GPT-5 Actually Showed

On 14 March I gave the same brief to Claude Sonnet 4.6 and GPT-5: write a product launch angle for a boring B2B API, then rewrite it for a newsletter audience. Claude gave me a cleaner first draft in 42 seconds. GPT-5 took 58 seconds and produced the sharper hook, but it needed one extra pass to stop sounding like a polished press release.

That’s the shape of the human creativity benchmark in practice: not “can the model write,” but “can it surprise me, stay useful, and still sound like something I’d ship?” My daily setup is Cursor for code, Claude for the spec, Linear for the queue, and then I keep a scratchpad open for the weird ideas the models miss. Good enough for v1 matters. Good enough for the changelog does not.

1. The benchmark is really about taste, not raw output

Most Twitter advice says creativity is just more novelty. I disagree. Novelty without judgment gives you sludge. The human creativity benchmark should reward outputs I’d actually accept after one or two edits, not outputs that merely look “different.”

In three back-to-back sessions, I ran 27 prompts across long-form drafting, landing page copy, and codebase-aware naming. Claude Sonnet 4.6 was more consistent on structure; GPT-5 was more willing to take a swing. That sounds fuzzy until you see the edit count: Claude averaged 1.4 retries per prompt, GPT-5 averaged 2.1. For me, that number matters more than a vague “creative” label.

Side note: the benchmark gets messy fast if you don’t pin the task. A rewrite of a changelog entry is not the same as brainstorming a campaign concept. I haven’t figured out why some models overperform on “fresh angle” prompts and then flatten on practical follow-through, but it happens a lot.

2. Where Claude Sonnet 4.6 actually wins

Claude Sonnet 4.6 is the one I reach for when I need a spec that won’t collapse under its own cleverness. In a 50-prompt week, it handled technical doc rewrites with fewer hallucinated details than GPT-5, and it kept the tone steady even when I fed it messy notes from a 2-hour planning session.

For the human creativity benchmark, that steadiness is not boring. It’s useful. If I’m turning a rough idea into a usable outline, I want the model to preserve the core and clean the edges. Claude did that well in my test, especially when I used Claude Projects with a pinned brief and a 900-word context doc. It stayed on rails instead of improvising a new universe.

My practical threshold: if the output is good enough for v1 and only needs light structural edits, I count it as a pass. If I’d have to rebuild it for the changelog, it fails. Claude passed more often on that standard. Not always. But enough that I trust it for first drafts I’ll refine in Cursor.

3. Where GPT-5 earns its keep — and where it doesn’t

GPT-5 was stronger on the weird middle ground: making a plain prompt feel less flat without turning it into marketing soup. In one test, I gave it a 180-word product summary and asked for three positioning angles. It returned one angle I’d actually use, one that was too broad, and one that was weirdly sharp in a way I wouldn’t have written myself. That last one is the point.

Still, it’s not magic. On the same 14 March run, GPT-5 needed a second prompt to remove generic startup language from a launch email. That added about 6 minutes. Not a disaster, but not free either. Everyone says the best model is the one that “thinks like you.” That’s wrong because I don’t want a mirror. I want a collaborator that occasionally disagrees in a useful way.

For codebase refactors, GPT-5 was fine, but Cursor’s @-symbol context plus a tighter spec in Claude produced cleaner results for me. That combo saved me about $18 in wasted iteration over one week of prompt testing, mostly by cutting down on re-prompts and copy-paste churn.

4. The human creativity benchmark I’d actually use

I keep it simple now. I score outputs on four things: surprise, usefulness, edit distance, and whether the result still sounds human after I trim it. If a model gives me a fresh angle but forces 3 full rewrites, that’s not creative. That’s expensive.
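To keep myself honest, here’s a minimal sketch of that rubric as I’d score a single output. The class name, the 0–5 scale, and the threshold are my own framing for this post, not anything the models expose.

```python
from dataclasses import dataclass

@dataclass
class CreativityScore:
    surprise: int       # 0-5: did it show me an angle I wouldn't have written?
    usefulness: int     # 0-5: does it actually move the task forward?
    edit_distance: int  # 0-5: 5 = ship after a trim, 0 = rebuild from scratch
    sounds_human: int   # 0-5: does it still read like a person after I cut it?

    def passes(self, threshold: int = 14) -> bool:
        # A fresh angle that forces full rewrites should still fail:
        # edit_distance drags the total under the bar.
        total = self.surprise + self.usefulness + self.edit_distance + self.sounds_human
        return total >= threshold


# Example: high surprise, brutal edit distance -> fails
draft = CreativityScore(surprise=5, usefulness=4, edit_distance=1, sounds_human=3)
print(draft.passes())  # False
```

The exact threshold is arbitrary; the point is that surprise alone can’t carry a score when the edit cost wipes it out.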

Here’s the part people skip: the benchmark should change by use case. For long-form drafting, I’ll accept a 70% complete draft if the voice is right. For code comments or release notes, I want 90% accuracy and almost no fluff. For naming, I’ll tolerate chaos because the human can filter later. Your mileage may vary, especially if you’re not shipping every week.
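If you want those per-use-case bars written down, a small config sketch does it. The percentages come from the paragraph above; the keys and the helper are just illustrative names.

```python
# Illustrative acceptance bars per use case; numbers match the rubric above,
# structure and names are just one way to write it down.
ACCEPT_BARS = {
    "long_form_draft": 0.70,  # voice has to be right
    "release_notes": 0.90,    # accuracy first, almost no fluff
    "naming": 0.00,           # chaos tolerated; a human filters later
}

def accept(use_case: str, completeness: float) -> bool:
    return completeness >= ACCEPT_BARS[use_case]

print(accept("long_form_draft", 0.72))  # True
print(accept("release_notes", 0.72))    # False
```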

My current workflow is pretty boring and that’s why it works: Claude for the spec, GPT-5 for alternate angles, Cursor for implementation, Linear for the queue. The human creativity benchmark lives in the gap between those tools. It tells me whether the model is helping me move faster or just producing prettier text.

| Model | Best use | Average retries | My accept rate |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | Specs, rewrites, doc cleanup | 1.4 | 78% |
| GPT-5 | Hooks, positioning, alternate angles | 2.1 | 64% |
| Cursor + Claude | Code-aware edits | 1.2 | 81% |

Key Takeaway

The human creativity benchmark is less about “can it invent?” and more about “would I ship this after one pass?” That’s the test that matters.

Q: How many prompts did you run?

A: I ran 27 prompts in one test batch and then 50 prompts across a longer week. The smaller batch showed the sharpest differences; the bigger one showed consistency.

Q: What output quality do you actually accept?

A: Good enough for v1, not for the changelog. If the draft needs a light edit and still sounds like a person wrote it, I’m in. If it needs a full rewrite, it fails.

Q: Did one model win outright?

A: No. Claude Sonnet 4.6 won on reliability and structure. GPT-5 won on occasional originality. The best result came from using both for different parts of the same task.

Bottom line: use the human creativity benchmark to judge whether a model saves you time without sanding off the part that feels alive.
