Stop using ChatGPT for code review — use an actual security workflow instead. AI is not just speeding up vulnerability work; it’s splitting the old culture in two. One camp still treats review as a human craft exercise. The other is already using models to triage, explain, and sometimes even spot issues faster than a tired reviewer can.
Key Takeaway
AI is breaking both the “manual reviewer as final authority” culture and the “tool output is truth” culture. The teams that adapt fastest keep humans on judgment and let models handle the first pass.
What two vulnerability cultures are getting broken?
One culture says security comes from a senior engineer reading code carefully enough to catch what scanners miss. The other says security comes from tooling: static analysis, SAST, dependency scanners, and a queue of findings someone eventually investigates. AI is poking holes in both.
For the first camp, models are making rough-but-useful reviews cheaper. Tools like Cursor can pull in local files, docs, and issue threads, which changes review from “read the whole repo” to “focus on the risky slice.” For the second camp, model output is often persuasive enough to feel definitive even when it isn’t. That’s the trap. A fluent explanation can hide a weak finding.
During a recent technical doc rewrite, it became apparent that the model was faster at pointing to likely edge cases than it was at proving them. That’s the pattern here. AI is good at narrowing the search space. It’s not a substitute for validation. The failure mode worth watching is people over-trusting polished text.
Where the old manual habit breaks
Manual review assumes attention is the scarce resource. AI changes that. The scarce resource becomes judgment: deciding which warnings deserve a second look. In security writeups, models are strongest when the task is bounded — a codebase refactor, a technical doc rewrite, a suspicious diff. They’re weaker when asked to “review everything” because everything becomes a blur.
A model can sound confident while being directionally wrong. That’s not a nitpick; it’s the core problem.
Why do scanners and copilots fail in opposite ways?
Most guides say “use more tools.” This advice overlooks that the failure modes matter more than the count. Traditional scanners fail noisily. They produce a lot of false positives, but the signal is at least inspectable. Copilots fail conversationally. They can omit the one assumption that changes the answer.
This difference is consistent with public reporting on model behavior, including Anthropic’s system card material for Claude and OpenAI’s safety and model documentation. The lesson is less “which model wins” and more “which workflow makes errors visible.” When a tool says, “this code may be vulnerable because X,” you can trace X. When a model says, “this looks fine,” you often can’t see the missing branch until later.
For technical doc rewrites, a model can be a very good first-pass editor and a mediocre final reviewer. For codebase refactors, it can catch naming inconsistencies, dead paths, and missing checks quickly. But if you ask it to bless a security-sensitive change, you need a second layer: tests, policy checks, or a human who knows the threat model.
The most useful pattern is to make the model argue both sides. Ask it to identify the bug, then ask it to explain why it might be wrong. That small extra step reduces the “pretty answer” problem.
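One way to make the two-sided pass routine is to keep both prompts as small functions, so the rebuttal step is never skipped. This is a minimal sketch: the function names and prompt wording are hypothetical, and it only builds the prompt text. Sending it to a model is left to whatever client your team already uses.

```python
def build_find_prompt(diff: str) -> str:
    """First pass: ask the model to name the most likely bug in a diff."""
    return (
        "Review the following diff and identify the single most likely "
        "security bug. State the assumption your finding depends on.\n\n"
        f"Diff:\n{diff}"
    )


def build_rebuttal_prompt(diff: str, finding: str) -> str:
    """Second pass: ask the model to argue against the first-pass finding."""
    return (
        "Below is a diff and a claimed finding. Explain why the finding "
        "might be wrong, and list the evidence that would confirm or "
        "refute it.\n\n"
        f"Finding: {finding}\n\n"
        f"Diff:\n{diff}"
    )


if __name__ == "__main__":
    # Hypothetical one-line diff, used only to show the shape of the output.
    sample_diff = "- if user.is_admin:\n+ if user:"
    print(build_find_prompt(sample_diff))
    print(build_rebuttal_prompt(sample_diff, "Admin check was removed."))
```

Keeping the finding and the rebuttal as separate requests means a reviewer can read the two answers side by side instead of one polished paragraph.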
What breaks faster: false positives or false confidence?
False positives waste time. False confidence ships risk. AI is making the second one more dangerous because the prose is so smooth. This is the cultural shift. We used to distrust noisy tools. Now we also have to distrust elegant ones.
How should teams use AI without turning it into a security theater machine?
Use AI as a triage layer, not an authority. Give it narrow jobs: summarize a diff, flag suspicious auth logic, compare two versions of a function, or explain a dependency change in plain English. Pin the model version in your workflow notes as well, because output patterns can shift between model versions and generations.
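A lightweight way to enforce both habits, narrow jobs and a pinned version, is to record each model pass as a small validated object. The sketch below is an assumption about how a team might do this. The job identifiers, the `ReviewJob` name, and the example version string are all made up for illustration.

```python
from dataclasses import dataclass

# The four narrow jobs named in the text; the identifiers are hypothetical.
ALLOWED_JOBS = frozenset({
    "summarize_diff",
    "flag_auth_logic",
    "compare_versions",
    "explain_dependency_change",
})


@dataclass(frozen=True)
class ReviewJob:
    job: str            # which narrow task the model is given
    model_version: str  # pinned version string, recorded with the result
    target: str         # file, diff, or dependency under review

    def __post_init__(self) -> None:
        # Reject open-ended jobs such as "review everything".
        if self.job not in ALLOWED_JOBS:
            raise ValueError(
                f"job must be one of {sorted(ALLOWED_JOBS)}, got {self.job!r}"
            )
        # Reject an unpinned (blank) model version.
        if not self.model_version.strip():
            raise ValueError("model_version must be pinned, not left blank")

    def note(self) -> str:
        """One-line entry suitable for workflow notes."""
        return f"{self.job} on {self.target} (model: {self.model_version})"


if __name__ == "__main__":
    # "example-model-2024-01" is a placeholder, not a real model name.
    job = ReviewJob("summarize_diff", "example-model-2024-01",
                    "services/auth/login.py")
    print(job.note())
```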
Here’s the practical split: let scanners catch breadth, let models compress context, and let humans decide severity. Tools that anchor the model to the relevant files are valuable, but neither they nor the model replace review discipline. They just make the first pass less painful.
| Workflow | Best use | Main risk | Source |
|---|---|---|---|
| Static scanner | Breadth over a repo | False positives and alert fatigue | Vendor docs and common security practice |
| Claude / GPT review pass | Summarizing a diff, explaining likely issues | Confident omissions | Anthropic/OpenAI model documentation |
| Cursor with contextual files | Local codebase refactor and targeted review | Over-trusting partial context | Cursor product docs |
AI is weakest when the security question depends on organizational context, not syntax. A permission change can be safe in one service and dangerous in another. Models don’t know your politics, rollout rules, or incident history unless you feed that in carefully.
So yes, AI is breaking two vulnerability cultures. The fix isn’t “more AI.” It’s clearer boundaries.
What should a modern review stack look like now?
Think in layers. First, automated checks for known bad patterns. Second, a model pass for explanation and prioritization. Third, a human review that only focuses on the small set of findings that still look plausible after the first two layers. That workflow is slower to describe than to use.
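The three layers can be sketched as a small pipeline. Everything here is illustrative: `Finding`, `triage`, and `stub_model_pass` are hypothetical names, and the stub stands in for a real model call so the example runs on its own. The sketch keeps the findings the model dismissed rather than discarding them, which is one way to keep a confident omission inspectable later.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Finding:
    rule: str      # scanner rule that fired
    location: str  # file and line reported by the scanner
    note: str = "" # explanation added by the model pass


# A model pass returns (still_plausible, explanation) for one finding.
ModelPass = Callable[[Finding], Tuple[bool, str]]


def triage(
    scanner_findings: Iterable[Finding], model_pass: ModelPass
) -> Tuple[List[Finding], List[Finding]]:
    """Layer 2: split scanner output into a human queue and a dismissed log."""
    human_queue: List[Finding] = []
    dismissed: List[Finding] = []
    for finding in scanner_findings:
        plausible, explanation = model_pass(finding)
        annotated = Finding(finding.rule, finding.location, explanation)
        (human_queue if plausible else dismissed).append(annotated)
    return human_queue, dismissed


def stub_model_pass(finding: Finding) -> Tuple[bool, str]:
    """Stand-in for a real model call: treats auth rules as plausible."""
    plausible = "auth" in finding.rule
    return plausible, "stub explanation for " + finding.rule


if __name__ == "__main__":
    # Layer 1 output, hard-coded here in place of a real scanner run.
    findings = [
        Finding("auth-missing-check", "api/users.py:42"),
        Finding("style-unused-variable", "api/utils.py:7"),
    ]
    queue, dropped = triage(findings, stub_model_pass)
    # Layer 3: a human reviews only `queue`, and can audit `dropped`.
    for item in queue:
        print("REVIEW:", item.location, "-", item.note)
    for item in dropped:
        print("LOGGED:", item.location, "-", item.note)
```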
It also changes how you write prompts. “Find vulnerabilities” is too vague. “Review this auth-related diff for missing checks, unsafe defaults, and places where the new branch bypasses existing policy” is much better. The model needs a frame. Security teams do too.
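The framing above can be turned into a small template so that a vague request is hard to write by accident. This is a sketch under assumptions: `build_scoped_prompt` and its parameters are hypothetical, and the wording is only one possible phrasing of the scoped prompt quoted above.

```python
from typing import Sequence


def build_scoped_prompt(area: str, concerns: Sequence[str], diff: str) -> str:
    """Build a review prompt that names the area and the specific concerns."""
    if not area.strip():
        raise ValueError("name the area under review, e.g. 'auth'")
    if not concerns:
        # An empty list would collapse into the vague "find vulnerabilities".
        raise ValueError("name at least one specific concern")
    bullets = "\n".join(f"- {concern}" for concern in concerns)
    return (
        f"Review this {area}-related diff for the following, and nothing else:\n"
        f"{bullets}\n\n"
        f"Diff:\n{diff}"
    )


if __name__ == "__main__":
    prompt = build_scoped_prompt(
        "auth",
        [
            "missing checks",
            "unsafe defaults",
            "places where the new branch bypasses existing policy",
        ],
        "+ if request.user:\n+     return allow()",  # hypothetical diff
    )
    print(prompt)
```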
The practical takeaway: stop asking AI to be your final reviewer and stop treating scanners as the whole story. Use AI to break the backlog apart, not to certify safety. Pin the model version, the review scope, and the human sign-off rule.
That’s the new culture shift. Less faith. More structure.
FAQ
Q: Can AI replace code review for vulnerabilities?
A: No. It can speed up triage and explain suspicious changes, but it still misses context and can sound more certain than it should.
Q: Which is safer for security work: scanners or models?
A: Neither alone. Scanners are better at breadth; models are better at compression and explanation. The safest setup combines both with a human decision point.
Q: What’s the biggest mistake teams make with AI review?
A: Treating fluent output as proof. A polished explanation is not the same thing as a validated finding.
Practical takeaway: Use AI to narrow the search, not to declare the case closed. What would break first in your current review flow: false positives, or false confidence?
Sources
Anthropic Claude documentation and system card materials: https://www.anthropic.com/news/claude-3-5-sonnet and https://www.anthropic.com/research
OpenAI model and safety documentation: https://openai.com/index/gpt-5/ and https://openai.com/index/
Cursor product documentation: https://docs.cursor.com/
Claude Projects documentation: https://support.anthropic.com/