Why Verifying AI Code Matters More Than Generating It
By: Evgeny Padezhnov
Software development changed when AI started writing code. Now it needs to change again.
The proliferation of AI coding assistants has created a new problem: developers generate code faster than they can verify it. According to dev.to contributor shrsv, "generation is cheap, verification is expensive." This gap between creation and validation is reshaping what makes engineers valuable.
The Verification Bottleneck
AI generates code in seconds. Verification takes hours.
Key point: Writing code faster doesn't automatically mean faster integration or maintenance. Tessl's analysis shows verification practices haven't scaled with generation capabilities. Every piece of AI-generated code essentially becomes a hypothesis that needs testing.
Common mistake: Treating AI code like human-written code. GitHub's documentation emphasizes that AI-generated code requires different review approaches. The code compiles, passes basic tests, and reads cleanly — but problems show up under stress, misuse, or unexpected input.
From Debugging to Refutation
Traditional code review looks for bugs. AI code review requires something different.
The shift moves from debugging to what shrsv calls "refutation" — actively trying to falsify the assumptions the model made. This borrows from Karl Popper's scientific method: treat every AI output as a hypothesis to be tested.
In practice:
```shell
# Traditional review: scan the diff for known trouble markers
git diff | grep -E "(TODO|FIXME|bug)"

# AI code review: actively try to falsify assumptions
# (illustrative commands; substitute your own tooling)
pytest test_edge_cases.py
security-scanner --check-assumptions
load-test --unexpected-patterns
```
The New Stack describes this as moving from "vibe coding" to systematic verification. Trust and verify becomes the mantra — emphasis on verify.
The Economics of Validation
A six-month study cited by shrsv revealed the economic reality: one engineer building an AI secretary found that generation costs approach zero while verification costs keep growing.
Why verification stays expensive:
- Each AI output needs individual testing
- Models make confident guesses to fill ambiguity
- Surface-level quality misleads reviewers
- Business context gets lost in generation
Brightsec's analysis notes: "If a requirement is ambiguous, the model will still produce something. That 'something' may work functionally while violating security boundaries."
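Brightsec's point can be made concrete. In this sketch, both hypothetical `find_user_*` helpers satisfy the functional requirement ("look up a user by name"), but only the parameterized one survives an injection probe. The table schema and function names are invented for illustration:

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Functionally "works" for normal input, but the ambiguous spec
    # ("look up a user by name") lets injection straight through.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # Same functional behavior, with the security boundary enforced
    # by a parameterized query.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"                   # classic injection probe
leaked = find_user_unsafe(conn, payload)   # matches every row
guarded = find_user_safe(conn, payload)    # matches nothing
print(len(leaked), len(guarded))
```

A happy-path test with `name="alice"` passes for both versions, which is exactly why surface-level review misleads.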
Practical Verification Framework
Tested in production: The three-step approach from shrsv's dev.to post:
- Generate with high temperature — Get diverse options from AI
- Refute with discipline — Test each option against constraints
- Keep only verified pieces — Integrate what survives testing
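The three steps above can be sketched in a few lines of Python. The candidate implementations and probe values are hypothetical stand-ins for real model samples and real constraints:

```python
def generate_candidates():
    # Stand-in for sampling the model at high temperature:
    # several plausible implementations of the same spec
    # ("clamp a value to the range 0..100").
    def clamp_ok(x):          # correct
        return max(0, min(100, x))
    def clamp_no_upper(x):    # subtly wrong: ignores the ceiling
        return max(0, x)
    def clamp_no_lower(x):    # subtly wrong: ignores the floor
        return min(100, x)
    return [clamp_ok, clamp_no_upper, clamp_no_lower]

def refute(candidate):
    # Discipline: probe the constraints the spec actually demands,
    # including values outside the expected range.
    probes = [-10, 0, 50, 100, 999]
    return all(0 <= candidate(x) <= 100 for x in probes)

# Keep only what survives refutation.
survivors = [f for f in generate_candidates() if refute(f)]
print([f.__name__ for f in survivors])
```

Only the correct candidate survives; the two subtly wrong ones pass casual inspection but fail the boundary probes.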
Tools that make refutation systematic:
```shell
# Property-based testing (Hypothesis tests run through pytest)
pytest test_invariants.py

# Security boundary checks
semgrep --config=auto src/

# Performance regression tests (ApacheBench)
ab -n 1000 -c 10 http://localhost:8080/api/
```
Apiiro's recommendations add that changes affecting "APIs, authentication, sensitive data, or AI-generated code receive deeper review" than standard changes.
The Competitive Advantage
Developers who excel at refutation outperform those who only generate.
In plain terms: the engineer who thrives won't be the one who prompts the most code out of a model, but the one who can reliably decide which parts of that output deserve to survive.
Addy Osmani's Substack frames it clearly: AI serves as a first-pass reviewer, not a final decision-maker. Human accountability remains central to code review, especially for AI-assisted changes.
Try it: Pick any AI-generated function in your codebase. Write three tests specifically designed to break its assumptions. Most such functions crack within minutes.
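As a sketch of what such assumption-breaking tests look like, here is a hypothetical AI-generated `parse_price` helper alongside three inputs chosen to contradict what it silently assumes:

```python
def parse_price(text):
    # Hypothetical AI-generated helper: fine on "$4.99",
    # untested on everything else.
    return float(text.strip().lstrip("$"))

# Three inputs chosen to break its assumptions:
# empty input, a locale-style decimal comma, and a negative price.
breakers = ["", "4,99", "$-5.00"]
results = {}
for case in breakers:
    try:
        results[case] = parse_price(case) >= 0   # a price is never negative
    except ValueError:
        results[case] = False                    # crashes instead of failing gracefully
print(results)
```

All three probes fail: the function raises on empty and comma-formatted input and happily returns a negative price, none of which a happy-path test would catch.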
Building Verification Infrastructure
Organizations need infrastructure that makes criticism as fast as generation.
Veracode's security framework shows how to embed verification:
- SAST in IDEs catches issues during generation
- DAST in staging environments tests runtime behavior
- SCA validates dependencies AI recommends
- CI/CD pipelines enforce verification gates
The infrastructure shift treats all AI code as untrusted by default. Brightsec compares it to reviewing code "copied from an external repository or pasted from an online forum."
What to Try Right Now
Start small: Take one AI-generated function from today's work. Write a test that assumes the opposite of what the function claims to do. Run it. Most likely, it reveals an edge case the AI missed.
The future of software engineering doesn't belong to those who generate the most code. It belongs to those who can systematically verify what should survive.
Frequently Asked Questions
How can you design focused validation tools that check for specific properties rather than trying to assess code quality broadly?
Property-based testing frameworks like Hypothesis focus on invariants rather than broad quality metrics. Define what must always be true (no null pointer exceptions, API responses under 200ms) and test specifically for violations. Tools like semgrep allow custom rules targeting exact patterns AI tends to misuse.
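To illustrate the invariant-first mindset without pulling in Hypothesis itself, here is a dependency-free sketch using the stdlib `random` module; `normalize` and its claimed invariants are hypothetical:

```python
import random
import string

def normalize(tags):
    # Hypothetical function under test: claims to deduplicate
    # tags while lowercasing each one.
    return list(dict.fromkeys(t.lower() for t in tags))

def random_tags():
    # Crude stand-in for a Hypothesis strategy like st.lists(st.text()).
    return ["".join(random.choices(string.ascii_letters, k=random.randint(0, 5)))
            for _ in range(random.randint(0, 8))]

violations = 0
for _ in range(500):                      # many generated cases, not one example
    out = normalize(random_tags())
    if len(out) != len(set(out)):         # invariant: no duplicates, ever
        violations += 1
    if any(t != t.lower() for t in out):  # invariant: always lowercased
        violations += 1
print("violations:", violations)
```

Hypothesis does the same job far more thoroughly: it shrinks failing inputs to minimal counterexamples and explores edge cases (empty strings, unicode) that naive random generation misses.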
What questions should engineers ask first to determine which hypotheses are worth testing instead of running tests blindly?
Start with business impact: What happens if this assumption fails in production? Check security boundaries first, then performance characteristics, then edge cases around data handling. Prioritize tests for code touching authentication, payment processing, or user data over internal utilities.
How do you structure CI/CD to make systematic criticism of AI-generated code as abundant and fast as the generation itself?
Run verification in parallel stages: security scanners, performance benchmarks, and property tests simultaneously. Use feature flags to test AI-generated code in production with limited blast radius. Set up automated rollback triggers when metrics deviate from baseline performance.
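A rollback trigger of the kind described can be sketched as a simple baseline comparison; the p95 metric and 20% tolerance here are illustrative choices, not a recommendation:

```python
def should_roll_back(baseline_ms, samples, tolerance=0.2):
    # Hypothetical gate: trigger rollback when observed p95 latency
    # drifts more than `tolerance` above the recorded baseline.
    ordered = sorted(samples)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 > baseline_ms * (1 + tolerance)

print(should_roll_back(100, [90, 95, 102, 99, 104]))   # within budget
print(should_roll_back(100, [90, 95, 150, 160, 170]))  # regression
```

In a real pipeline this check would run continuously against production metrics, with the rollback wired to the same deployment tooling that shipped the AI-generated change.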