AI & ML March 1, 2026

Why Verifying AI Code Matters More Than Generating It

By: Evgeny Padezhnov


Software development changed when AI started writing code. Now it needs to change again.

The proliferation of AI coding assistants created a new problem: developers generate code faster than they can verify it. According to dev.to contributor shrsv, "generation is cheap, verification is expensive." This gap between creation and validation reshapes what makes engineers valuable.

The Verification Bottleneck

AI generates code in seconds. Verification takes hours.

Key point: Writing code faster doesn't automatically mean faster integration or maintenance. Tessl's analysis shows verification practices haven't scaled with generation capabilities. Every piece of AI-generated code essentially becomes a hypothesis that needs testing.

Common mistake: Treating AI code like human-written code. GitHub's documentation emphasizes that AI-generated code requires different review approaches. The code compiles, passes basic tests, and reads cleanly — but problems show up under stress, misuse, or unexpected input.

From Debugging to Refutation

Traditional code review looks for bugs. AI code review requires something different.

The shift moves from debugging to what shrsv calls "refutation" — actively trying to falsify the assumptions the model made. This borrows from Karl Popper's scientific method: treat every AI output as a hypothesis to be tested.

In practice:

# Traditional review
git diff | grep -E "(TODO|FIXME|bug)"

# AI code review: try to falsify the model's assumptions
pytest test_edge_cases.py        # property-based edge-case tests (Hypothesis runs as a pytest plugin)
semgrep --config=auto src/       # scan for violated security assumptions
ab -n 1000 -c 10 http://localhost:8080/api/   # load with traffic the model didn't anticipate
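The refutation mindset fits in a few lines of Python. The `slugify` helper below is a hypothetical stand-in for AI-generated code, and its implicit assumption is that every input contains at least one alphanumeric character; the test's only job is to find inputs that falsify the hypothesis "the output is never empty."

```python
# Hypothetical AI-generated helper: implicitly assumes the input
# always contains at least one alphanumeric character.
def slugify(title: str) -> str:
    words = "".join(c if c.isalnum() else " " for c in title).split()
    return "-".join(w.lower() for w in words)

def refute(fn, cases):
    """Collect every input that falsifies the hypothesis 'fn never returns empty'."""
    return [case for case in cases if fn(case) == ""]

# Inputs chosen to attack the assumption, not to confirm the happy path.
adversarial = ["", "!!!", "   ", "---", "Hello World"]
counterexamples = refute(slugify, adversarial)
```

Here four of the five adversarial inputs break the assumption. A debugging review that only ran `slugify("Hello World")` would have passed the function without comment.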

The New Stack describes this as moving from "vibe coding" to systematic verification. Trust and verify becomes the mantra — emphasis on verify.

The Economics of Validation

A six-month study cited by shrsv revealed the economic reality: one engineer building an AI secretary found that generation costs approached zero while verification costs kept rising.

Why verification stays expensive:

  1. Models fill ambiguity with confident guesses instead of flagging it.
  2. Surface-level quality misleads reviewers: code that compiles and reads cleanly can still violate constraints.
  3. Business context is lost during generation, so intent has to be re-checked by a human.

Brightsec's analysis notes: "If a requirement is ambiguous, the model will still produce something. That 'something' may work functionally while violating security boundaries."

Practical Verification Framework

Tested in production: The three-step approach from shrsv's dev.to post:

  1. Generate with high temperature — Get diverse options from AI
  2. Refute with discipline — Test each option against constraints
  3. Keep only verified pieces — Integrate what survives testing
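The three steps above can be sketched as a loop. The two candidate functions are hypothetical stand-ins for diverse high-temperature AI outputs; the refutation step tests each against explicit constraints, and only survivors are kept.

```python
# Step 1 stand-ins: two "generated" candidates for an averaging helper.
def candidate_a(xs):
    # defends against the empty-input edge case
    return sum(xs) / len(xs) if xs else 0.0

def candidate_b(xs):
    # crashes on empty input
    return sum(xs) / len(xs)

def survives(fn):
    """Step 2: refute with discipline; a candidate survives only if no constraint falsifies it."""
    constraints = [
        lambda: fn([1.0, 2.0, 3.0]) == 2.0,  # basic correctness
        lambda: fn([]) == 0.0,               # edge case the model may have skipped
    ]
    for check in constraints:
        try:
            if not check():
                return False
        except Exception:
            return False
    return True

# Step 3: keep only the verified pieces.
verified = [f.__name__ for f in (candidate_a, candidate_b) if survives(f)]
```

Note that `candidate_b` passes the basic-correctness check; it is eliminated only because the refutation step deliberately probes the input the model is most likely to have ignored.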

Tools that make refutation systematic:

# Property-based testing (Hypothesis runs as a pytest plugin)
pytest test_invariants.py

# Security boundary checks
semgrep --config=auto src/

# Performance regression tests
ab -n 1000 -c 10 http://localhost:8080/api/

Apiiro's recommendations add that changes affecting "APIs, authentication, sensitive data, or AI-generated code receive deeper review" than standard changes.

The Competitive Advantage

Developers who excel at refutation outperform those who only generate.

In plain terms: The engineer who thrives won't be the one prompting the largest code output. Success belongs to those who can reliably decide what survives from AI generation.

Addy Osmani's Substack frames it clearly: AI serves as a first-pass reviewer, not a final decision-maker. Human accountability remains central to code review, especially for AI-assisted changes.

Try it: Pick any AI-generated function in your codebase. Write three tests specifically designed to break its assumptions. Most functions fail at least one of them within minutes.

Building Verification Infrastructure

Organizations need infrastructure that makes criticism as fast as generation.

Veracode's security framework shows how to embed verification directly into the development pipeline.

The infrastructure shift treats all AI code as untrusted by default. Brightsec compares it to reviewing code "copied from an external repository or pasted from an online forum."

What to Try Right Now

Start small: Take one AI-generated function from today's work. Write a test that assumes the opposite of what the function claims to do. Run it. Most likely, it reveals an edge case the AI missed.

The future of software engineering doesn't belong to those who generate the most code. It belongs to those who can systematically verify what should survive.

Frequently Asked Questions

How can you design focused validation tools that check for specific properties rather than trying to assess code quality broadly?

Property-based testing frameworks like Hypothesis focus on invariants rather than broad quality metrics. Define what must always be true (no null pointer exceptions, API responses under 200ms) and test specifically for violations. Tools like semgrep allow custom rules targeting exact patterns AI tends to misuse.
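A focused property check doesn't even require a framework: state one invariant and search exhaustively for violations. The `clamp` helper below is a hypothetical example; the single property under test is that the result always lies inside `[lo, hi]`.

```python
import itertools

# Hypothetical helper under test.
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# One invariant, checked over every small input combination:
# whenever lo <= hi, the result must land inside [lo, hi].
violations = [
    (x, lo, hi)
    for x, lo, hi in itertools.product(range(-3, 4), repeat=3)
    if lo <= hi and not (lo <= clamp(x, lo, hi) <= hi)
]
```

An empty `violations` list is evidence for the invariant over the enumerated domain; a framework like Hypothesis generalizes the same idea with smarter input generation and automatic shrinking of counterexamples.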

What questions should engineers ask first to determine which hypotheses are worth testing instead of running tests blindly?

Start with business impact: What happens if this assumption fails in production? Check security boundaries first, then performance characteristics, then edge cases around data handling. Prioritize tests for code touching authentication, payment processing, or user data over internal utilities.
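The prioritization above can be made mechanical with a small triage score. Everything here is illustrative: the weights, tags, and file names are hypothetical, not a published rubric.

```python
# Illustrative weights for the risk signals named above.
RISK_WEIGHTS = {"auth": 5, "payment": 5, "user_data": 4, "api": 3, "internal_util": 1}

def risk_score(tags):
    """Sum the weights of every risk signal attached to a change."""
    return sum(RISK_WEIGHTS.get(tag, 0) for tag in tags)

changes = {
    "login_handler.py": ["auth", "api"],
    "string_utils.py": ["internal_util"],
}

# Test the riskiest code's assumptions first.
review_order = sorted(changes, key=lambda f: risk_score(changes[f]), reverse=True)
```

Even a crude score like this beats reviewing in diff order: the authentication change is examined before the string utility, which is exactly the ordering the FAQ answer argues for.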

How do you structure CI/CD to make systematic criticism of AI-generated code as abundant and fast as the generation itself?

Run verification in parallel stages: security scanners, performance benchmarks, and property tests simultaneously. Use feature flags to test AI-generated code in production with limited blast radius. Set up automated rollback triggers when metrics deviate from baseline performance.
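The parallel-stage layout can be sketched with standard-library concurrency. The three stage functions are stand-ins for real scanner, benchmark, and property-test invocations; in CI they would be separate jobs, but the gating logic is the same.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real verification stages (scanner, benchmark, property tests).
def security_scan():
    return ("security", True)

def perf_benchmark():
    return ("performance", True)

def property_tests():
    return ("properties", True)

stages = [security_scan, perf_benchmark, property_tests]

# Run all stages concurrently instead of serially.
with ThreadPoolExecutor(max_workers=len(stages)) as pool:
    results = dict(pool.map(lambda stage: stage(), stages))

# Gate the merge on every stage passing.
all_passed = all(results.values())
```

The point is structural: because the stages share no state, total verification time is the slowest stage rather than the sum of all of them, which is what makes criticism "as fast as generation."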

Key Takeaways
  1. AI code generation vastly outpaces verification capabilities—code is generated in seconds while verification takes hours. This creates a fundamental bottleneck where faster code creation doesn't translate to faster integration or maintenance, reshaping what makes engineers valuable.
  2. AI code review requires 'refutation' rather than traditional debugging: systematically testing assumptions the model made by actively trying to falsify them through edge cases, security scanning, and load testing, borrowing from Popper's scientific method.
  3. Verification costs remain expensive and increase with volume despite near-zero generation costs because models make confident guesses to fill ambiguity, surface-level code quality misleads reviewers, and business context is lost in the generation process.
  4. A three-step production-tested approach works: generate with high temperature for diverse options, systematically refute each option against constraints, and integrate only what survives testing using property-based testing and security scanning tools.
