[Project] 8. NL Glue Layer — Requirements Engineering Practice in the Vibe-Coding Era

2026-06-01

My Vibe-Coding Practice Direction — The NL Glue Layer

The Problem: AI Writes Code Too Fast

Vibe-Coding is changing the rhythm of software development. AI coding tools have accelerated code generation by 10x or even 100x. But this speed brings an obvious problem: code out of control.

The Sweet Trap of Vibe-Coding

The Vibe-Coding experience goes like this: You tell the AI — “Help me build an order system that supports placing orders, payments, and refunds. Payments use webhook callbacks, refunds need an approval workflow, and all operations must have audit logs.”

Ten minutes later, in front of you:

order_service.py — a complete order state machine (pending → paid → shipped → completed / refunded)
payment_gateway.py — payment gateway integration with signature verification, idempotency checks, and timeout retries
refund_workflow.py — a refund approval workflow with three-level approval and automatic notifications
audit_logger.py — asynchronous audit logging with query support
models.py — database models with indexes, constraints, and soft deletes
tests/ — a full suite of unit tests with seemingly 90%+ coverage

You run the tests — all green. You start the service, hit a few APIs — they all work. The code quality looks good too — complete type annotations, proper error handling, clear log levels.

But this is just the beginning. You cannot nail the entire system in one prompt — you have to go round by round with the AI, progressively refining:

Round 1: “Build a basic framework for an order system” → generates order models and basic CRUD
Round 2: “Add payment integration with webhook callbacks” → generates payment gateway and signature verification
Round 3: “Refunds need an approval workflow, three-level approval” → generates refund workflow and approval chain
Round 4: “All operations must have audit logs” → generates audit log module
Round 5, Round 6… Each round fills in details, fixes edge cases, adjusts business rules

Each round, the AI generates high-quality code based on your description. You run the tests — all green. You commit. You feel like you got a week’s worth of work done in a day.

And then the problems start to surface.

In the second week, you ask the AI to add a “partial refund” feature. The AI generates partial_refund.py — the logic is complete, tests pass. But you don’t notice:

In the original refund_workflow.py, the refund amount validation is refund_amount == order_amount (full refund only)
In the new partial_refund.py, the validation is refund_amount <= order_amount
The two modules run independently, sharing no common validation logic
If someone calls both the full refund and partial refund interfaces simultaneously, the order amount could be refunded twice

In the third week, the payment gateway changes its webhook signing algorithm. You ask the AI to update the signature verification. The AI modifies the verify_signature() function in payment_gateway.py. But you don’t notice:

refund_workflow.py also has its own signature verification — independently written by the AI when it originally generated the refund module, using the old algorithm
Payment verification passes, but refund verification fails, and the failure is swallowed by a try/except, leaving only a single line of warning log

In the fourth week, you look back at the system — 12 files, 3000 lines of code, each file professionally written, but nobody (including yourself) truly understands all the interactions between them. You don’t know which modules share logic, which operate independently, and which implicit assumptions have been violated somewhere.

This is the sweet trap of Vibe-Coding: every generation is perfect individually, but cumulatively it’s chaos. The AI only sees one file at a time; it doesn’t know the full picture of the system. And you, because the code is generated so fast, haven’t had time to build that full picture either.

Three Signals of Code Out of Control

This chaos doesn’t erupt suddenly — it manifests through three progressive signals.

Signal One: Duplicate Implementation

In the order system, the refund amount validation logic was written three separate times by the AI in three different places:

**Figure 1.3 — Three Implementations of Refund Amount Validation**

Each implementation works, each has its own “style.” But when you need to change the refund rules — say, add “the refund amount cannot exceed the amount actually paid” — you have to find all three places, modify each one, and ensure the modified behavior is consistent.

The AI won’t help you with this. It only sees one file at a time.
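
One way to keep the rule in a single place is a shared validator that every refund entry point calls. The following is a minimal sketch, not the order system's actual code: `validate_refund_amount`, `RefundValidationError`, and the `already_refunded` parameter are hypothetical names introduced for illustration, and the rule is the "cannot exceed the amount actually paid" rule from the paragraph above.

```python
from decimal import Decimal


class RefundValidationError(ValueError):
    """Raised when a refund request violates the shared amount rule."""


def validate_refund_amount(
    refund_amount: Decimal,
    paid_amount: Decimal,
    already_refunded: Decimal = Decimal("0"),
) -> None:
    """Single home for the refund amount rule (hypothetical helper).

    The full-refund and partial-refund paths would both call this
    instead of each keeping its own comparison.
    """
    if refund_amount <= 0:
        raise RefundValidationError("refund amount must be positive")
    # Assumption: earlier refunds reduce what can still be refunded.
    remaining = paid_amount - already_refunded
    if refund_amount > remaining:
        raise RefundValidationError(
            "refund amount cannot exceed the amount actually paid"
        )
```

With one function, changing the refund rule means changing one place, and the three "styles" collapse into a single behavior.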

Signal Two: Hidden Dependencies

The webhook handler in payment_gateway.py depends on a specific behavior of audit_logger.py — “when log writes fail, the main flow should not be interrupted.” But this dependency is documented nowhere; it only exists in the conversation where the AI originally generated the code.

Three weeks later, you ask the AI to optimize the audit log’s performance. The AI changes the asynchronous write to batch writing. The batch write throws an exception on failure. The payment webhook handler doesn’t catch this exception, causing the payment to succeed but the order status to not update.

Nobody knows there is a dependency between these two modules. There are no comments, no documentation, no contracts in the code.
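
The hidden rule here ("when log writes fail, the main flow should not be interrupted") can be made explicit in code. Below is a minimal sketch under my own assumptions: `audit_best_effort` is a hypothetical wrapper, and `write` stands for whatever function the audit module exposes. The contract then lives in one named place instead of in a lost conversation.

```python
import logging
from typing import Any, Callable, Dict

log = logging.getLogger("audit")


def audit_best_effort(
    write: Callable[[Dict[str, Any]], None],
    event: Dict[str, Any],
) -> bool:
    """Contract: an audit write failure must never interrupt the caller.

    Returns True if the event was written, False if the write failed.
    """
    try:
        write(event)
        return True
    except Exception:
        # The failure is recorded but deliberately not re-raised.
        log.exception("audit write failed for event %r", event.get("type"))
        return False
```

If the webhook handler calls the logger only through a wrapper like this, switching the logger to batch writes can no longer break the payment flow silently.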

Signal Three: Knowledge Loss

After the code is written, even the author can’t recall “why it was written this way.” You look at the three-level approval logic in refund_workflow.py and have no memory of why it’s three levels instead of two.

Was it because of an amount threshold? A compliance requirement? Or did the AI generate it that way and you just accepted it because it worked?

Three months later, you and any new team member face the same predicament: only code, no context.

**Figure 1.1 — Three Signals of Code Out of Control**

Why We Need an Abstraction Layer Above Code

The root cause is: Code is the detail layer, not the understanding layer.

When you go directly from intent to code generation with AI, with no buffer layer in between:

**Figure 1.2 — Direct Generation vs. Generation After Abstraction**

Without an abstraction layer, each AI generation is an independent “translation” — from natural language to code. Each translation can produce different results because the AI has no memory, no context, and no standard.

With an abstraction layer, the situation is completely different:

You first write the intent as structured natural language (e.g., User Stories)
The AI generates code from this structured description
Each generation has the same “source of truth” as its reference
Code can change, but intent remains stable

The purpose of this abstraction layer is not to slow things down, but to:

Provide a stable anchor for understanding — no matter how the code changes, the intent is fixed
Give AI-generated code a traceable origin — every piece of code can be traced back to a specific requirement
Maintain continuity of intent during code changes — when refactoring, you know what “should be done” remains unchanged

This abstraction layer could be API documentation, interface contracts, or requirements documents. I chose User Stories.

The Choice: Why User Stories

The choice of abstraction layer determines the shape of the entire workflow. Different abstraction layers have different characteristics and limitations.

Candidate Abstraction Layers

Common abstraction layers include:

API Documentation: describes interface signatures, parameters, return values
Interface Contracts: describes protocols between modules, data formats, behavioral constraints
Requirements Documents: describes functional goals, user scenarios, acceptance criteria
User Stories: describes requirements from the user’s perspective, including role, goal, and value

Each approach attempts to establish a “source of truth” above the code, but with different effects.

**Figure 2.1 — Comparison of Candidate Abstraction Layers**

Limitations of API Documentation

The strength of API documentation is precision:

Function signatures, parameter types, and return values are deterministic
Can be generated directly from code to stay in sync

But API documentation’s limitation is lack of context:

You know “what this function does,” but not “why it exists”
You know “what the parameters are,” but not “what scenarios use them”
It cannot answer “what value does this feature provide to the user”

A concrete example: You see the signature exportCSV(filters: FilterOptions): Promise<Buffer> — you know it exports CSV. But you don’t know:

Who uses this feature? (Operations? Data analysts?)
Why are they exporting? (For reports? For audits?)
What is the data volume? (100 records? 1 million?)
What should happen if the export fails? (Retry? Notify the user?)

This information is critical for the AI to generate correct code, but it’s completely absent from API documentation.

**Figure 2.2 — Information Coverage of API Documentation**

Limitations of Interface Contracts

The strength of interface contracts is clear constraints:

Define boundaries and protocols between modules
Can be validated against the contract

But interface contracts’ limitation is being overly technical:

They describe “how the system collaborates internally,” not “what the user needs”
Non-technical roles (product, operations) find them hard to understand
They cannot serve as the “source of truth for requirements”

An interface contract tells you “Module A calls Module B’s ProcessOrder method via gRPC, with a 5-second timeout and 3 retries.” But it doesn’t tell you “after a user places an order, the system should confirm the order within 3 seconds; if the payment gateway times out, the order should be preserved and the user notified to retry later.”

The former is implementation detail; the latter is a business requirement. Interface contracts can only glue code to code — they cannot glue people and code.

**Figure 2.3 — Information Coverage of Interface Contracts**

Advantages of User Stories

The structure of a User Story is:

As a {role}, I want {goal}, so that {value}

Acceptance Criteria:
- Given {context}, When {action}, Then {result}

The advantages of this structure:

User-centric: describes requirements from the user’s perspective, not the system’s perspective
Includes context: role, goal, value, and scenarios are all explicitly described
Verifiable: Acceptance Criteria provide concrete verification standards
Easy to understand: non-technical roles can comprehend it too
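
To let tools work with this structure, a story can also be held as data. The sketch below is one possible representation, not a format the toolchain prescribes: the class names `UserStory` and `AcceptanceCriterion` and the `render` method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AcceptanceCriterion:
    id: str      # e.g. "AC-1"
    given: str   # context
    when: str    # action
    then: str    # expected result


@dataclass
class UserStory:
    id: str      # e.g. "US-05"
    role: str
    goal: str
    value: str
    criteria: List[AcceptanceCriterion] = field(default_factory=list)

    def render(self) -> str:
        """Render the story back into the role/goal/value template."""
        lines = [
            f"As a {self.role}, I want {self.goal}, so that {self.value}",
            "",
            "Acceptance Criteria:",
        ]
        for ac in self.criteria:
            lines.append(
                f"- {ac.id}: Given {ac.given}, When {ac.when}, Then {ac.then}"
            )
        return "\n".join(lines)
```

Keeping role, goal, value, and each Given/When/Then as separate fields is what later makes it possible to check criteria one by one.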

Returning to the order system example, a User Story would be written like this:

As a buyer, I want to apply for a partial refund on a paid order,
so that when some items are defective, I don't need to return the entire order.

AC-1: Given the order status is paid and not yet refunded, When applying for a partial refund with amount ≤ actual paid amount, Then create a refund request and enter the approval workflow
AC-2: Given the refund amount exceeds the actual paid amount, When applying for a refund, Then reject and return "refund amount cannot exceed the amount actually paid"
AC-3: Given there is already an active refund request for the same order, When applying for another refund, Then reject and return "there is an unfinished refund request"
AC-4: Given the refund approval has passed, When executing the refund, Then call the payment gateway refund API and update the order status to partially refunded

When the AI reads this description, it knows:

Who the user is (buyer)
What to do (partial refund)
Why (some items are defective)
How to handle edge cases (excess refund, duplicate refund, approval workflow)

This information is sufficient to generate correct code and correct tests. More importantly, when the AI generates partial_refund.py, it is far less likely to conflict with the full refund logic in refund_workflow.py, because the User Story has already made the amount validation rules explicit.

But the value of User Stories goes beyond this. It provides two key capabilities:

Global Visibility: When you have 20 User Stories describing the entire order system, you can see the system’s full picture at a glance: which features are defined, which are missing, and which depend on each other. Code cannot do this, because it is scattered across dozens of files, and nobody can read all of it at once and understand how the pieces relate.

A Clear Path from Story to Code: User Stories are not documents that are written and then tossed aside — they are the starting point of development. The workflow becomes:

1. Write User Story → 2. Generate code from Story → 3. Generate tests from Story → 4. Verify code and Story consistency

This path is clear and traceable. Every piece of code can be traced back to a specific User Story, and every User Story can be checked for a corresponding code implementation.

**Figure 2.4 — User Story Structure**

**Figure 2.5 — Information Coverage of User Stories**

Positioning User Stories as Glue

User Stories are not the endpoint for requirements — they are the glue layer:

Upward, they glue user intent (product requirements, business goals)
Downward, they glue code implementation (functions, modules, interfaces)
Outward, they glue test cases (Acceptance Criteria → Test Case)

**Figure 2.6 — User Story as the Glue Layer**

As a glue layer, the core responsibility of User Stories is maintaining consistency:

Does the code implementation match the User Story’s description?
Do the test cases cover the User Story’s Acceptance Criteria?
Does the documentation reflect the User Story’s intent?

This is why I chose User Stories over API documentation or interface contracts: of the candidates compared here, User Stories are the only abstraction layer that can simultaneously glue people, code, and tests.

Core Principle: NL is the Center, Code is the Detail

Once User Stories are established as the glue layer, the entire workflow’s perspective shifts: NL (Natural Language) is the center, code is the implementation detail.

NL Doesn’t Need to Be “Complete,” It Needs to Be “Reliable”

Traditional documentation thinking pursues “completeness”:

Every feature must have documentation
Every detail must be described
Documentation must be 100% in sync with code

But as a glue layer, NL doesn’t need to be complete:

It can describe only the core features and ignore edge cases
It can describe only “what should be,” not “what currently is”
It can have repetition, redundancy, and rough descriptions

The only requirement is: NL must not be wrong.

**Figure 3.1 — NL Quality State Transitions**

**Figure 3.2 — Quality Requirements for NL**

NL Can Be Redundant, but Not Wrong

Redundant NL is not a problem:

The same feature described across multiple User Stories does not cause confusion
Redundancy instead provides multiple perspectives, enhancing understanding

Wrong NL is fatal — here “wrong” doesn’t mean code and NL are inconsistent (that’s code’s problem), but that the NL itself is wrong or self-contradictory:

US-05 says “the refund amount cannot exceed the actual paid amount,” US-08 says “the refund amount may include compensation and is allowed to exceed the actual paid amount” — the two stories contradict each other, and developers don’t know which one to follow
US-12 says “refund approval only requires a supervisor,” US-15 says “all refunds must go through financial approval” — the two stories conflict on the approval process
US-03 says “ship immediately after payment succeeds,” but the actual business rule is “ship only after T+1 settlement” — the NL itself is wrong, inconsistent with the real business rules

These NL errors are more dangerous than code bugs — because NL is the glue, the source of truth. If the NL itself is wrong, then the code, tests, and documentation generated from it are all wrong — and it’s hard to discover because “that’s what it says in the documentation.”

NL Can Be Rough, but Not Vague

Rough NL is acceptable:

“The order supports refunds” — while lacking details, the direction is correct
“Payment callbacks must be secure” — while not quantified, the intent is clear

Vague NL cannot serve as glue:

“The system should perform well” — what is “well”? Not verifiable
“Process the user’s request” — what request? How to process? Not implementable
“Better experience” — what is “better”? Not measurable

**Figure 3.3 — NL Quality Classification**

From “Code Coverage” to “NL Implementation Rate”

The traditional coverage mindset is: how much of the code is covered by tests.

The NL glue layer’s coverage mindset is bidirectional:

NL Implementation Rate: how much of the NL is implemented in code (NL → Code)
Code-NL Rate: how much of the code is covered by NL (Code → NL)

**Figure 3.4 — The Shift in Coverage Thinking**

These two metrics answer different questions:

NL Implementation Rate: how many of the requirements you wrote have become code? (Unimplemented NL is the backlog.)
Code-NL Rate: how much of the code you wrote is described by NL? (Uncovered code is “wild code” — nobody knows why it exists.)

A low code-NL rate means there is a large amount of “wild code” in the system — it exists, but no NL explains why it exists. This code is the most dangerous during refactoring, because nobody knows its business context, and changes might break some implicit business rules.

The significance of this shift:

Stop caring about “is the code covered by tests” (that’s test coverage’s job)
Care about “are the features described by NL implemented in code” (NL implementation rate)
Care about “is the code described by NL” (code-NL rate)
Unimplemented NL is backlog, not a problem
Uncovered code is a risk — it needs NL to be added or needs to be confirmed for deletion
Implemented code that contradicts NL is a problem — it must be fixed

NL coverage is the measurement tool for this mindset shift — its core question is not “is the code good,” but rather “how much of the NL has been implemented, has it been implemented correctly, and how much of the code is covered by NL.”

Three Core Metrics

Based on the “NL is the center” principle, I defined three core metrics to measure the health of NL.

NL Implementation Rate: Has This NL Been Implemented?

Definition: Number of implemented NL items / Total number of NL items (excluding vague ones)

Meaning: How many of the User Stories’ Acceptance Criteria have corresponding implementations found in the code.

Example:

US-05: As a buyer, I want to check the refund progress of an order

AC-1: Given the order has a refund request, When checking the refund progress, Then display the current approval node and estimated completion time
  → Code has get_refund_progress() function → ✅ Implemented

AC-2: Given the refund request was rejected, When checking the refund progress, Then display the rejection reason and a prompt to reapply
  → Code has no logic to display rejection reason → ❌ Not implemented

Unimplemented NL is backlog. In Agile development, backlog refers to “a list of identified but not yet implemented requirements” — simply put, “to-do items.” Unimplemented NL is your to-do list, telling you which features haven’t been done yet. This is not a problem; it’s a normal development rhythm.

NL Accuracy Rate: Is NL Consistent with the Code?

Definition: Number of implemented and consistent NL items / Number of implemented NL items

Meaning: Among the implemented NL items, how many are consistent with the code’s actual behavior.

Example:

US-08: As a finance officer, I want to approve refund requests

AC-1: Given the refund amount ≤ 1000 yuan, When the supervisor approves, Then automatically enter financial approval
  → Code has the threshold hardcoded at 500 yuan → ⚠️ Conflict
  → NL says 1000 yuan, code says 500 yuan

Conflicts are the most serious problem: when the NL glue breaks, it misleads everyone.

Code-NL Rate: Is the Code Described by NL?

Definition: Number of code behaviors covered by NL / Total number of code behaviors

Meaning: How many code behaviors have corresponding NL descriptions. This is the inverse of NL implementation rate (NL → Code) — it’s the Code → NL direction.

Example:

Code has a retry_with_backoff() function implementing exponential backoff retry logic
  → Check all User Stories, no AC describes "retry mechanism" or "backoff strategy"
  → ❌ Uncovered: this code has no NL explaining why it exists

Code has a validate_signature_v2() function
  → US-03 AC-2 describes "webhook signature verification uses HMAC-SHA256"
  → ✅ Covered: this code has NL explaining its business purpose

Uncovered code is “wild code”: it exists, but no NL explains why it exists. This code is the most dangerous during refactoring, because nobody knows its business context, and changes might break some implicit business rules.

Relationships Between Metrics

The three metrics form a bidirectional measurement system:

**Figure 4.1 — Bidirectional NL Coverage Measurement**

NL Implementation Rate (NL → Code): how many of your written requirements have become code? Unimplemented NL is backlog.
NL Accuracy Rate (NL ↔ Code Consistency): among implemented NL items, how many match the actual code behavior? Inconsistencies are conflicts.
Code-NL Rate (Code → NL): how much of your written code is described by NL? Uncovered code is “wild code” — an existing risk.

NL coverage is not a single number. It is the complete picture of this bidirectional measurement: how much of the NL has been implemented, how correctly it has been implemented, and how much of the code is covered by NL.
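
The three definitions reduce to three ratios over five counts. The sketch below shows only that arithmetic and assumes the counts already exist; `CoverageCounts`, `ratio`, and `nl_metrics` are hypothetical names. Producing the counts is the hard part, and it is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CoverageCounts:
    nl_total: int        # NL items, vague ones already excluded
    nl_implemented: int  # NL items with an implementation found in code
    nl_consistent: int   # implemented NL items that match code behavior
    code_total: int      # code behaviors identified
    code_described: int  # code behaviors covered by some NL item


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero when nothing was counted."""
    return numerator / denominator if denominator else None


def nl_metrics(c: CoverageCounts) -> Dict[str, Optional[float]]:
    return {
        "nl_implementation_rate": ratio(c.nl_implemented, c.nl_total),
        "nl_accuracy_rate": ratio(c.nl_consistent, c.nl_implemented),
        "code_nl_rate": ratio(c.code_described, c.code_total),
    }


# Illustrative numbers only, not measurements from a real project.
metrics = nl_metrics(CoverageCounts(20, 15, 12, 40, 30))
```

Note that the accuracy rate uses implemented items as its denominator, so unimplemented NL lowers the implementation rate without counting as inaccurate.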

Why Not Scoring

The traditional approach is to assign scores to NL:

Precision 80, Completeness 70, Consistency 90
Overall score 80, “Good”

But scoring has problems:

How do you set thresholds: is 80 good or is 70 good? Different projects have different standards.
Hard to act on: knowing it’s “70 points” doesn’t tell you where the specific problems are.
Easy to game: optimizing for the score rather than for quality.

My approach is to report problems directly:

Don’t tell you “this NL scores 70 points”
Tell you “this NL conflicts with the code: NL says X, code does Y”
Tell you “this NL is vague: ‘fast’ is not defined”

Problems are more actionable than scores.

Three Core Detections

Based on the three metrics, I designed three detection mechanisms to identify problems.

Conflict Detection: NL Says X, Code Does Y

Goal: Find contradictions between NL and code.

Detection Method:

Extract key values from NL (numbers, conditions, behaviors)
Find the corresponding implementation in code
Compare the NL description with the code’s actual behavior
If inconsistent, mark as a conflict

**Figure 5.1 — Conflict Detection Flow**

Example:

US-08, AC-1: "When the refund amount ≤ 1000 yuan, after supervisor approval, automatically enter financial approval"
  → Code refund_workflow.py:67 threshold hardcoded at 500 yuan
  → ⚠️ Conflict: NL says 1000 yuan, code says 500 yuan

US-03, AC-2: "Payment webhook signature verification uses HMAC-SHA256"
  → Code payment_gateway.py:34 uses MD5
  → ⚠️ Conflict: NL says HMAC-SHA256, code uses MD5

Conflicts have the highest priority: they directly break the value of NL as glue.
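
For numeric values, the comparison step can be sketched in a few lines. This is a simplified illustration, not the detection tool itself: it assumes the value used in code has already been located (the hard part, which could be done by an AI or by static analysis), and `Conflict`, `extract_amounts`, and `check_threshold` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Conflict:
    story: str
    ac: str
    nl_value: str
    code_value: str
    location: str


def extract_amounts(nl_text: str) -> List[int]:
    """Pull integer amounts written as '<number> yuan' out of an AC sentence."""
    return [int(m) for m in re.findall(r"(\d+)\s*yuan", nl_text)]


def check_threshold(
    story: str, ac: str, nl_text: str, code_value: int, location: str
) -> Optional[Conflict]:
    """Compare the first amount in the NL with the value found in code."""
    amounts = extract_amounts(nl_text)
    if not amounts or amounts[0] == code_value:
        return None  # nothing to compare, or NL and code agree
    return Conflict(
        story, ac, f"{amounts[0]} yuan", f"{code_value} yuan", location
    )


conflict = check_threshold(
    "US-08", "AC-1",
    "When the refund amount ≤ 1000 yuan, after supervisor approval, "
    "automatically enter financial approval",
    code_value=500,
    location="refund_workflow.py:67",
)
```

The result carries both values and the location, which matches the "NL says X, code does Y" shape of the report.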

Fuzziness Detection: NL That Can’t Glue Downstream

Goal: Find NL that cannot serve as input for downstream.

Detection Method:

Check whether NL contains specific values (numbers, conditions, states)
Check whether NL uses subjective language (“fast,” “good,” “better”)
Check whether NL can be transformed into test cases
If it cannot be transformed, mark as fuzzy

**Figure 5.2 — Fuzziness Detection Flow**

Example:

US-12, AC-1: "The refund approval process must be secure"
  → "Secure" is not defined → ⚠️ Fuzzy
  → Suggestion: change to "requires three-level approval, and approvers at each level cannot be the applicant"

US-15, AC-1: "Payment callbacks must be handled well"
  → What counts as "handled well"? Idempotent? Retry? Timeout? → ⚠️ Fuzzy
  → Suggestion: specify the idempotency strategy, retry count, and timeout duration

Fuzziness is not an error, but it needs improvement; otherwise the NL cannot glue downstream.
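
The first two checks can be approximated with a crude heuristic. The sketch below is deliberately naive and rests on my own assumptions: the word list in `SUBJECTIVE_WORDS` is made up for illustration, and "contains a digit" stands in for "contains a specific value". Whether an AC can really be turned into a test case still needs human or AI judgment.

```python
import re
from typing import List

# Assumed word list; a real project would maintain its own.
SUBJECTIVE_WORDS = {"fast", "good", "better", "well", "secure", "properly"}


def fuzzy_points(ac_text: str) -> List[str]:
    """Return subjective words in an AC that has no concrete value."""
    if re.search(r"\d", ac_text):
        # Crude proxy: a number suggests the AC pins something down.
        return []
    words = set(re.findall(r"[a-z]+", ac_text.lower()))
    return sorted(words & SUBJECTIVE_WORDS)
```

An empty list means "no fuzzy point found by this heuristic", not "the AC is precise".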

Backlog Detection: Visualizing To-Do Items

Goal: Find NL items that haven’t been implemented in code yet, forming a to-do list.

Detection Method:

Extract the Acceptance Criteria from NL
Search for corresponding implementations in code
If not found, add to backlog

**Figure 5.3 — Backlog Detection Flow**

Example:

US-09, AC-1: "Support refunds to the original payment channel (WeChat/Alipay/bank card)"
  → Code only implements refund to balance, no original-channel refund logic → 📋 Added to backlog

US-11, AC-1: "Refund approval supports batch operations (one-click approve/reject multiple items)"
  → Code only has single-item approval interface → 📋 Added to backlog

Backlog is not a problem: it’s a visualization of to-do items, telling you which features haven’t been implemented yet.
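
The search step is where most of the difficulty lies. One cheap approximation is to require trace comments in code and treat any AC without one as backlog. The sketch below assumes a hypothetical convention, a comment of the form `# implements: US-09/AC-1`; this is my illustration, not a convention the toolchain defines.

```python
import re
from typing import Dict, List, Set

# Hypothetical convention: code carries "# implements: US-09/AC-1" comments.
TAG = re.compile(r"#\s*implements:\s*(US-\d+/AC-\d+)")


def implemented_ids(source_files: Dict[str, str]) -> Set[str]:
    """Collect every story/AC id referenced by a trace comment."""
    found: Set[str] = set()
    for text in source_files.values():
        found.update(TAG.findall(text))
    return found


def backlog(all_ac_ids: List[str], source_files: Dict[str, str]) -> List[str]:
    """AC ids with no trace comment anywhere, in their original order."""
    done = implemented_ids(source_files)
    return [ac_id for ac_id in all_ac_ids if ac_id not in done]


files = {"refund_workflow.py": "# implements: US-08/AC-1\ndef approve(): ...\n"}
todo = backlog(["US-08/AC-1", "US-09/AC-1", "US-11/AC-1"], files)
```

A tag only claims that an implementation exists; whether that implementation is correct remains the job of conflict detection.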

Detection Priority

The priority of the three detections:

**Figure 5.4 — Detection Priority**

Conflict: NL contradicts code — the glue is broken, must be fixed
Fuzzy: NL cannot glue downstream — suggested for improvement
Unimplemented: NL hasn’t been implemented yet — it’s backlog, not a problem

Practice: From Handling Code to Handling User Stories

The ultimate goal of theory is to guide practice. Establishing the NL glue layer means a paradigm shift in how you work.

The Shift in Workflow

Traditional workflow:

Requirements → Code → Tests → Deployment

NL glue layer workflow:

Requirements → User Story → Code → Tests → Verify NL Consistency

**Figure 6.1 — Workflow Comparison**

Key changes:

The starting point changed: from “writing code” to “writing User Stories”
The endpoint changed: from “code works” to “NL and code are consistent”
The verification changed: from “tests pass” to “NL is implemented and accurate”

Design Philosophy of an NL Coverage Tool

Based on the above theory, an ideal NL coverage tool should have the following characteristics:

Core Features:

Extract all User Stories’ Acceptance Criteria
Search for corresponding implementations in code
Detect conflicts, fuzziness, and unimplemented items
Generate an NL coverage report

Design Principles:

No scoring: report problems directly, don’t give scores
Actionable: each problem has a specific description and suggestion
Clear priority: conflict > fuzzy > unimplemented

Output Example:

# NL Coverage Report

## Conflicts (NL and Code Inconsistent)
| Story | AC | NL Description | Actual Code Behavior | Location |
|:---|:---|:---|:---|:---|
| US-08 | AC-1 | Refund amount ≤ 1000 yuan: after supervisor approval, automatically enter financial approval | Threshold hardcoded at 500 yuan | refund_workflow.py:67 |

## Fuzzy (Cannot Glue Downstream)
| Story | AC | Description | Fuzzy Point |
|:---|:---|:---|:---|
| US-12 | AC-1 | The refund approval process must be secure | "Secure" not defined |

## Unimplemented (NL Not Yet Implemented)
| Story | AC | Description |
|:---|:---|:---|
| US-09 | AC-1 | Refund to original payment channel |
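
A report of this shape can be assembled from a flat list of findings. The sketch below is a simplified illustration: `Finding` and `render_report` are hypothetical names, and every section uses the same three columns, so the extra columns of the conflicts table are folded into the description.

```python
from dataclasses import dataclass
from typing import List

# Priority from the design principles: conflict > fuzzy > unimplemented.
PRIORITY = {"conflict": 0, "fuzzy": 1, "unimplemented": 2}
TITLES = {
    "conflict": "Conflicts (NL and Code Inconsistent)",
    "fuzzy": "Fuzzy (Cannot Glue Downstream)",
    "unimplemented": "Unimplemented (NL Not Yet Implemented)",
}


@dataclass
class Finding:
    kind: str         # one of the PRIORITY keys
    story: str
    ac: str
    description: str


def render_report(findings: List[Finding]) -> str:
    """Group findings by kind and emit sections in priority order."""
    lines = ["# NL Coverage Report"]
    for kind in sorted(PRIORITY, key=PRIORITY.get):
        rows = [f for f in findings if f.kind == kind]
        if not rows:
            continue  # omit empty sections
        lines += [
            "",
            f"## {TITLES[kind]}",
            "| Story | AC | Description |",
            "|:---|:---|:---|",
        ]
        lines += [f"| {f.story} | {f.ac} | {f.description} |" for f in rows]
    return "\n".join(lines)
```

There is no score anywhere in the output: each row is a concrete problem that someone can act on.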

Integration with the Existing Toolchain

The NL coverage tool is not an isolated tool — it integrates with the entire req toolchain:

Integration with req-refresh:

After refreshing requirements documents, automatically run NL coverage analysis
When conflicts are found, prompt the user to fix them

Integration with req-catalog:

When generating the requirements catalog, embed NL coverage metrics
Provide a global view of NL health

Integration with the req pipeline:

During the justify phase, check NL coverage
If the number of conflicts is too high, mark as “NL gap”

**Figure 6.2 — NL Coverage Tool and Toolchain Integration**

Practical Use Scenarios

Scenario One: New Feature Development

Write User Stories and Acceptance Criteria
Use AI to generate code
Run the NL coverage tool to check implementation status
Fix conflicts, improve fuzzy NL
Confirm NL and code are consistent, then commit

**Figure 6.3 — New Feature Development Flow**

Scenario Two: Code Refactoring

Run the NL coverage tool to check current NL coverage
Discover conflicts: NL says X, code does Y
Decide: change the NL or change the code
Refactor the code
Run the NL coverage tool again to confirm consistency

**Figure 6.4 — Code Refactoring Flow**

Scenario Three: Requirements Review

Run the NL coverage tool to generate a report
Review the backlog: these are to-do items
Review fuzzy NL: these are requirements that need refinement
Review conflicting NL: these are issues that need immediate fixes
Conduct the requirements review based on the report

**Figure 6.5 — Requirements Review Flow**

The Truth About Testing: The Problem with AI-Generated Tests

The NL glue layer not only changes the development workflow, but also exposes a fundamental flaw in how AI writes unit tests.

The Trap of AI-Generated Tests

The current approach to AI-generated unit tests is:

Read the code implementation
Generate test cases based on the code logic
Tests pass 100%

This looks perfect, but it’s actually circular reasoning:

Code says X, test verifies X → Pass
Code says Y, test verifies Y → Pass
Code is wrong, the test is wrong too → Still passes

**Figure 7.1 — The Circular Reasoning of AI-Generated Tests**

As evidence of correctness, this kind of testing is worth close to nothing: it can only prove that “the code does what the code does,” but cannot prove that “the code does what it should do.”

Tests Should Come from User Stories

The correct path for test generation:

Generate tests from the User Story’s Acceptance Criteria
The tests describe “what should be,” not “what currently is”
If the code is wrong, the tests will fail → conflicts are discovered

**Figure 7.2 — Generating Tests from User Stories**

This difference is fundamental:

Code-based tests: verify the code’s self-consistency (is the code consistent with itself?)
NL-based tests: verify the code’s correctness (is the code consistent with the requirements?)

A Concrete Example

Suppose the User Story says:

US-08: As a finance officer, I want to approve refund requests

AC-1: Given the refund amount ≤ 1000 yuan, When the supervisor approves, Then automatically enter financial approval

AI-generated test based on code (where the threshold is hardcoded at 500 yuan):

def test_refund_approval_auto_forward():
    refund = create_refund(amount=500)
    approve_by_supervisor(refund)
    assert refund.status == "finance_review"  # ✅ Passes, but verifies the wrong threshold

Test generated based on User Story:

def test_refund_approval_auto_forward():
    refund = create_refund(amount=1000)
    approve_by_supervisor(refund)
    assert refund.status == "finance_review"  # ❌ Fails! Code threshold is 500, NL says 1000

Only the second type of test can discover the problem.
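
The difference can be reproduced end to end with a toy implementation. Everything below is a stand-in I wrote for illustration: the `Refund` dataclass, the `manual_review` status, and the `passes` helper are assumptions, and the only facts taken from the example are the two thresholds and the `finance_review` status.

```python
from dataclasses import dataclass

CODE_THRESHOLD = 500   # what the code does (hardcoded)
NL_THRESHOLD = 1000    # what US-08 AC-1 says


@dataclass
class Refund:
    amount: int
    status: str = "pending"


def approve_by_supervisor(refund: Refund) -> None:
    """Toy implementation with the wrong threshold baked in."""
    if refund.amount <= CODE_THRESHOLD:
        refund.status = "finance_review"
    else:
        refund.status = "manual_review"  # hypothetical fallback status


def passes(amount: int) -> bool:
    """Run the shared assertion for a given boundary amount."""
    refund = Refund(amount=amount)
    approve_by_supervisor(refund)
    return refund.status == "finance_review"


code_derived_result = passes(CODE_THRESHOLD)  # boundary copied from the code
nl_derived_result = passes(NL_THRESHOLD)      # boundary taken from the AC
```

Both checks run the same assertion. The only thing that differs is where the boundary value comes from, and that alone decides whether the defect is visible.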

NL as the Source of Truth for Testing

This further reinforces the positioning of the NL glue layer:

**Figure 7.3 — NL as the Source of Truth for Testing**

The complete value chain of the NL glue layer:

NL → Code: guides implementation
NL → Tests: generates verification standards
Tests → Code: verifies consistency
Conflict Detection: discovers contradictions between NL and code

When tests come from NL, not from code, tests have real value — they verify that “the code does what it should do,” not that “the code does what the code does.”

The Bigger Picture: NL as a Universal Glue Layer

When we extend the NL glue layer mindset beyond “code-to-code,” a bigger picture emerges: AI itself is the glue layer, and NL is its adhesive.

Glue Between Code and Code

This is the most direct level, and the core of what we’ve discussed earlier:

Dependencies between module A and module B are clearer when described in NL than in code comments
Contracts between functions are more complete when expressed as User Stories than as type signatures
Intent preservation during refactoring is more reliable when anchored in NL than inferred from code

The AI’s role here is as a translator: translating NL into code, code into tests, and tests into reports. NL is the adhesive; AI is the executor.

Glue Between Programs and Programs

When integrating multiple systems, the value of NL becomes even more apparent:

System A’s API says “return user info,” System B expects “return customer profile” — are these the same thing?
Data flow between microservices: defined in code, you see only the format; defined in NL, you also see the semantics
Third-party integration: their documentation says “supports batch operations,” but how large is a batch? 100? 10,000?

**Figure 8.1 — NL as Inter-System Glue**

The traditional approach uses IDL (Interface Definition Language) or OpenAPI Spec to define interfaces. These are precise, but they lack semantics. NL adds the semantic layer — it defines not only “what the format is,” but also “what this means.”
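One way to picture this semantic layer is to attach an NL meaning to every field next to its schema type. The sketch below is illustrative only: `FieldContract`, `semantic_mismatches`, and the two sample contracts are hypothetical, and comparing meanings by exact string equality is a deliberate simplification of what a reviewer or an AI would actually judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldContract:
    name: str
    schema_type: str  # format: what an IDL or OpenAPI schema already captures
    meaning: str      # semantics: the NL description of what the value represents


def semantic_mismatches(
    provider: List[FieldContract], consumer: List[FieldContract]
) -> List[str]:
    """Return names of fields whose types agree but whose NL meanings differ."""
    expected = {field.name: field for field in consumer}
    return [
        field.name
        for field in provider
        if field.name in expected
        and field.schema_type == expected[field.name].schema_type
        and field.meaning != expected[field.name].meaning
    ]


# Hypothetical contracts for System A (provider) and System B (consumer).
system_a = [
    FieldContract("id", "string", "account login identifier"),
    FieldContract("created_at", "string", "ISO 8601 timestamp of registration"),
]
system_b = [
    FieldContract("id", "string", "CRM customer number"),
    FieldContract("created_at", "string", "ISO 8601 timestamp of registration"),
]

print(semantic_mismatches(system_a, system_b))  # ['id']
```

A format-only check would accept both fields, since the types match. The `id` field is flagged only because the two NL descriptions disagree about what the value refers to.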

Glue Between People and People

This is the most easily overlooked level. In software development, the cost of communication between people often exceeds the cost of writing code:

The product manager says “the user experience should be good,” the developer interprets it as “response must be fast,” the tester interprets it as “must have no bugs”
The frontend says “this component should be reusable,” the backend interprets it as “needs abstraction,” the designer interprets it as “needs consistency”
A new member asks “why was this feature done this way,” the veteran says “that’s how it was decided back then”

The value of NL as a glue layer:

Eliminates ambiguity: define “good” as “< 200ms response time”
Aligns understanding: define “reusable” as “supports 3+ scenarios”
Conveys context: write the “why” into the value clause of a User Story
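As a rough sketch, the first item can be written down as a checkable criterion. The `AcceptanceCriterion` class and its field names are assumptions made for illustration; the 200 ms limit is the one given in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriterion:
    vague_term: str  # the word the roles disagreed about
    metric: str      # what is actually measured
    limit: float     # threshold taken from the NL definition
    unit: str

    def is_met(self, measured: float) -> bool:
        # "< 200ms" is a strict upper bound, so exactly 200 does not pass.
        return measured < self.limit


good_ux = AcceptanceCriterion("good", "response time", 200, "ms")

print(good_ux.is_met(150))  # True
print(good_ux.is_met(200))  # False
```

Once “good” is tied to a metric and a limit, the product manager, developer, and tester are all checking the same number instead of three private interpretations.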

**Figure 8.2 — NL as Inter-Team Glue**

The AI’s role here is as an aligner: unifying the NL expressions from different roles into the same source of truth, transforming vague consensus into precise acceptance criteria.

Glue Between Teams and Teams

As an organization grows, collaboration between teams becomes a bottleneck:

The platform team says “we provide the user service,” the business team says “what I need is customer management”
The data team says “the data warehouse is ready,” the analytics team says “I can’t find the metrics I need”
The security team says “compliance requirements are met,” the product team says “the user registration flow is too complicated”

The traditional solutions are architecture review meetings, cross-team documentation, and API gateways. These are all effective, but they share a common problem: they are costly to update and quickly become outdated.

The advantages of the NL glue layer:

Low-cost updates: changing a User Story is 10x faster than changing an architecture document
High-fidelity transmission: NL is easier for non-technical people to understand than code
Verifiability: NL can be automatically checked for consistency by AI

**Figure 8.3 — NL as Cross-Team Glue**

The Essence of AI as a Glue Layer

Let’s return to the most fundamental question: what is the role of AI in software development?

Traditional view: AI is an accelerator — making code writing faster.

My view: AI is a glue layer — binding different things together.

**Figure 8.4 — The Panorama of AI as a Glue Layer**

And NL (Natural Language) is the adhesive of this glue layer:

Without NL, AI is just a tool for quickly generating code
With NL, AI becomes the hub connecting intent, code, tests, and teams

This is why NL coverage is so important — it doesn’t measure “how well the code is written,” but “how firmly the glue is bonded.”

When NL coverage is high and accurate:

Code changes have a traceable trail
Test failures have a discoverable cause
Team collaboration has a reliable basis
System evolution has a trackable direction

When NL coverage is low or inaccurate:

Code becomes a black box
Tests become decoration
Team collaboration relies on guesswork
System evolution relies on luck

Conclusion: Reliable Glue

Vibe-Coding has made code generation incredibly fast, but this speed needs to be constrained and guided.

**Figure 9.1 — NL Glue Layer Panorama**

Establishing the NL glue layer is essentially a paradigm shift in how we work:

From “handling code” to “handling User Stories”
From “code coverage” to “NL implementation rate”
From “tests pass” to “NL and code are consistent”

The significance of this shift:

Code is detail: AI can generate it quickly, but NL is the source of truth
NL is the center: it binds user intent, code implementation, and test cases together
Consistency is the goal: NL and code must be consistent, otherwise the glue is broken

Three core metrics (implementation rate, accuracy rate, code-NL rate) and three core detections (conflict, fuzzy, unimplemented) provide a framework for measurement and improvement.
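For a sense of how such metrics could be tracked, the sketch below assumes each one is a simple ratio. These definitions are assumptions made for illustration and should defer to the definitions given earlier in this article; the `Story` fields, the `nl_metrics` function, and the sample counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Story:
    story_id: str
    implemented: bool           # some code claims to implement this story
    consistent_with_code: bool  # that code actually matches the story


def ratio(part: int, whole: int) -> float:
    # Guard against empty projects rather than dividing by zero.
    return part / whole if whole else 0.0


def nl_metrics(
    stories: List[Story], code_units_total: int, code_units_with_story: int
) -> Dict[str, float]:
    implemented = [s for s in stories if s.implemented]
    consistent = [s for s in implemented if s.consistent_with_code]
    return {
        # Assumed definition: implemented stories / all stories
        "implementation_rate": ratio(len(implemented), len(stories)),
        # Assumed definition: consistent stories / implemented stories
        "accuracy_rate": ratio(len(consistent), len(implemented)),
        # Assumed definition: code units traced to a story / all code units
        "code_nl_rate": ratio(code_units_with_story, code_units_total),
    }


# Made-up sample data, not measurements from a real project.
stories = [
    Story("US-01", implemented=True, consistent_with_code=True),
    Story("US-02", implemented=True, consistent_with_code=True),
    Story("US-08", implemented=True, consistent_with_code=False),
    Story("US-09", implemented=False, consistent_with_code=False),
]

metrics = nl_metrics(stories, code_units_total=10, code_units_with_story=7)
print(metrics)
```

Under these assumed definitions, the three detections map onto the same data: a story that is implemented but inconsistent is a conflict, and a story that is not implemented is an unimplemented requirement.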

The ultimate goal is not “perfect documentation,” but reliable glue — enabling NL to effectively bind upstream and downstream, keeping Vibe-Coding fast while remaining under control.