I was recently listening to an excellent podcast from The Economist (Boss Class) about vibe coding, and one thing really stood out. One theory for why software development has been one of the main focuses of AI development over the last year or so is that, unlike other industries, it is verifiable.
Code is a verifiable domain. There are ways to see if code is working and test it, such that the model can take that as a feedback loop and essentially get better in this domain.
Sarah Guo
Feeling the vibe – Boss Class
In addition to publicly available code for the models to train on, we get a feedback loop that can be tested and iterated on.
The key word in all this is ‘tested’. A lot of vibe coded stuff is likely to make a very good MVP, but could you ship it? Is it secure? Does it meet user needs? Does it meet legal requirements (GDPR, Safety)? Is it accessible?
A long time ago in a galaxy far, far away, I worked alongside a lot of very good Ruby on Rails developers and a lot of very good UX Designers, and there were a lot of ‘tests’, both computer and user centred. Both kinds were written in a human-readable form:
Ruby on Rails development
Behavior-driven development (BDD), often using frameworks like Cucumber or Beha
Scenario: I can create a discussion in a group when I am a group member
Given I am logged in
And a group with title "Ruby" exists
And I am a member of the group "Ruby"
When I create a discussion within the group "Ruby" with the title "Writing tests"
And I visit the group with title "Ruby"
Then I should see the discussion "Writing tests" is within the group "Ruby"
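To make the mapping from readable steps to executable checks concrete, here is a minimal Python sketch of the scenario above. It is not the original Rails/Cucumber step definitions: the `Group` class, its `create_discussion` method and the user name `alice` are hypothetical stand-ins, and the steps appear as comments rather than being wired up through a BDD framework.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical in-memory stand-in for the real group model."""
    title: str
    members: set = field(default_factory=set)
    discussions: list = field(default_factory=list)

    def create_discussion(self, user: str, title: str) -> None:
        # Only members may start a discussion in the group.
        if user not in self.members:
            raise PermissionError(f"{user} is not a member of {self.title}")
        self.discussions.append(title)


def test_member_can_create_discussion() -> None:
    # Given I am logged in
    current_user = "alice"
    # And a group with title "Ruby" exists
    group = Group(title="Ruby")
    # And I am a member of the group "Ruby"
    group.members.add(current_user)
    # When I create a discussion within the group "Ruby" with the title "Writing tests"
    group.create_discussion(current_user, "Writing tests")
    # And I visit the group with title "Ruby"
    visible = list(group.discussions)
    # Then I should see the discussion "Writing tests" is within the group "Ruby"
    assert "Writing tests" in visible


# Run the scenario directly; a test runner such as pytest would also collect it.
test_member_can_create_discussion()
```

Each Given/When/Then line becomes one small, checkable step, which is what lets a human or an agent see exactly which part of the behaviour broke.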
UX Discovery
The same readable, natural language approach showed up on the design side too:
User Stories – short statements about features – effectively acceptance criteria to measure success – written from a user’s perspective, for example:
As a < type of user >, I want < some goal > so that < some reason >.
As a < WHO >, I want < WHAT > so that < WHY >.
As a < type of user >, I do < WHAT > and expect < WHAT >.
As we often worked with Agile UX methods, we would include designers and developers in the process, and by understanding the user need we could build more effective products.
By writing out tests and baking them into the acceptance criteria in Kanban-style project management, we were able to make sure accessibility was not an afterthought but integrated into the process:
As a blind user, I want to enter my birth year directly so that I don’t have to navigate through every option in a dropdown list.
For security, too, we can bake in expected-failure scenarios:
@security-critical
Scenario Outline: Block various path traversal attempts
When I try to run "mkdir"
Then the command should fail
And the output should contain ""
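As a rough illustration of what an expected-fail check like this might exercise, here is a minimal Python sketch. It is an assumption-laden stand-in, not the code behind the scenario above: `resolve_inside`, `is_blocked`, the `sandbox` directory and the list of attempts are all hypothetical names and values chosen for the example.

```python
from pathlib import Path


def resolve_inside(base: Path, user_path: str) -> Path:
    """Return the resolved target, or raise ValueError if it escapes base."""
    base = base.resolve()
    target = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when target is not under base.
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes sandbox: {user_path!r}") from None
    return target


def is_blocked(base: Path, attempt: str) -> bool:
    """True when the attempt is rejected, which is the outcome we expect."""
    try:
        resolve_inside(base, attempt)
    except ValueError:
        return True
    return False


SANDBOX = Path("sandbox")  # hypothetical working directory for the tool

# Expected-fail cases: every traversal attempt must be rejected.
for attempt in ["../outside", "docs/../../outside", "/etc/passwd"]:
    assert is_blocked(SANDBOX, attempt)

# A path that stays inside the sandbox is still allowed.
assert not is_blocked(SANDBOX, "docs/notes")
```

The point is the shape of the test rather than the helper itself: the suite passes only when the dangerous inputs fail.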
We did this not just because we wanted secure, performant code. We wanted to meet the user need, and by using natural language in our tests, informed by user stories also written in natural language, we were better able to meet the needs of the actual users of the product, rather than building something wished into existence from a “Make me a Spotify of xyz”.
Gen AI development also uses natural language for its prompts, so these methods can greatly improve the output: we can create and run new tests as each new feature is added from a backlog or PLAN.md file.
Consider this approach to developing something from a single prompt or a plan:
## Custom Command: /red-green-refactor
During this session we will practice TDD in the red-green-refactor cycle.
Rules:
1. First write a test for the smallest possible unit of functionality.
2. Run the test and it should fail (red).
3. Write the minimum code to make the test pass (green).
4. Refactor for readability and maintainability (refactor).
5. If more than one test fails, focus on the single most important failure and run only that test until it passes.
6. Do not use TODO tools or long-term plans; rely on the human for the next step.
Output discipline:
- Label each response as RED, GREEN, or REFACTOR.
- Explain why the step is in that phase.
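As a small worked example of the cycle described in the rules above, here is a Python sketch built around the birth-year story from earlier. It shows the end state after two cycles, with comments marking which phase each piece came from. The function name `parse_birth_year` and the `EARLIEST_YEAR` bound are hypothetical, chosen only for illustration.

```python
# RED (cycle 1): written first. Run before parse_birth_year exists,
# this fails with NameError, which is the expected red result.
def test_typed_year_is_accepted() -> None:
    assert parse_birth_year("1984") == 1984


# RED (cycle 2): added only once the first test was green. Against a
# bare int(text) implementation this fails, because "84" is accepted.
def test_two_digit_year_is_rejected() -> None:
    try:
        parse_birth_year("84")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a two-digit year")


# REFACTOR: the bound started life as an inline number in the GREEN
# step and was then given a name, with both tests re-run afterwards.
EARLIEST_YEAR = 1900  # hypothetical lower bound, for illustration only


# GREEN: the minimum code that makes both tests pass.
def parse_birth_year(text: str) -> int:
    year = int(text)  # int() itself raises ValueError for non-numeric input
    if year < EARLIEST_YEAR:
        raise ValueError(f"birth year out of range: {text!r}")
    return year


# Run the suite; both tests should pass silently.
for test in (test_typed_year_is_accepted, test_two_digit_year_is_rejected):
    test()
```

Keeping each cycle this small is what makes rule 5 practical: when something breaks, there is usually only one recent change that could have caused it.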
Personally, I’ve found this really useful. It’s based on Kent Beck’s Test-Driven Development, which he even admits was a rediscovery of earlier methods of testing expected outcomes from the age of tape.
I’ve noticed the AI agent tends to pick up more mistakes this way, because it has to re-run scenarios that might break as we add complexity. This also reduces context bloat and the iceberg effect.
TL;DR: I’d rather spend ten minutes writing a scenario in plain English than an hour debugging something that was never tested against a real user need to begin with.
The full flow would then be:
User stories (WHY)
→ BDD scenarios (WHAT, in natural language)
→ TDD red-green-refactor (HOW, one test at a time)
→ Agent implements minimum code
→ Suite stays green