How BDD Helps Produce Correct Code with ChatGPT and Why Automated Tests Are Still Important

Andrey Lebedev
11 min read · Sep 4, 2024

TL;DR?

Software development as we know it today will very soon transform into pure system/business analysis. That is why methodologies that help build clear and unambiguous requirements will be in high demand. In the interim, before AI has grown to its full extent, automated tests generated by AI serve as important feedback that helps improve the quality of the main code.

The main message

Behaviour Driven Development (BDD) might become an industry standard in the near future with the onset of AI-powered software development. Here, I will explain why this is inevitable in my opinion.

Let me put it straight. Many of my friends and colleagues see the future of the traditional Software Engineering profession differently, but I believe that it will pretty soon become a very narrow and scarcely needed role, akin to Fortran/Cobol developers nowadays. In the long run, the Software Engineering profession will evolve into the Business Analyst role, and those who are skilled in producing clear requirements will take it over. I will try to prove this to you below, so stay tuned.

Punch Cards and ChatGPT

I’ll start with a simple but sometimes forgotten fact: all programming languages were created in the first place for humans, to facilitate their interaction with computers. Disagree? Well, recall that any computer program is nothing but a sequence of zeros and ones in the memory of the machine. At the onset of the computer era, the first engineers had to create programs by translating their algorithms into these sequences of bits, in other words by writing instructions for the computer directly in binary form. Have you ever heard of punch cards?

If you are young enough, you might never have heard of them. On the other hand, it is unlikely that anyone who really worked with them would read this article. I held them in my hands multiple times, mostly because in the Soviet Union and the early post-Soviet era these punch cards were used a lot in schools as learning material… though not in the way you might think: one side of a punch card had no inscriptions, so teachers wrote or printed things on it and used it as a teaching card.

Maksym Kozlenko, CC BY-SA 4.0, via Wikimedia Commons

So, what I am driving at: in half a century we moved from punched cards to modern programming languages that closely resemble natural language (while following a strict syntax, of course). The vector of this evolution is pretty clear: we humans try to make programs resemble our everyday natural language as much as possible, up to the point where the need for clarity exceeds the need for speed. Clean Code techniques and Domain-driven design approaches (ubiquitous language in particular) aim at the same goal: make the code easy to understand and connected with the context of the requirements. Essentially, what the new epoch of LLMs has brought us is a leap in this evolution to a new milestone, where we can tell the computer what to do in natural language without respecting any strict syntax, code formatting rules or list of reserved keywords. For now, this is done by translating requirements given as natural language text into code in one of the programming languages, but this is just the beginning: I bet that in a few years LLMs like ChatGPT will be capable of creating compiled executable artefacts directly, such as JVM bytecode, to make things portable.

If someone thinks I am exaggerating, let me explain. I myself already actively use ChatGPT to create frontend implementations for proofs of concept. The process is straightforward: I do not lift a finger to add any code to the generated solutions, I just ask the LLM to add new functions or extend existing ones. The same applies to practical routine tasks, like creating a Python script that calculates the sum of all insurance reimbursements for the year (from reimbursement documents in PDF format).

Now, how does it follow from what I outlined above that the profession of software engineer will sink into history? Just as the skill of punching cards with binary code has no practical value nowadays, the same will apply to the skill of describing algorithms in a programming language. This will not happen in one day, but rather gradually. It has already started with the various AI-driven assistants that help to autocomplete code. The profession of software developer will gradually transform into the profession of a business analyst. At the end of the day, it will not matter whether one knows how to create a class or a method, or how to write a loop or a switch, but rather how good one is at giving prompts to an AI assistant. And this is where BDD comes into play. Let me elaborate.

The essential ideas of BDD include:

  • using a ubiquitous language (or domain-specific language) for describing requirements;
  • sharing understanding of the requirements between all involved parties, which is achieved via providing concrete examples describing the behaviour of the feature in question.

These two points primarily serve to eliminate or significantly reduce the ambiguity of the requirements, to help those who give the requirements better understand what they actually expect to get as a result, and finally to cover edge-case scenarios that clearly outline the boundaries of the described functionality. And this is exactly what a good prompt requires in order to get the right result from an LLM. Let us illustrate this with a concrete example.

Non-BDD prompting

I have a huge genealogical tree with more than two thousand persons in it, stored in my genealogical application. I need to share this tree with my relatives from Spain. This is easily done by exporting the entire database in the GEDCOM format, a plain text format describing a tree. The only problem is that my tree contains names predominantly written in the Cyrillic alphabet, and my Spanish relatives have no clue how to read them. So I need to quickly transliterate all names. Let us ask ChatGPT to create a script for that:

I have my genealogical tree saved in a GEDCOM file. It contains names in Cyrillic alphabet. I need to transliterate all names to Latin.

ChatGPT quickly produces the following Python script:

import re
from transliterate import translit

# Function to transliterate a name from Cyrillic to Latin
def transliterate_name(name):
    return translit(name, 'ru', reversed=True)

# Function to process the GEDCOM file
def transliterate_gedcom(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    with open(output_file, 'w', encoding='utf-8') as f:
        for line in lines:
            # Check if the line contains a name (starts with 1 NAME)
            if line.startswith('1 NAME'):
                # Extract the name part
                match = re.match(r'1 NAME (.+)', line)
                if match:
                    original_name = match.group(1)
                    # Transliterate the name
                    transliterated_name = transliterate_name(original_name)
                    # Replace the original name with the transliterated name
                    line = f'1 NAME {transliterated_name}\n'
            f.write(line)

# Example usage
input_file = 'your_input_file.ged'
output_file = 'your_output_file.ged'

transliterate_gedcom(input_file, output_file)

If you are not familiar with the GEDCOM format (why would you be, really?), you will not spot any problem in this script. Let me explain what is wrong with it: it does not transliterate every place where given names, surnames and maiden names are stored.

I thought, ‘Alright’, and asked ChatGPT to fix the problem:

This script does not transliterate maiden names

ChatGPT apologises as usual and suggests a slightly modified version of the script by adding a new function:

# Function to transliterate a full name including maiden names
def transliterate_full_name(full_name):
    # Split the full name into parts (e.g., "First /Maiden/ Last")
    parts = re.split(r'(/[^/]+/)', full_name)  # This splits the name by slashes but keeps the slashes
    # Transliterate each part separately
    transliterated_parts = [transliterate_name(part) if part.startswith('/') else transliterate_name(part) for part in parts]
    return ''.join(transliterated_parts)

This does not solve the problem! Time to give you some explanation. In the GEDCOM format names are written in the following form:

1 NAME Иван Иванович /Иванов (Иванов)/
2 GIVN Иван
2 SURN Иванов

Clearly, the 2 GIVN and 2 SURN lines are not taken into account by the proposed script. Depending on the genealogical application that imports the file, this can cause inconsistencies, and Cyrillic parts of the name may still be displayed.
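To make this structure more tangible, here is a minimal sketch of my own (it is not part of the ChatGPT conversation). It assumes the simplified line layout shown above, namely a level number, an optional cross-reference id, a tag and a value, and it reports the name-related lines that still contain Cyrillic letters. The helper names `parse_line` and `untransliterated_name_lines` are hypothetical.

```python
import re

# Any character from the basic Cyrillic Unicode block
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Only the tags that appear in the example above
NAME_TAGS = {"NAME", "GIVN", "SURN"}


def parse_line(line):
    """Split one GEDCOM line into (level, tag, value)."""
    parts = line.strip().split(" ")
    level = int(parts[0])
    if parts[1].startswith("@"):
        # Optional cross-reference id, e.g. "0 @P1@ INDI"
        tag, value_parts = parts[2], parts[3:]
    else:
        tag, value_parts = parts[1], parts[2:]
    return level, tag, " ".join(value_parts)


def untransliterated_name_lines(gedcom_text):
    """Return the name-related lines whose value still contains Cyrillic."""
    leftovers = []
    for line in gedcom_text.splitlines():
        if not line.strip():
            continue
        level, tag, value = parse_line(line)
        if tag in NAME_TAGS and CYRILLIC.search(value):
            leftovers.append((level, tag, value))
    return leftovers


# What a file looks like after only the 1 NAME line has been transliterated
sample = """0 @P1@ INDI
1 NAME Ivan Ivanovich /Ivanov (Ivanov)/
2 GIVN Иван
2 SURN Иванов"""

print(untransliterated_name_lines(sample))
# [(2, 'GIVN', 'Иван'), (2, 'SURN', 'Иванов')]
```

A check like this is exactly the kind of knowledge we are pretending not to have in this exercise, which is why the prompts that follow stay deliberately vague.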

Another attempt to point out the problem does not solve it either:

It doesn’t transliterate surnames and given names

The suggested change:

# Function to transliterate the entire name, including given names, surnames, and maiden names
def transliterate_full_name(full_name):
    # Split the full name by spaces and slashes, keeping slashes for maiden names intact
    parts = re.split(r'(\s+|/)', full_name)
    transliterated_parts = [transliterate_name(part) if not re.match(r'(\s+|/)', part) else part for part in parts]
    return ''.join(transliterated_parts)

You see? The point is that we perform this exercise assuming that we know very little about the GEDCOM format itself, or about the Python language. If I knew where the problem came from, it would perhaps be faster for me to fix the script manually by adding the missing instructions. That is why I avoided telling ChatGPT exactly where the problem came from. Technically speaking, if we lack this knowledge, we reach an impasse. Any further attempt to explain to ChatGPT that the script does its job only partially leads to modifications of the transliterate_full_name function that still do not take 2 GIVN and 2 SURN into account.

A solution? Here comes BDD!

BDD prompting

The situation can radically change if we formulate our requirements in a more structured form and give concrete examples:

As a user,

I want to have my genealogical tree transliterated from the Cyrillic alphabet to Latin

So that I can share it with the foreign part of my family

Scenario 1:

Given: a GEDCOM file containing a male person with the first name “Андрей”, patronymic name “Сергеевич” and the last name “Лебедев”

When: I run the transliterating script

Then: the transliterated file contains the same person with the first name “Andrey”, patronymic name “Sergeevich” and the last name “Lebedev” instead

Scenario 2:

Given: a GEDCOM file containing a female person with the first name “Акулина”, patronymic name “Ивановна”, the last name “Головкина” and the maiden name “Раевская”

When: I run the transliterating script

Then: the transliterated file contains the same person with the first name “Akulina”, patronymic name “Ivanovna”, the last name “Golovkina” and the maiden name “Raevskaya” instead

Just prompting ChatGPT to generate a script for these requirements gives us more or less the same result as in the previous attempt. However, we get a significantly better result if we do one of the following: either ask ChatGPT to create automated tests and check whether they pass, or ask it to create automated tests beforehand and then create an implementation that makes these tests pass.

Let’s see what it looks like. First, what sort of tests does it generate?

import unittest
import xml.etree.ElementTree as ET

# Placeholder for your transliteration function
def transliterate_cyrillic_to_latin(gedcom_data):
    # Mock transliteration logic for demonstration purposes
    transliteration_map = {
        "Андрей": "Andrey",
        "Сергеевич": "Sergeevich",
        "Лебедев": "Lebedev",
        "Акулина": "Akulina",
        "Ивановна": "Ivanovna",
        "Головкина": "Golovkina",
        "Раевская": "Raevskaya"
    }

    for key, value in transliteration_map.items():
        gedcom_data = gedcom_data.replace(key, value)

    return gedcom_data

class TestTransliterateGenealogy(unittest.TestCase):

    def test_male_person_transliteration(self):
        # Scenario 1: Male person
        original_gedcom = """
0 @P1@ INDI
1 NAME Андрей /Лебедев/
2 GIVN Андрей
2 SURN Лебедев
"""
        expected_gedcom = """
0 @P1@ INDI
1 NAME Andrey /Lebedev/
2 GIVN Andrey
2 SURN Lebedev
"""

        # Run the transliteration
        result_gedcom = transliterate_cyrillic_to_latin(original_gedcom)

        # Check that the transliterated GEDCOM matches the expected output
        self.assertEqual(result_gedcom.strip(), expected_gedcom.strip())

    # ... I omitted the rest for clarity

if __name__ == '__main__':
    unittest.main()

Second, we do not even need to run it locally on our machine. This code itself serves as a super-formalised requirement for the LLM.

Asking ChatGPT to correct the initial implementation so the test passes

This is my next prompt:

Assume that all needed packages are installed, do check if the test has passed

And here the magic happens: ChatGPT finds out that the tests are actually failing:

The test suite has failed on Scenario 2. The issue is that the transliteration is not applied to the given name (GIVN), surname (SURN), and maiden name (_MARNM) fields in the additional lines of the GEDCOM content.

To address this, we need to extend the transliteration logic to cover these additional lines. Here’s the updated function and tests:

def transliterate_gedcom_content(gedcom_content):
    transliterated_lines = []
    for line in gedcom_content.splitlines():
        # Check if the line contains a name or additional name parts (GIVN, SURN, _MARNM)
        if line.startswith('1 NAME') or line.startswith('2 GIVN') or line.startswith('2 SURN') or line.startswith('2 _MARNM'):
            # Extract the name part
            match = re.match(r'(1 NAME|2 GIVN|2 SURN|2 _MARNM) (.+)', line)
            if match:
                tag = match.group(1)
                original_name = match.group(2)
                # Transliterate the name
                transliterated_name = transliterate_full_name(original_name)
                # Replace the original name with the transliterated name
                line = f'{tag} {transliterated_name}'
        transliterated_lines.append(line)

    return "\n".join(transliterated_lines)

You see? It suddenly realises that it has been missing the additional parts of the person description. The key change is this line of code:

match = re.match(r'(1 NAME|2 GIVN|2 SURN|2 _MARNM) (.+)', line)

Note that nowhere did we provide ChatGPT with this information about the additional blocks. We only gave concrete examples and asked it to generate and run the tests.
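If you want to try the corrected idea without installing anything, here is a self-contained sketch of my own along the same lines. Two things in it are assumptions rather than facts: the letter table `CYR_TO_LAT` is a hand-written illustration (it is not the scheme used by the `transliterate` package), and the `_MARNM` tag is simply taken from ChatGPT's reply above, so your genealogical application may store maiden names differently.

```python
import re

# Illustrative, hand-written letter table (an assumption for this sketch,
# not the mapping of the `transliterate` package)
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

# Any level number followed by one of the name-related tags and a space
NAME_LINE = re.compile(r"^(\d+ (?:NAME|GIVN|SURN|_MARNM) )(.*)$")


def transliterate_text(text):
    """Replace Cyrillic letters one by one, keeping everything else as is."""
    result = []
    for char in text:
        latin = CYR_TO_LAT.get(char.lower())
        if latin is None:
            result.append(char)                # not Cyrillic: slashes, spaces, Latin
        elif char.isupper():
            result.append(latin.capitalize())  # "Я" becomes "Ya"
        else:
            result.append(latin)
    return "".join(result)


def transliterate_gedcom(gedcom_text):
    """Transliterate the value part of every name-related line."""
    lines = []
    for line in gedcom_text.splitlines():
        match = NAME_LINE.match(line)
        if match:
            line = match.group(1) + transliterate_text(match.group(2))
        lines.append(line)
    return "\n".join(lines)


# Scenario 2 from the requirements, in the layout assumed above
scenario_2 = "\n".join([
    "0 @P2@ INDI",
    "1 NAME Акулина Ивановна /Головкина/",
    "2 GIVN Акулина",
    "2 SURN Головкина",
    "2 _MARNM Раевская",
])

print(transliterate_gedcom(scenario_2))
```

For the two scenarios in this article the simple letter table happens to produce exactly the expected spellings. Real names will need a more careful scheme.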

I find it amazing because it means that automated tests serve as negative feedback inside an LLM:

Automated tests as balancing feedback inside the LLM
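What this feedback looks like in practice can be sketched in a few lines. This is my own illustration, not something ChatGPT produced: a deliberately incomplete candidate (hypothetical name `name_lines_only`) is compared with the expected output of Scenario 1, and the difference is turned into a short report with the standard `difflib` module. A report like this is what gets fed back into the next attempt.

```python
import difflib


def name_lines_only(gedcom_text):
    """Deliberately incomplete stand-in: it fixes only the 1 NAME line."""
    return gedcom_text.replace("1 NAME Андрей /Лебедев/", "1 NAME Andrey /Lebedev/")


def feedback_for(candidate, given, expected):
    """Return an empty string on success, otherwise a line-by-line diff."""
    actual = candidate(given)
    if actual == expected:
        return ""
    diff = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    )
    return "\n".join(diff)


given = "1 NAME Андрей /Лебедев/\n2 GIVN Андрей\n2 SURN Лебедев"
expected = "1 NAME Andrey /Lebedev/\n2 GIVN Andrey\n2 SURN Lebedev"

report = feedback_for(name_lines_only, given, expected)
print(report)
```

The report points directly at the GIVN and SURN lines, which is the very information we refused to give ChatGPT by hand in the non-BDD attempt.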

Asking ChatGPT to generate the tests first and then provide the implementation

For the sake of brevity, I will omit additional code snippets. I will just say that if you provide ChatGPT with the requirements in the BDD form and ask it to generate the tests first, and only after that the code of the script, it will (most probably) produce the correct script on the first attempt.

I believe the automated tests suite still serves here as balancing feedback, but this time the initial requirements do not produce an initial code implementation:

Balancing feedback with generating automated tests first

Conclusions

In the coming epoch of ubiquitous use of AI technologies for code generation, it becomes crucial to find a way to formulate requirements in a crystal clear and unambiguous form, so that the result produced by the AI system is correct. As future code-agnostic users, we do not want to double-check the produced code. We want the correct result on the first attempt.

The BDD technique can be the perfect way to achieve that. Collecting the requirements in a clear form, communicating with the stakeholders and providing concrete examples helps not only human developers but LLMs as well (who would be surprised?).

Yet just collecting and presenting requirements in the best possible form is sometimes still not enough to get the right solution. As BDD is a generalisation of TDD, it is crucial to produce automated tests based on the given scenarios. Asking ChatGPT to produce tests first (or even after the first attempt) leads to a properly written script (or to auto-correction of the previous defective implementation). Hence, automated tests serve as balancing feedback even for AI, and incorporating them into AI-driven code generation can not only prevent errors but also streamline development, ensuring robust implementations from the start.

Acknowledgements

I want to thank Dr. Igor Belyaletdinov for his consultation on the topic of self-regulating and feedback mechanisms.


Andrey Lebedev

PhD in CS, a Software engineer with more than 20 years of experience. Check my LinkedIn profile for more information: https://www.linkedin.com/in/andremoniy/