August 20, 2024

Understanding Python Tests and Git Patches for OpenAI SWE Bench Verified Dataset

llm

evaluation

python

tutorial

OpenAI released a human-validated subset of SWE-bench dataset, named SWE-bench Verified, to help with evaluating LLMs’ ability to solve real-world software issues.

The core idea is that a human-developed patch successfully resolved a software issue (problem statement) and passed automated tests. AI models are asked to generate a patch based on

the entire codebase
the problem statement

Then, the AI-generated patch is evaluated using the same tests to see whether it solved the problem or not.

This dataset is collected from real Python Github repos and is quite technical to understand and use (I will write another blog to explain the dataset).

In this post, I created a simple Python repository to demonstrate essential technical concepts and commands that are prerequisites for understanding the OpenAI SWE Bench Verified Dataset.

python unit tests
create and apply git patches
ask AI to generate Python patches

First, clone the repo: git clone https://github.com/harrywang/git-patch and check all commits: git log:

commit 55e70342dd9a233a5d4d6b14a929ea0f490e91e5 (HEAD -> main, origin/main, origin/HEAD)
Author: Harry Wang <harryjwang@gmail.com>
Date:   Tue Aug 20 11:14:48 2024 -0400

    added gpt prompt

commit 9a7115f79891d52f57b5e142b7dcd82fd5713e98
Author: Harry Wang <harryjwang@gmail.com>
Date:   Tue Aug 20 11:06:24 2024 -0400

    added patch file

commit d1a15a652860c4e0fbaa24afa9a8338acbf0e981
Author: Harry Wang <harryjwang@gmail.com>
Date:   Tue Aug 20 10:41:56 2024 -0400

    fix bug

commit d7b2378faa43afd85000f794acf6a9384cb4880c
Author: Harry Wang <harryjwang@gmail.com>
Date:   Tue Aug 20 10:33:36 2024 -0400

    initial not working version

Check out the initial NOT working version: git checkout d7b2378faa43afd85000f794acf6a9384cb4880c

You can see a simple Python file calculations.py for calculate summation:

class Calculations:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def get_sum(self):
        return self.a + self.b + 1

As you can see I intentionally added 1 to introduce an error in the calculation (which will be fixed in a commit).

A Python test file tests.py is created to test the get_sum function:

import unittest
from calculations import Calculations

class TestCalculations(unittest.TestCase):

    def test_sum(self):
        calculation = Calculations(8, 2)
        self.assertEqual(calculation.get_sum(), 10, 'The sum is wrong.')

if __name__ == '__main__':
    unittest.main()

Now, run the test: python tests.py, the test will fail:

python tests.py
F
======================================================================
FAIL: test_sum (__main__.TestCalculations)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/harrywang/sandbox/git-patch/tests.py", line 8, in test_sum
    self.assertEqual(calculation.get_sum(), 10, 'The sum is wrong.')
AssertionError: 11 != 10 : The sum is wrong.

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

I then removed the 1 to fix the issue and made another commit:

commit d1a15a652860c4e0fbaa24afa9a8338acbf0e981
Author: Harry Wang <harryjwang@gmail.com>
Date:   Tue Aug 20 10:41:56 2024 -0400

    fix bug

Now, you can generate a patch file based on the commit:

git format-patch -1 d1a15a652860c4e0fbaa24afa9a8338acbf0e981
0001-fix-bug.patch

While you are still in the not working Git workspace, try to apply this patch: git apply 0001-fix-bug.patch

You should see that the 1 is removed from the python file, and run the test again and it will pass:

.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

I then used the following prompt to ask gpt-4o to generate a patch file:

The sum is not working, given the following code in calculations.py:

class Calculations:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def get_sum(self):
        return self.a + self.b + 1

    def get_difference(self):
        return self.a - self.b

    def get_product(self):
        return self.a * self.b

    def get_quotient(self):
        return self.a / self.b



 generate a git patch file to fix this problem

The result is:

The issue in your code is that the `get_sum` method incorrectly adds an extra `1` to the sum. I'll generate a git patch file to fix this issue.

Here is the git patch:

```diff
From 1234567abcdef123456789abcdef123456789abc Mon Sep 17 00:00:00 2001
From: Your Name <your.email@example.com>
Date: Tue, 20 Aug 2024 12:00:00 +0000
Subject: [PATCH] Fix incorrect sum calculation

Remove the unnecessary `+ 1` from the `get_sum` method.

---
 calculations.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/calculations.py b/calculations.py
index abcdef1..1234567 100644
--- a/calculations.py
+++ b/calculations.py
@@ -5,7 +5,7 @@ class Calculations:
         self.b = b
 
     def get_sum(self):
-        return self.a + self.b + 1
+        return self.a + self.b
 
     def get_difference(self):
         return self.a - self.b
-- 
2.34.1

I saved the generated patch code into ai-fix.patch and run git apply ai-fix.patch. It solved the problem and passed the test.

While SWE Bench uses Docker to run the evaluations (with a more complex process and code), the underlying concept is similar.

PS. The featured image for this post is generated using HiddenArt tool from Takin.ai.