OpenAI released a human-validated subset of SWE-bench dataset, named SWE-bench Verified, to help with evaluating LLMs’ ability to solve real-world software issues.
The core idea is that a human-developed patch successfully resolved a software issue (problem statement) and passed automated tests. AI models are asked to generate a patch based on
Then, the AI-generated patch is evaluated using the same tests to see whether it solved the problem or not.
This dataset is collected from real Python Github repos and is quite technical to understand and use (I will write another blog to explain the dataset).
In this post, I created a simple Python repository to demonstrate essential technical concepts and commands that are prerequisites for understanding the OpenAI SWE Bench Verified Dataset.
First, clone the repo: git clone https://github.com/harrywang/git-patch
and check all commits: git log
:
commit 55e70342dd9a233a5d4d6b14a929ea0f490e91e5 (HEAD -> main, origin/main, origin/HEAD)
Author: Harry Wang <harryjwang@gmail.com>
Date: Tue Aug 20 11:14:48 2024 -0400
added gpt prompt
commit 9a7115f79891d52f57b5e142b7dcd82fd5713e98
Author: Harry Wang <harryjwang@gmail.com>
Date: Tue Aug 20 11:06:24 2024 -0400
added patch file
commit d1a15a652860c4e0fbaa24afa9a8338acbf0e981
Author: Harry Wang <harryjwang@gmail.com>
Date: Tue Aug 20 10:41:56 2024 -0400
fix bug
commit d7b2378faa43afd85000f794acf6a9384cb4880c
Author: Harry Wang <harryjwang@gmail.com>
Date: Tue Aug 20 10:33:36 2024 -0400
initial not working version
Check out the initial NOT working version: git checkout d7b2378faa43afd85000f794acf6a9384cb4880c
You can see a simple Python file calculations.py
for calculate summation:
class Calculations:
def __init__(self, a, b):
self.a = a
self.b = b
def get_sum(self):
return self.a + self.b + 1
As you can see I intentionally added 1 to introduce an error in the calculation (which will be fixed in a commit).
A Python test file tests.py
is created to test the get_sum
function:
import unittest
from calculations import Calculations
class TestCalculations(unittest.TestCase):
def test_sum(self):
calculation = Calculations(8, 2)
self.assertEqual(calculation.get_sum(), 10, 'The sum is wrong.')
if __name__ == '__main__':
unittest.main()
Now, run the test: python tests.py
, the test will fail:
python tests.py
F
======================================================================
FAIL: test_sum (__main__.TestCalculations)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/harrywang/sandbox/git-patch/tests.py", line 8, in test_sum
self.assertEqual(calculation.get_sum(), 10, 'The sum is wrong.')
AssertionError: 11 != 10 : The sum is wrong.
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (failures=1)
I then removed the 1 to fix the issue and made another commit:
commit d1a15a652860c4e0fbaa24afa9a8338acbf0e981
Author: Harry Wang <harryjwang@gmail.com>
Date: Tue Aug 20 10:41:56 2024 -0400
fix bug
Now, you can generate a patch file based on the commit:
git format-patch -1 d1a15a652860c4e0fbaa24afa9a8338acbf0e981
0001-fix-bug.patch
While you are still in the not working Git workspace, try to apply this patch: git apply 0001-fix-bug.patch
You should see that the 1
is removed from the python file, and run the test again and it will pass:
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
I then used the following prompt to ask gpt-4o to generate a patch file:
The sum is not working, given the following code in calculations.py:
class Calculations:
def __init__(self, a, b):
self.a = a
self.b = b
def get_sum(self):
return self.a + self.b + 1
def get_difference(self):
return self.a - self.b
def get_product(self):
return self.a * self.b
def get_quotient(self):
return self.a / self.b
generate a git patch file to fix this problem
The result is:
The issue in your code is that the `get_sum` method incorrectly adds an extra `1` to the sum. I'll generate a git patch file to fix this issue.
Here is the git patch:
```diff
From 1234567abcdef123456789abcdef123456789abc Mon Sep 17 00:00:00 2001
From: Your Name <your.email@example.com>
Date: Tue, 20 Aug 2024 12:00:00 +0000
Subject: [PATCH] Fix incorrect sum calculation
Remove the unnecessary `+ 1` from the `get_sum` method.
---
calculations.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/calculations.py b/calculations.py
index abcdef1..1234567 100644
--- a/calculations.py
+++ b/calculations.py
@@ -5,7 +5,7 @@ class Calculations:
self.b = b
def get_sum(self):
- return self.a + self.b + 1
+ return self.a + self.b
def get_difference(self):
return self.a - self.b
--
2.34.1
I saved the generated patch code into ai-fix.patch
and run git apply ai-fix.patch
. It solved the problem and passed the test.
While SWE Bench uses Docker to run the evaluations (with a more complex process and code), the underlying concept is similar.
PS. The featured image for this post is generated using HiddenArt tool from Takin.ai.