August 14, 2024

OpenAI Introduces SWE-Bench Verified To Improve Reliability Of AI Model Evaluation

by Alisa Davidson

Published: August 14, 2024 at 5:37 am Updated: August 14, 2024 at 5:37 am

by Ana

Edited and fact-checked: August 14, 2024 at 5:37 am

In Brief

OpenAI released a human-validated subset of SWE-bench, designed to more accurately assess AI models’ ability to solve real-world software problems.

Artificial intelligence research organization OpenAI announced the release of a human-validated subset of SWE-bench, designed to more accurately assess AI models’ ability to solve real-world software problems.

SWE-bench is a benchmark that assesses the capabilities of large language models (LLMs) in addressing real-world software issues sourced from GitHub. It is a widely used evaluation for software engineering: an agent is given a code repository and an issue description and is tasked with creating a patch that resolves the described problem.
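The task format described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual schema or harness: the `TaskInstance` class, its field names, and the helper functions are hypothetical, and the pass criterion (every required test passes once the patch is applied) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical container for one benchmark sample (field names are illustrative)."""
    repo: str        # repository the agent receives, e.g. "owner/project"
    issue_text: str  # issue description shown to the agent
    fail_to_pass: list[str] = field(default_factory=list)  # tests the patch should fix
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """Return True if every required test passed after the candidate patch was applied."""
    required = task.fail_to_pass + task.pass_to_pass
    return all(name in passed_tests for name in required)


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under these assumptions, a benchmark score is simply `resolve_rate` applied to the per-task outcomes returned by `is_resolved`.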

OpenAI also uses SWE-bench as one of the evaluations for tracking the Medium risk level of the Model Autonomy risk category in its Preparedness Framework. Assessing catastrophic risk levels depends on reliable evaluation results and a clear understanding of what the scores represent.

We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. https://t.co/qJuLpCdSWJ
— OpenAI (@OpenAI) August 13, 2024

The company released SWE-bench Verified in collaboration with the authors of SWE-bench. It is a subset of the original SWE-bench test set comprising 500 samples that human annotators confirmed to be non-problematic, and it supersedes both the original SWE-bench and SWE-bench Lite test sets. OpenAI is also releasing the human annotations for all SWE-bench test samples.

A new evaluation harness for SWE-bench has also been developed. It uses containerized Docker environments to make evaluations on SWE-bench simpler and more reliable.

Using this dataset, OpenAI evaluated GPT-4o with several open-source scaffolds. With the highest-performing scaffold, GPT-4o scored 33.2% on SWE-bench Verified, more than double its previous score of 16% on the original SWE-bench.
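To put the reported figures in concrete terms, the short calculation below converts the 33.2% score into a sample count on the 500-sample set and checks the "more than double" comparison. The variable names are illustrative, and the two percentages come from different test sets, so the ratio is a rough comparison rather than a like-for-like measurement.

```python
VERIFIED_SAMPLES = 500   # size of SWE-bench Verified, as reported
verified_score = 33.2    # GPT-4o, best scaffold, SWE-bench Verified (%)
original_score = 16.0    # GPT-4o on the original SWE-bench (%)

# Number of Verified samples that a 33.2% score corresponds to.
solved = round(VERIFIED_SAMPLES * verified_score / 100)

# Ratio between the two reported scores.
ratio = verified_score / original_score

print(solved)      # 166
print(ratio > 2)   # True
```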

Cosine Achieves 30% Success Rate In Solving Real-World Programming Issues, GPT-4o Climbs To Second Place

The challenges in this benchmark are derived from a set of real-world programming problems known for being particularly tough for AIs. In March, startup Cognition AI reported that its model could solve 14% of these problems.

Recently, startup Cosine announced it had achieved a 30% success rate, setting a new record. Meanwhile, a model based on OpenAI’s GPT-4o now holds the second-place position, up from third place with a previous version of the test.
