March 16, 2023

OpenAI Announces Evals, An Open-Source Software Framework for Evaluating AI Models

by Cindy Tan

Published: March 16, 2023 at 12:30 pm Updated: March 16, 2023 at 12:30 pm

In Brief

OpenAI hopes to crowdsource benchmarks for evaluating AI models like GPT-4.

Payment processing company Stripe has already used Evals to measure the accuracy of its GPT-powered documentation tool.

OpenAI will grant GPT-4 access for a limited time to those who contribute high-quality evals.

Alongside the announcement of GPT-4, OpenAI has open-sourced OpenAI Evals, a software framework for creating and running benchmarks that evaluate the performance of models like GPT-4. With Evals, OpenAI hopes to crowdsource benchmarks for AI model testing.

“We use Evals to guide development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions (which will now be coming out regularly) and evolving product integrations,” the company explains in a blog post.

Stripe, a popular payment processing company, has already used Evals to complement its human evaluations and measure the accuracy of its GPT-powered documentation tool.

Developers can use Evals to create and run evaluations that:

Use datasets to generate prompts,
Measure the quality of completions provided by an OpenAI model, and
Compare performance across different datasets and models.
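The three capabilities above can be illustrated with a minimal Python sketch. This is not the OpenAI Evals API: every name here (`build_prompt`, `exact_match_accuracy`, `compare`, the toy datasets and the two stand-in "models") is hypothetical, and the models are plain functions rather than calls to a real model.

```python
from typing import Callable, Dict, List

# Hypothetical sample format: an input question and the ideal answer.
Sample = Dict[str, str]
# Stand-in for a real model call: takes a prompt, returns a completion.
Model = Callable[[str], str]


def build_prompt(sample: Sample) -> str:
    """Step 1: turn one dataset record into a prompt."""
    return f"Answer concisely.\nQ: {sample['input']}\nA:"


def exact_match_accuracy(model: Model, dataset: List[Sample]) -> float:
    """Step 2: score completions by case-insensitive exact match."""
    if not dataset:
        return 0.0
    correct = 0
    for sample in dataset:
        completion = model(build_prompt(sample))
        if completion.strip().lower() == sample["ideal"].strip().lower():
            correct += 1
    return correct / len(dataset)


def compare(
    models: Dict[str, Model], datasets: Dict[str, List[Sample]]
) -> Dict[str, Dict[str, float]]:
    """Step 3: score every model on every dataset."""
    return {
        model_name: {
            ds_name: exact_match_accuracy(model, ds)
            for ds_name, ds in datasets.items()
        }
        for model_name, model in models.items()
    }


# Toy data and toy "models", invented purely for illustration.
ARITHMETIC = [{"input": "2 + 2", "ideal": "4"}, {"input": "3 * 3", "ideal": "9"}]
CAPITALS = [{"input": "Capital of France?", "ideal": "Paris"}]


def always_four(prompt: str) -> str:
    """A deliberately weak model that answers "4" to everything."""
    return "4"


def lookup_model(prompt: str) -> str:
    """A model that knows the answers to exactly these toy questions."""
    answers = {"2 + 2": "4", "3 * 3": "9", "Capital of France?": "Paris"}
    for question, answer in answers.items():
        if question in prompt:
            return answer
    return ""


if __name__ == "__main__":
    report = compare(
        {"always_four": always_four, "lookup": lookup_model},
        {"arithmetic": ARITHMETIC, "capitals": CAPITALS},
    )
    for model_name, scores in report.items():
        print(model_name, scores)
```

Exact match is only one possible quality measure, chosen here because it is easy to verify by hand; a real benchmark may need fuzzier matching or a grading model.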

Because the code is open source, developers can write their own custom evals or build on several included templates that fit common benchmark formats. The company has included the templates it found most useful internally, including one for “model-graded evals,” in which a model such as GPT-4 checks its own work. As an example to follow, the company has created a logic puzzles eval containing ten prompts where GPT-4 fails.
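The idea behind a model-graded eval can be sketched as follows: one model produces an answer, and a second call asks a grader model to judge it. This is a rough illustration under stated assumptions, not the template shipped with Evals. The grader prompt, the Y/N verdict format and all function names are hypothetical, and both "models" are toy functions.

```python
from typing import Callable, List, NamedTuple

# Stand-in for a real model call: takes a prompt, returns a completion.
Model = Callable[[str], str]


class GradedSample(NamedTuple):
    question: str
    completion: str
    verdict: str  # "Y" or "N" after normalization


# Hypothetical grading prompt; a real template would likely be more detailed.
GRADER_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Submitted answer: {completion}\n"
    "Is the submitted answer correct? Reply with a single letter, Y or N."
)


def parse_verdict(raw: str) -> str:
    """Map grader output to "Y" or "N"; anything not starting with Y is "N"."""
    return "Y" if raw.strip().upper().startswith("Y") else "N"


def model_graded_eval(
    candidate: Model, grader: Model, questions: List[str]
) -> List[GradedSample]:
    """Ask the candidate each question, then ask the grader to judge it."""
    results = []
    for question in questions:
        completion = candidate(question)
        raw = grader(
            GRADER_TEMPLATE.format(question=question, completion=completion)
        )
        results.append(GradedSample(question, completion, parse_verdict(raw)))
    return results


def pass_rate(results: List[GradedSample]) -> float:
    """Fraction of samples the grader approved."""
    if not results:
        return 0.0
    return sum(r.verdict == "Y" for r in results) / len(results)


def toy_candidate(prompt: str) -> str:
    """Toy candidate that answers "4" to every question."""
    return "4"


def toy_grader(prompt: str) -> str:
    """Toy grader that approves only the answer "4" to the question "2 + 2"."""
    ok = "Question: 2 + 2\n" in prompt and "Submitted answer: 4\n" in prompt
    return "Y" if ok else "N"


if __name__ == "__main__":
    graded = model_graded_eval(toy_candidate, toy_grader, ["2 + 2", "3 * 3"])
    for sample in graded:
        print(sample)
    print("pass rate:", pass_rate(graded))
```

Treating any unclear grader reply as a failure is a conservative design choice for this sketch; it avoids counting a malformed verdict as a pass.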

Evals can also be used to implement existing benchmarks. The release includes several notebooks implementing academic benchmarks, along with a few variations that integrate small subsets of CoQA.

While developers will not be paid for contributing evals, OpenAI will grant GPT-4 access for a limited time to those who contribute “high-quality evals.”

The announcement of Evals comes after OpenAI recently said it would stop using data submitted by customers via its API to train or improve its models unless the customers decide to opt in. With Evals, the company joins Meta in crowdsourcing benchmarks: Meta's DynaBench platform tasks humans with “finding adversarial examples that fool current state-of-the-art models.”

About The Author

Cindy is a journalist at Metaverse Post, covering topics related to web3, NFT, metaverse and AI, with a focus on interviews with Web3 industry players. She has spoken to over 30 C-level execs and counting, bringing their valuable insights to readers. Originally from Singapore, Cindy is now based in Tbilisi, Georgia. She holds a Bachelor's degree in Communications & Media Studies from the University of South Australia and has a decade of experience in journalism and writing. Get in touch with her via [email protected] with press pitches, announcements and interview opportunities.
