GPT-4’s Performance on U.S. Bar Exam Contradicts Its Claims

by Damir Yalalov

Published: May 30, 2023 at 3:52 am Updated: May 30, 2023 at 3:52 am

by Karolina Gaszcz

Edited and fact-checked: May 30, 2023 at 3:52 am

In Brief

The examination of GPT-4’s performance on the Uniform Bar Exam revealed a discrepancy between estimated and actual performance, emphasizing the importance of transparent evaluation procedures and accessible data.

OpenAI is encouraged to address discrepancies and develop a more inclusive and reliable approach to AI model evaluation to gain trust and ensure credibility.

A recent examination of GPT-4’s performance on the Uniform Bar Exam (UBE) has cast doubt on the accuracy of OpenAI’s claims about the model’s results. Contrary to the initial assertion that GPT-4 outperforms 90% of test-takers, the findings suggest a significant discrepancy between the estimated and actual performance of the AI model. This emphasizes the importance of transparent evaluation procedures and accessible data for validating such claims.

The examination considered several factors to ascertain the true capabilities of GPT-4. First, an analysis of the February exams in Illinois showed GPT-4’s scores approaching the 90th percentile. However, that ranking was heavily influenced by retakers who had previously failed the July exam and thus scored below the overall average, which makes the comparison group weaker than a typical pool of test-takers.

Furthermore, the results of the July exam contradicted OpenAI’s claims: measured against July test-takers, GPT-4 would outperform only 68% of them overall and 48% on the essays. Against first-time takers (excluding retakers), with official data from several exam administrations taken into account, GPT-4 ranked at the 63rd percentile overall and considerably lower, at the 41st percentile, on the essays.

An additional perspective was gained by examining the performance of those who passed the exam, including licensed individuals and those awaiting licensing. In this regard, GPT-4’s overall performance was ranked at the 48th percentile, with essays faring even worse at the 15th percentile.
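
The way one score can land at very different percentiles depending on the comparison group can be illustrated with a minimal Python sketch. The three cohorts below are hypothetical numbers invented purely for illustration, not real bar exam data, and `percentile_rank` is a helper defined here rather than part of any published analysis. Only the score of 298 comes from the article.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Return the percentage of reference scores strictly below `score`."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)  # count of scores lower than `score`
    return 100.0 * below / len(ordered)


# Hypothetical cohorts, invented for illustration only (NOT real exam data).
retaker_heavy = [230, 240, 250, 255, 260, 265, 270, 280, 290, 305]
first_timers = [250, 265, 275, 285, 290, 295, 300, 305, 310, 320]
passers_only = [270, 280, 290, 295, 300, 305, 310, 315, 320, 330]

score = 298  # the GPT-4 score reported in the article

for name, cohort in [
    ("retaker-heavy sitting", retaker_heavy),
    ("first-time takers", first_timers),
    ("passers only", passers_only),
]:
    print(f"{name}: percentile rank {percentile_rank(score, cohort):.0f}")
```

The score never changes in this sketch. Only the reference population does, which is why a headline percentile means little unless the sample behind it is disclosed.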

While these findings are troubling, it is critical to consider the possibility of human error in the review process. The author of the article emphasizes the importance of understanding the sample the researchers used to evaluate GPT-4’s performance. The lack of official data, especially in aggregated form, makes fair comparison and evaluation of percentiles difficult. Establishing clear and accessible evaluation methods that all stakeholders can scrutinize is critical.

In response to these concerns, OpenAI is urged to address the discrepancies and provide further insights into the evaluation process. Transparency and openness are essential for gaining trust and ensuring the credibility of AI models in high-stakes domains such as law.

It should be noted that the article does not discuss the specific score achieved by GPT-4, which is reported to be 298. Evaluating the significance of this score requires a contextual understanding of the grading system used. Just as a child coming home from school with a B could be a cause for either celebration or disappointment, the interpretation of GPT-4’s score depends on the scale employed.

The assessment of GPT-4’s performance on the bar exam raises serious concerns about the veracity of OpenAI’s initial assertions. The gap between estimated and actual performance emphasizes the importance of clear evaluation systems and easily accessible data. OpenAI is encouraged to address these challenges and develop a more inclusive and reliable approach to AI model evaluation.
