GPT-4 Inherits “Hallucinating” Facts and Reasoning Errors From Earlier GPT Models
OpenAI says GPT-4 has similar limitations as earlier GPT models.
GPT-4 still hallucinates facts and makes reasoning errors.
However, GPT-4 scores 40% higher than OpenAI’s latest GPT-3.5 on the company’s internal adversarial factuality evaluations.
OpenAI has warned users that its latest language model, GPT-4, is still not fully reliable and can “hallucinate” facts and make reasoning errors. The company urges users to exercise caution when using language model outputs, especially in “high-stakes contexts.”
However, the good news is that GPT-4 significantly reduces hallucinations relative to previous models. OpenAI claims that GPT-4 scores 40% higher than the latest GPT-3.5 on internal adversarial factuality evaluations.
“We have made progress on external benchmarks like TruthfulQA, which tests the model’s ability to separate fact from an adversarially-selected set of incorrect statements. These questions are paired with factually incorrect answers that are statistically appealing,” OpenAI wrote in a blog post.
Despite this improvement, the model still lacks knowledge of events that occurred after September 2021 and sometimes makes simple reasoning errors, just like earlier models do. Additionally, it can be overly gullible in accepting obvious false statements from users and fail at hard problems, such as introducing security vulnerabilities into its code. It also does not fact-check the information it provides.
Like its predecessors, GPT-4 can generate harmful advice, buggy code, or inaccurate information. However, the model’s additional capabilities lead to new risk surfaces that need to be understood. To assess the extent of these risks, over 50 experts from various domains, including AI alignment risks, cybersecurity, biorisk, trust and safety, and international security, were engaged to adversarially test the model. Their feedback and data were then used to improve the model, such as collecting additional data to enhance GPT-4’s ability to refuse requests on how to synthesize dangerous chemicals.
One of the main ways OpenAI is reducing harmful outputs is by incorporating an additional safety reward signal during RLHF (Reinforcement Learning from Human Feedback) training. The signal trains the model to refuse requests for harmful content, as defined by the model’s usage guidelines. The reward is provided by a GPT-4 zero-shot classifier, which judges safety boundaries and completion style on safety-related prompts.
OpenAI also said that it had decreased the model’s tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests such as medical advice and self-harm in accordance with the company’s policies 29% more often.
While OpenAI’s interventions have increased the difficulty of eliciting bad behavior from GPT-4, it is still possible, and there are still jailbreaks that can generate content that violates usage guidelines.
“As AI systems become more prevalent, achieving high degrees of reliability in these interventions will become increasingly critical. For now, it’s essential to complement these limitations with deployment-time safety techniques like monitoring for abuse,” the company added.
OpenAI is collaborating with external researchers to better understand and assess the potential impacts of GPT-4 and its successor models. The team is also developing evaluations for dangerous capabilities that may emerge in future AI systems. As they continue to study the potential social and economic impacts of GPT-4 and other AI systems, OpenAI will share their findings and insights with the public in due time.
Any data, text, or other content on this page is provided as general market information and not as investment advice. Past performance is not necessarily an indicator of future results.
The Trust Project is a worldwide group of news organizations working to establish transparency standards.