News Report Technology
March 11, 2025

OpenAI Suggests CoT Monitoring To Address Malicious Behavior In AI Models

In Brief

OpenAI suggests to detect when frontier reasoning models begin exploiting loopholes by using LLMs to monitor the models’ chains of thought.

OpenAI Suggests CoT Monitoring To Address Malicious Behavior In AI Models

Artificial intelligence research organization, OpenAI has discovered a way to detect when frontier reasoning models begin exploiting loopholes, by using large language models (LLMs) to monitor the models’ chains of thought.

Chain-of-thought (CoT) reasoning models process information in a way that is understandable to humans. By observing their “thought process,” OpenAI has been able to identify issues such as models attempting to bypass coding tests, misleading users, or abandoning tasks when they become too complex.

OpenAI believes that monitoring CoT reasoning might be one of the few tools available to oversee superhuman models in the future. The organization also found that while directly optimizing CoTs to meet specific standards (such as preventing reward hacking) can improve performance in the short term, it does not fully eliminate all misbehavior and may lead the model to conceal its intentions. OpenAI hopes that future research will find ways to optimize CoTs without these limitations, but in the meantime, further exploration is needed.

Reward hacking refers to a situation where AI agents achieve high rewards by engaging in behaviors that do not align with the original intentions of their designers. As the development of more advanced frontier reasoning models progresses, it has become apparent that AI models are increasingly skilled at exploiting flaws within their tasks and inaccuracies in their reward functions. This has led to the emergence of models that can execute sophisticated reward hacks, particularly in coding tasks.

Detecting misaligned behavior caused by reward hacking proves to be a difficult challenge, often requiring human intervention to manually monitor the actions of an agent. However, this approach is unlikely to scale effectively, especially as more advanced models uncover increasingly complex behaviors.

Nevertheless, LLMs, like OpenAI’s o3-mini, trained with reinforcement learning to reason through CoT, present a potential solution for monitoring reward hacking. These models process information in natural language, making their reasoning easier for humans to comprehend. OpenAI has observed that frontier reasoning models often make their intentions clear within their chain-of-thought. For instance, some models explicitly state their intention to subvert a task, with statements like “Let’s hack.”

By using another LLM to monitor their thought process, misbehavior can be detected and flagged effectively. However, this method of monitoring is not foolproof. If strict supervision is applied directly to the chain-of-thought, the models can learn to conceal their intent, continuing to misbehave while hiding their true motivations.

OpenAI Recommends Light Optimization For Effective CoT Monitoring

OpenAI notes that the chain-of-thought monitoring is already proving to be valuable and is not merely a theoretical tool for the future. Reward hacking can be identified simply by reviewing the reasoning model’s output, where it clearly states in plain language its intention to engage in reward hacking.

Not long ago, language models faced challenges in generating coherent paragraphs of text. Now, they are capable of solving complex mathematical problems, conducting in-depth research by synthesizing information from various sources, and handling basic software engineering tasks. However, as these capabilities evolve, the potential for increasingly sophisticated and subtle forms of reward hacking grows as well. OpenAI’s models may develop misaligned behaviors, such as power-seeking, sandbagging, deception, and strategic manipulation.

The company believes that CoT monitoring could be one of the few effective methods available for supervising superhuman models. At this point, it advises AI developers working with frontier reasoning models to avoid applying strong supervision directly to the CoTs. The experiments indicate that light optimization pressure can lead to models that perform better and align more closely with their intended goals. However, it remains difficult to detect when models begin hiding their true intentions. Therefore, even with light supervision, OpenAI recommends proceeding with caution.

Disclaimer

In line with the Trust Project guidelines, please note that the information provided on this page is not intended to be and should not be interpreted as legal, tax, investment, financial, or any other form of advice. It is important to only invest what you can afford to lose and to seek independent financial advice if you have any doubts. For further information, we suggest referring to the terms and conditions as well as the help and support pages provided by the issuer or advertiser. MetaversePost is committed to accurate, unbiased reporting, but market conditions are subject to change without notice.

About The Author

Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.

More articles
Alisa Davidson
Alisa Davidson

Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.

Hot Stories

SKALE’s Vision for a Scalable and Accessible Future

by Victoria d'Este
March 11, 2025
Join Our Newsletter.
Latest News

From Ripple to The Big Green DAO: How Cryptocurrency Projects Contribute to Charity

Let's explore initiatives harnessing the potential of digital currencies for charitable causes.

Know More

AlphaFold 3, Med-Gemini, and others: The Way AI Transforms Healthcare in 2024

AI manifests in various ways in healthcare, from uncovering new genetic correlations to empowering robotic surgical systems ...

Know More
Read More
Read more
Impossible Cloud Network Node Sale Ends Tomorrow: Final Opportunity To Participate And Support Network Development
News Report Technology
Impossible Cloud Network Node Sale Ends Tomorrow: Final Opportunity To Participate And Support Network Development
March 11, 2025
SKALE’s Vision for a Scalable and Accessible Future
Interview Business Markets Technology
SKALE’s Vision for a Scalable and Accessible Future
March 11, 2025
Bitget’s February 2025 Transparency Report: $20B Daily Trading Volume, Forbes Recognition, And Global Expansion
News Report Technology
Bitget’s February 2025 Transparency Report: $20B Daily Trading Volume, Forbes Recognition, And Global Expansion
March 11, 2025
Dollar-Pegged Stablecoins Account for Over 98% of Total Supply and Hedge Against Volatility
Opinion Business Markets Technology
Dollar-Pegged Stablecoins Account for Over 98% of Total Supply and Hedge Against Volatility
March 11, 2025