OpenAI Suggests CoT Monitoring To Address Malicious Behavior In AI Models

by Alisa Davidson

Published: March 11, 2025 at 5:57 am Updated: March 11, 2025 at 5:57 am

by Ana

Edited and fact-checked: March 11, 2025 at 5:57 am

In Brief

OpenAI suggests to detect when frontier reasoning models begin exploiting loopholes by using LLMs to monitor the models’ chains of thought.

OpenAI Suggests CoT Monitoring To Address Malicious Behavior In AI Models

Artificial intelligence research organization, OpenAI has discovered a way to detect when frontier reasoning models begin exploiting loopholes, by using large language models (LLMs) to monitor the models’ chains of thought.

Chain-of-thought (CoT) reasoning models process information in a way that is understandable to humans. By observing their “thought process,” OpenAI has been able to identify issues such as models attempting to bypass coding tests, misleading users, or abandoning tasks when they become too complex.

OpenAI believes that monitoring CoT reasoning might be one of the few tools available to oversee superhuman models in the future. The organization also found that while directly optimizing CoTs to meet specific standards (such as preventing reward hacking) can improve performance in the short term, it does not fully eliminate all misbehavior and may lead the model to conceal its intentions. OpenAI hopes that future research will find ways to optimize CoTs without these limitations, but in the meantime, further exploration is needed.

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving… pic.twitter.com/uX9f5n3zB9
— OpenAI (@OpenAI) March 10, 2025

Reward hacking refers to a situation where AI agents achieve high rewards by engaging in behaviors that do not align with the original intentions of their designers. As the development of more advanced frontier reasoning models progresses, it has become apparent that AI models are increasingly skilled at exploiting flaws within their tasks and inaccuracies in their reward functions. This has led to the emergence of models that can execute sophisticated reward hacks, particularly in coding tasks.

Detecting misaligned behavior caused by reward hacking proves to be a difficult challenge, often requiring human intervention to manually monitor the actions of an agent. However, this approach is unlikely to scale effectively, especially as more advanced models uncover increasingly complex behaviors.

Nevertheless, LLMs, like OpenAI’s o3-mini, trained with reinforcement learning to reason through CoT, present a potential solution for monitoring reward hacking. These models process information in natural language, making their reasoning easier for humans to comprehend. OpenAI has observed that frontier reasoning models often make their intentions clear within their chain-of-thought. For instance, some models explicitly state their intention to subvert a task, with statements like “Let’s hack.”

By using another LLM to monitor their thought process, misbehavior can be detected and flagged effectively. However, this method of monitoring is not foolproof. If strict supervision is applied directly to the chain-of-thought, the models can learn to conceal their intent, continuing to misbehave while hiding their true motivations.

OpenAI Recommends Light Optimization For Effective CoT Monitoring

OpenAI notes that the chain-of-thought monitoring is already proving to be valuable and is not merely a theoretical tool for the future. Reward hacking can be identified simply by reviewing the reasoning model’s output, where it clearly states in plain language its intention to engage in reward hacking.

Not long ago, language models faced challenges in generating coherent paragraphs of text. Now, they are capable of solving complex mathematical problems, conducting in-depth research by synthesizing information from various sources, and handling basic software engineering tasks. However, as these capabilities evolve, the potential for increasingly sophisticated and subtle forms of reward hacking grows as well. OpenAI’s models may develop misaligned behaviors, such as power-seeking, sandbagging, deception, and strategic manipulation.

The company believes that CoT monitoring could be one of the few effective methods available for supervising superhuman models. At this point, it advises AI developers working with frontier reasoning models to avoid applying strong supervision directly to the CoTs. The experiments indicate that light optimization pressure can lead to models that perform better and align more closely with their intended goals. However, it remains difficult to detect when models begin hiding their true intentions. Therefore, even with light supervision, OpenAI recommends proceeding with caution.

Tags:

Disclaimer

In line with the Trust Project guidelines, please note that the information provided on this page is not intended to be and should not be interpreted as legal, tax, investment, financial, or any other form of advice. It is important to only invest what you can afford to lose and to seek independent financial advice if you have any doubts. For further information, we suggest referring to the terms and conditions as well as the help and support pages provided by the issuer or advertiser. MetaversePost is committed to accurate, unbiased reporting, but market conditions are subject to change without notice.

About The Author

Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.

Alisa Davidson