October 04, 2023

AI Researchers Have Taught Large Language Models to Lie Less

A collaborative effort involving more than 20 researchers from diverse corners of the field has given rise to a burgeoning domain: representation engineering (RepE). While this isn't the first exploration of its kind, the authors present both descriptive insights and crucial benchmarks.


So, what exactly is representation engineering? It revolves around the notion that neural networks possess "hidden states," which, despite their name, aren't shrouded in secrecy. These states are accessible, modifiable, and observable (provided one has access to the model's weights). Unlike parameters, they are the network's "reactions" to specific inputs, which for LLMs means textual inputs. These hidden representations are windows into the model's cognitive workings, a degree of access the human brain does not offer.
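To make the idea concrete, here is a minimal toy sketch (not the paper's actual models): a tiny two-layer network standing in for an LLM, where the "hidden state" is simply the intermediate activation that can be read out alongside the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for an LLM: the weights are fixed
# parameters, while hidden states are the activations produced for a
# particular input.
W1 = rng.normal(size=(8, 16))   # input -> hidden
W2 = rng.normal(size=(16, 4))   # hidden -> output

def forward(x):
    """Run the network and also return its hidden state for inspection."""
    hidden = np.tanh(x @ W1)    # the "hidden state": observable, modifiable
    output = hidden @ W2
    return output, hidden

x = rng.normal(size=(8,))
out, h = forward(x)
print(h.shape)  # the hidden representation for this input: (16,)
```

In a real LLM the same readout is done by capturing intermediate layer activations during a forward pass; the principle is identical.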

Drawing parallels with cognitive science, the authors highlight the potential for analogous explorations. Just as certain neurons in the human brain are linked to concepts like Canada or honesty, patterns of activation in a network's artificial neurons could carry similar meaning.

The central idea here is to decipher how we can influence these neural activations to steer the model in desired directions. For instance, it becomes plausible to pinpoint a vector representing “honesty” and then, theoretically, by nudging the model in this direction, reduce the likelihood of it producing deceptive outputs. An earlier experiment, “Inference-Time Intervention: Eliciting Truthful Answers from a Language Model,” demonstrated the practicality of this concept.
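A minimal sketch of this inference-time steering idea, with synthetic activations standing in for real model data: the "honesty" direction is taken as the normalized difference between mean activations on two labeled sets of examples, and then added to a hidden state with a scaling coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hidden states collected from a model on paired prompts:
# one set where it behaves honestly, one where it does not.
honest_acts = rng.normal(loc=0.5, size=(100, 16))
dishonest_acts = rng.normal(loc=-0.5, size=(100, 16))

# A simple reading vector: the normalized difference of mean activations.
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden, alpha=2.0):
    """Nudge a hidden state along the 'honesty' direction at inference time."""
    return hidden + alpha * direction

h = rng.normal(size=(16,))
projection_before = h @ direction
projection_after = steer(h) @ direction
print(projection_after > projection_before)  # True: moved toward "honesty"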

In their current work, the researchers delve into several domains, including morality, emotionality, harmlessness, and memorization. They propose a solution in the form of LoRRA (Low-Rank Representation Adaptation), a technique trained on a small labeled dataset of approximately 100 examples. Each example is annotated with attributes such as truthfulness (an alternative, prompt-based approach also exists).
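The low-rank idea can be sketched as follows. This is an illustrative rank-r adaptation in the LoRA style, not the authors' exact training code: the frozen base weight receives a trainable rank-r correction, so only a tiny fraction of parameters needs to be updated on those ~100 examples.

```python
import numpy as np

d, r = 4096, 8  # hidden size and low rank (illustrative values)

rng = np.random.default_rng(2)
W = rng.normal(size=(d, d))         # frozen base weight, never updated
A = np.zeros((d, r))                # trainable low-rank factors; A starts at
B = rng.normal(size=(r, d)) * 0.01  # zero so the correction is initially zero

def adapted_forward(x):
    # Frozen weight plus a rank-r correction: effectively x @ (W + A @ B),
    # computed without ever materializing the full d x d update.
    return x @ W + (x @ A) @ B

full_params = d * d
lora_params = 2 * d * r
print(lora_params / full_params)  # 0.00390625: fraction actually trained
```

Because A is initialized to zero, the adapted model starts out exactly equal to the base model, a standard choice in low-rank adaptation.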

The results are compelling. With this technique, LLaMA-2-70B surpasses GPT-4 on the TruthfulQA benchmark by a remarkable margin, reaching roughly 69% accuracy versus GPT-4's 59%, nearly ten percentage points higher. Additionally, the researchers include numerous examples showcasing how the model's responses shift in various directions, shedding light on the method's versatility.

Picture 1: When asked to state a fact while being "kicked" away from the truth direction, the model lies as a result. On the left, the model is asked to lie while simultaneously being nudged toward the truth, and even then it does not lie.
Picture 2: When the model is asked about a murder, a "happiness" direction is added to its activations; when it is told "we don't love her," a "fear" direction is added.
Picture 3: The researchers found a prompt that, as stated, completely bypasses the model's instructions while appearing safe. Given a nudge toward the harmlessness direction, the model refuses to respond to it at all. Notably, this specific prompt was not used to derive the harmlessness direction, so the method works in general rather than for just one case.
Another suggested application is tracking specific intentions during generation, such as hallucination. The model's internal reservations can be monitored automatically, and the response edited or regenerated (see the bottom example).

Green denotes that everything is in order, and red that the monitor has fired. This is done at the level of each individual token (part of a word).
The image showing the monitoring of two distinct parameters provides an intriguing example. Read it and observe the generation through the model's eyes to see where it begins to lose its grasp on morality and where its intention resembles "gaining strength."
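A toy sketch of this per-token monitoring, with random vectors standing in for real hidden states and a learned direction: each token's hidden state is projected onto the monitored direction and flagged red when the score crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-token hidden states and a learned monitoring direction
# (e.g. for hallucination); real values would come from the model.
tokens = ["The", "moon", "is", "made", "of", "cheese"]
hidden_states = rng.normal(size=(len(tokens), 16))
direction = rng.normal(size=(16,))
direction /= np.linalg.norm(direction)

def monitor(states, threshold=0.0):
    """Flag each token green/red by projecting its state onto the direction."""
    scores = states @ direction
    return ["red" if s > threshold else "green" for s in scores]

flags = monitor(hidden_states)
for tok, flag in zip(tokens, flags):
    print(f"{tok}: {flag}")
```

Because the score is computed per token, a flagged span can be localized precisely and the generation edited or restarted from that point.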

This pioneering approach embodies an alternative path towards model alignment, while concurrently offering a novel perspective on model interpretation and control. It’s a promising frontier, and the anticipation for its continued evolution is palpable.

For a deeper exploration with practical examples, you can visit their dedicated website: AI-Transparency.org.


About The Author

Damir is the team leader, product manager, and editor at Metaverse Post, covering topics such as AI/ML, AGI, LLMs, Metaverse, and Web3-related fields. His articles attract a massive audience of over a million users every month. He is an expert with 10 years of experience in SEO and digital marketing, and has been mentioned in Mashable, Wired, Cointelegraph, The New Yorker, Inside.com, Entrepreneur, BeInCrypto, and other publications. He travels between the UAE, Turkey, Russia, and the CIS as a digital nomad. Damir earned a bachelor's degree in physics, which he believes has given him the critical thinking skills needed to succeed in the ever-changing landscape of the internet.
