Experts Caution Against ‘Malicious Inserts’ into AI Datasets in ChatGPT

by Damir Yalalov

Published: May 10, 2023 at 5:09 am Updated: May 10, 2023 at 5:09 am

by Karolina Gaszcz

Edited and fact-checked: May 10, 2023 at 5:09 am

In Brief

ChatGPT may be vulnerable to manipulation through its training data.

According to researchers, an attacker could have poisoned 0.01% of the LAION-400M or COYO-700M data sets in 2022 for only $60 USD.

ChatGPT is becoming increasingly popular, but recent research suggests that the technology may be vulnerable because of the training data it relies on. As models become more complex and data sets grow larger, malicious actors could exploit this weakness by manipulating the data sets so that machine learning models produce inaccurate results.

The primary concern is that the data sets used to train chatbots are often only “conditionally verified”: a certain level of trust is placed in the data without extensive verification. In other words, these data sets can have underlying issues that nobody has checked for. Because their sheer size means validation is often skipped, malicious actors have an opportunity to manipulate the data.

In fact, researchers estimate that in 2022 an attacker could have poisoned 0.01% of the LAION-400M or COYO-700M data sets for about $60. Although that fraction sounds small, left unchecked it would let malicious actors use the poisoned data for their own gain. The malicious data can eventually find its way into larger datasets, corrupting data quality and leading to unreliable machine-learning models.
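To put the 0.01% figure in perspective, the arithmetic can be sketched in a few lines of Python. The data set sizes below are an assumption: they are the nominal 400 million and 700 million entries implied by the names LAION-400M and COYO-700M, not exact counts, and the function name is purely illustrative.

```python
# Nominal sizes implied by the data set names (assumption, not exact counts).
NOMINAL_SIZES = {
    "LAION-400M": 400_000_000,
    "COYO-700M": 700_000_000,
}

# 0.01% is 1 entry in every 10,000; integer math avoids float rounding.
ONE_IN = 10_000


def poisoned_entries(total_entries: int, one_in: int = ONE_IN) -> int:
    """Number of entries an attacker controls at a rate of 1 in `one_in`."""
    return total_entries // one_in


for name, size in NOMINAL_SIZES.items():
    print(f"{name}: {poisoned_entries(size):,} poisoned entries")
```

Under these assumptions, 0.01% corresponds to roughly 40,000 and 70,000 entries respectively, which is why a seemingly tiny fraction can still matter.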

Steps must be taken to safeguard training data against malicious inserts. Aggregating and cross-checking several data sources should become the standard for chatbot training datasets, so that no single compromised source determines what the model learns. Additionally, companies should test their datasets to confirm they are not vulnerable to manipulation by malicious actors.
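One way to picture aggregating several sources is a simple agreement check: an item is kept only when enough independent sources report the same label for it. The sketch below is a minimal illustration, not a method described by the researchers; the source dictionaries, item IDs, labels, and the `aggregate` function are all hypothetical.

```python
from collections import Counter


def aggregate(sources, min_agreement=2):
    """Keep an item only if a strict majority of the sources that contain it,
    and at least `min_agreement` of them, report the same label."""
    trusted = {}
    all_ids = sorted(set().union(*sources))
    for item_id in all_ids:
        labels = [src[item_id] for src in sources if item_id in src]
        label, count = Counter(labels).most_common(1)[0]
        if count >= min_agreement and 2 * count > len(labels):
            trusted[item_id] = label
    return trusted


# Three hypothetical sources; source_c contains one poisoned label
# and one item that no other source can confirm.
source_a = {"img1": "cat", "img2": "dog", "img3": "car"}
source_b = {"img1": "cat", "img2": "dog", "img3": "car"}
source_c = {"img1": "cat", "img2": "click-here-to-win", "img4": "tree"}

trusted = aggregate([source_a, source_b, source_c])
print(trusted)
```

Here the poisoned label for `img2` is outvoted by the two sources that agree, and `img4` is dropped because only one source vouches for it. A real pipeline would need sources that are genuinely independent; if they all copy from the same compromised origin, agreement proves little.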

AI Chatbots with Malicious Code Can Be Vulnerable to Hacking

The threat of malicious code in chatbots can be serious: it can be used to steal user data, gain unauthorized access to servers, and enable activities such as money laundering or data exfiltration. If an AI chatbot is trained on data containing malicious inserts, it could unknowingly reproduce the malicious code in its responses and become a tool for attackers.

Malicious actors can exploit this vulnerability by deliberately introducing malicious code into the training data, and similar content can also slip in inadvertently when data is collected without verification. In addition, because AI chatbots learn from the data they are presented with, this could lead them to learn incorrect responses or even malicious behavior.

Another danger that AI chatbots may face is “overfitting.” This occurs when a prediction model is fitted too closely to the data it was trained on, which leads to poor predictions on new data. It can be a particular problem here because a chatbot that fits poisoned training data too closely could reproduce the malicious code it contains more faithfully in its responses.
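The interaction between overfitting and poisoned data can be illustrated with a toy example. The sketch below is not a chatbot; it assumes a one-dimensional task (label 1 for numbers of 50 or more, else 0), three hypothetical poisoned training points with flipped labels, and two illustrative models: a "memorizer" that copies the label of the nearest training point, and a simple threshold rule fitted to the same data.

```python
def true_label(x: int) -> int:
    """The clean rule the model is supposed to learn."""
    return 1 if x >= 50 else 0


POISONED = {10, 30, 70}  # hypothetical training points with flipped labels

# Training set: even numbers, with the poisoned points' labels flipped.
train = [(x, (1 - true_label(x)) if x in POISONED else true_label(x))
         for x in range(0, 100, 2)]
# Test set: odd numbers with clean labels, never seen during training.
test = [(x, true_label(x)) for x in range(1, 100, 2)]


def memorizer(x: int) -> int:
    """Overfit model: copy the label of the nearest training point
    (ties go to the smaller neighbour)."""
    nearest = min(train, key=lambda p: (abs(p[0] - x), p[0]))
    return nearest[1]


def fit_threshold(data) -> int:
    """Simple model: pick the cut-off with the fewest training errors."""
    candidates = [x for x, _ in data]
    return min(candidates,
               key=lambda t: sum((1 if x >= t else 0) != y for x, y in data))


threshold = fit_threshold(train)


def simple_model(x: int) -> int:
    return 1 if x >= threshold else 0


def accuracy(model, data) -> float:
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", simple_model)]:
    print(f"{name}: train={accuracy(model, train):.2f} "
          f"test={accuracy(model, test):.2f}")
```

The memorizer scores 1.00 on the training set because it reproduces every label, including the poisoned ones, and then drops to 0.94 on clean test data. The threshold rule scores only 0.94 on the poisoned training set but 1.00 on the test set, because it does not bend to fit the three bad points.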

To prevent these potential weaknesses, it is essential to be aware of the risks and to take precautions so that the training data used to teach ChatGPT is secure and reliable. The initial training data should also be kept separate and distinct, so that “malicious inserts” from one source cannot overlap with or spread into others. Where it is feasible to “capture” multiple confirmed domains, the data should be examined and compared against them for validation.

Chatbot technology promises to transform how people hold conversations. But before it can realize its full potential, it needs to be improved and safeguarded. Datasets for chatbots need to be carefully checked and hardened against malicious actors. By doing this, we can make full use of the technology’s potential and keep pushing the limits of artificial intelligence.
