Gensyn Releases RL Swarm Framework For Collaborative Reinforcement Learning, Plans March Testnet Launch


In Brief
Gensyn has introduced RL Swarm to facilitate collaborative reinforcement learning and has announced a March testnet launch, enabling broader participation in the advancement of open machine intelligence.

Gensyn, a network for machine intelligence, has introduced RL Swarm, a decentralized peer-to-peer system designed to facilitate collaborative reinforcement learning over the internet. The project intends to launch a testnet next month, allowing broader participation in advancing open machine intelligence.
RL Swarm is a fully open-source platform that enables reinforcement learning models to train collectively across distributed systems. It serves as a real-time demonstration of research findings indicating that models leveraging RL can improve their learning efficiency when trained as part of a collaborative swarm rather than in isolation.
Running a swarm node lets a participant either start a new swarm or connect to an existing one via a public address. Within each swarm, models engage in reinforcement learning as a collective, using a decentralized communication protocol built on Hivemind to share knowledge and improve their models. By running the provided client software, participants can join a swarm, observe shared updates, and train models locally while benefiting from collective intelligence. Looking ahead, additional experiments will be introduced, encouraging broader engagement with the technology.
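Gensyn's client handles this bootstrapping itself, but as a rough illustration of what connecting to a Hivemind-based swarm involves at the protocol level, the minimal Python sketch below joins (or starts) a peer-to-peer DHT. The peer multiaddress is a placeholder, not an actual swarm address.

```python
# Minimal sketch of joining a Hivemind-based peer-to-peer swarm.
# The multiaddress below is a placeholder; Gensyn's client wraps this kind of
# bootstrapping in its own tooling, so treat this as illustrative only.
import hivemind

# Connect to an existing swarm via the public address of a known peer,
# or omit initial_peers entirely to start a new swarm from scratch.
dht = hivemind.DHT(
    initial_peers=["/ip4/203.0.113.7/tcp/31337/p2p/QmExamplePeerID"],  # placeholder address
    start=True,
)

# Print the addresses other participants could use to join through this node.
print("Visible multiaddresses:", dht.get_visible_maddrs())
```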
Individuals are invited to join RL Swarm to experience the system firsthand. Participation is accessible through both standard consumer hardware and more advanced cloud-based GPU resources.
How RL Swarm Works
Gensyn has long envisioned a future in which machine learning is decentralized and distributed across a vast network of devices. Instead of relying on large, centralized models, this approach would involve breaking models into smaller, interconnected components that operate collaboratively. As part of its research into this vision, Gensyn has explored various pathways toward decentralized learning and recently observed that reinforcement learning (RL) post-training is particularly effective when models communicate and provide feedback to one another.
Specifically, experiments indicate that RL models improve their learning efficiency when they train as part of a collaborative swarm rather than independently.
In this setup, each swarm node runs the Qwen 2.5 1.5B model and engages in solving mathematical problems (GSM8K) through a structured, three-stage process. In the first stage, each model independently attempts to solve the given problem, generating its reasoning and answer in a specified format. In the second stage, models review the responses of their peers and provide constructive feedback. In the final stage, each model votes on what it predicts the majority will consider the best answer, then refines its response accordingly. Through these iterative interactions, the models collectively enhance their problem-solving capabilities.
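To make the flow concrete, the sketch below mirrors the three stages in simplified Python. The helper names and prompts are hypothetical stand-ins, and the model call is stubbed out rather than actually invoking Qwen 2.5 1.5B.

```python
# Illustrative sketch of the three-stage round described above, with the model
# call stubbed out. The helper names are assumptions, not Gensyn's codebase.
from collections import Counter

def generate(prompt: str) -> str:
    """Stand-in for a call to the local Qwen 2.5 1.5B model."""
    return f"<model output for: {prompt[:40]}...>"

def run_round(problem: str, num_peers: int = 3) -> str:
    # Stage 1: each node independently solves the problem in the required format.
    answers = [generate(f"Solve step by step: {problem}") for _ in range(num_peers)]

    # Stage 2: nodes review their peers' responses and produce critiques.
    critiques = [generate(f"Critique this solution: {a}") for a in answers]

    # Stage 3: each node votes on which answer it expects the majority to prefer,
    # then refines its own response in light of the critiques and the consensus.
    votes = [generate(f"Pick the best of {answers} given {critiques}") for _ in range(num_peers)]
    consensus = Counter(votes).most_common(1)[0][0]
    return generate(f"Refine your answer using the consensus: {consensus}")

print(run_round("Natalia sold clips to 48 of her friends in April..."))
```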
Experimental results suggest that this method accelerates the learning process, enabling models to generate more accurate responses on unseen test data with fewer training iterations.
Data visualizations using TensorBoard illustrate key trends observed in a participating swarm node. The plots exhibit cyclic patterns due to periodic “resets” between rounds of collaborative training. The x-axis in each plot represents the time elapsed since the node joined the swarm, while the y-axis conveys a different performance metric. From left to right, the plots depict: Consensus Correctness Reward, which measures instances where a model both formatted its response correctly and produced a mathematically accurate answer; Total Reward, a weighted sum of rule-based evaluations such as formatting, mathematical accuracy, and logical coherence; Training Loss, which reflects how the model adjusts based on reward signals to optimize its learning process; and Response Completion Length, which tracks the number of tokens used in responses and indicates that models become more concise once they receive peer critiques.
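For readers who want to produce similar plots for their own node, the sketch below logs the same four quantities to TensorBoard. The scalar names and reward weights are illustrative assumptions rather than Gensyn's actual configuration.

```python
# Illustrative TensorBoard logging of the four metrics described above.
# Metric names and reward weights here are assumptions, not Gensyn's values.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/rl_swarm_node")

def log_step(step, correct_fmt, correct_math, coherent, loss, completion_tokens):
    # Consensus correctness: response is both well formatted and mathematically right.
    consensus_reward = float(correct_fmt and correct_math)
    # Hypothetical weighted sum of rule-based checks (formatting, accuracy, coherence).
    total_reward = 0.3 * correct_fmt + 0.5 * correct_math + 0.2 * coherent
    writer.add_scalar("reward/consensus_correctness", consensus_reward, step)
    writer.add_scalar("reward/total", total_reward, step)
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/completion_length", completion_tokens, step)

log_step(step=0, correct_fmt=1.0, correct_math=1.0, coherent=0.8, loss=1.42, completion_tokens=212)
writer.close()
```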
Disclaimer
In line with the Trust Project guidelines, please note that the information provided on this page is not intended to be and should not be interpreted as legal, tax, investment, financial, or any other form of advice. It is important to only invest what you can afford to lose and to seek independent financial advice if you have any doubts. For further information, we suggest referring to the terms and conditions as well as the help and support pages provided by the issuer or advertiser. MetaversePost is committed to accurate, unbiased reporting, but market conditions are subject to change without notice.
About The Author
Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.