News Report Technology
January 25, 2024

AI4Bharat Releases ‘Airavata’, a Custom LLM to Improve Hindi Language Support in AI Models


AI4Bharat, the AI research lab at the Indian Institute of Technology (IIT) Madras, has released Airavata, an instruction-tuned model for Hindi. According to the announcement, the model was built by fine-tuning Sarvam AI’s OpenHathi on diverse Hindi datasets to make it better suited for assistive tasks.

Hindi is the most widely spoken language in India, with over 43% of the population speaking it natively.

“Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages soon,” said the AI lab in a LinkedIn post. It is important to note that the performance of large language models (LLMs) relies on high-quality instruction tuning datasets. However, there is a scarcity of diverse datasets available for Hindi.

Major progress has been made in developing pre-training datasets such as RedPajama; instruction-tuning datasets such as Alpaca, UltraChat, Dolly, OpenAssistant and LMSYS-Chat; and evaluation benchmarks such as AlpacaEval and MT-Bench. However, most of these advancements have centered on the English language.

“There is some limited support for Indian languages, which can be attributed to the incidental inclusion of some Indian language data that slipped through the data filters during the pre-training of these language models. However, the representation of data, the efficacy of tokenizers, and task performance for Indian languages are considerably behind that of English,” AI4Bharat Labs said in its statement.

“The performance in Indian languages, even on closed-source models such as ChatGPT, GPT-4 and others, is inferior compared to English,” it added.

AI4Bharat Releases Instruction Tuning Datasets

The AI4Bharat team also released the instruction-tuning datasets used for the model to enable further research for IndicLLMs.

Airavata relies on human-curated, license-friendly datasets for instruction tuning. The team specifically avoids using data generated by proprietary models like GPT-4 because doing so would increase costs and, due to licensing restrictions, limit the free use of the resulting models in other applications.

Instead, the team believes human-curated datasets are a more sustainable approach for building models for most Indic languages.

However, Airavata, like other LLMs, faces typical challenges. It may hallucinate, producing fabricated information, and may struggle with accuracy on complex or specialized topics. There’s also a risk of producing objectionable or biased content.

The team clarified that the model is for research purposes and is not recommended for any production use cases.

Previously, the AI4Bharat lab launched an open-source video transcreation platform – Chitralekha – which includes a workforce management system facilitating the complete transcreation process of a video from one language to another, covering transcription, translation and voice-over for the translated language.

It was created in collaboration with EkStep, a not-for-profit foundation and the team that was instrumental in developing India’s Aadhaar project.

Additionally, AI4Bharat has initiated the recruitment process for its AI resident and associate program for the 2024-25 term. This year-long pre-doctoral program emphasizes intensive work in natural language processing (NLP), speech, and vision projects.


About The Author

Kumar Gandharv

Kumar is an experienced Tech Journalist with a specialization in the dynamic intersections of AI/ML, marketing technology, and emerging fields such as crypto, blockchain, and NFTs. With over 3 years of experience in the industry, Kumar has established a proven track record in crafting compelling narratives, conducting insightful interviews, and delivering comprehensive insights. Kumar's expertise lies in producing high-impact content, including articles, reports, and research publications for prominent industry platforms. With a unique skill set that combines technical knowledge and storytelling, Kumar excels at communicating complex technological concepts to diverse audiences in a clear and engaging manner.
