AI4Bharat Releases ‘Airavata’, a Custom LLM to Improve Hindi Language in AI Models

by Kumar Gandharv

Published: January 25, 2024 at 5:32 am Updated: January 25, 2024 at 5:32 am

by Victor Dey

Edited and fact-checked: January 25, 2024 at 5:32 am

In Brief

India’s AI4Bharat announced the release of “Airavata”, an LLM built by fine-tuning OpenHathi to improve Hindi language support in AI models.

AI4Bharat Releases ‘Airavata’, a Custom LLM for Improved Hindi Language Support

AI4Bharat, the AI research lab at IIT Madras, has released Airavata, an instruction-tuned model for Hindi. According to the announcement, the model was built by fine-tuning Sarvam AI’s OpenHathi on diverse Hindi datasets to make it better suited for assistive tasks.

Hindi is the most widely spoken language in India, with over 43% of the population speaking it natively.

“Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages soon,” said the AI lab in a LinkedIn post. The performance of large language models (LLMs) relies on high-quality instruction-tuning datasets, yet diverse datasets of this kind are scarce for Hindi.

Major progress has been made in developing datasets for pre-training, such as RedPajama; for instruction tuning, such as Alpaca, UltraChat, Dolly, OpenAssistant and LMSYS-Chat; and for evaluation, with benchmarks such as AlpacaEval and MT-Bench. However, most of these advances have centered on the English language.

“There is some limited support for Indian languages, which can be attributed to the incidental inclusion of some Indian language data that slipped through the data filters during the pre-training of these language models. However, the representation of data, the efficacy of tokenizers, and task performance for Indian languages are considerably behind that of English,” AI4Bharat Labs said in its statement.

“The performance in Indian languages, even on closed-source models such as ChatGPT, GPT-4 and others, is inferior compared to English,” it added.

AI4Bharat Releases Instruction Tuning Datasets

The AI4Bharat team also released the instruction-tuning datasets used for the model to enable further research for IndicLLMs.

“Airavata” relies on human-curated, license-friendly datasets for instruction tuning. The team specifically avoids using data generated by proprietary models like GPT-4 because doing so would increase costs and, due to licensing restrictions, limit the free use of the models in other applications.

Instead, the team believes human-curated datasets are a more sustainable approach for building models for most Indic languages.

However, Airavata, like other LLMs, faces typical challenges. It may hallucinate, producing fabricated information, and it may struggle with accuracy on complex or specialized topics. There is also a risk of it producing objectionable or biased content.

The team clarified that the model is for research purposes and is not recommended for any production use cases.

Previously, the AI4Bharat lab launched an open-source video transcreation platform – Chitralekha – which includes a workforce management system facilitating the complete transcreation process of a video from one language to another, covering transcription, translation and voice-over for the translated language.

It was created in collaboration with EkStep, a not-for-profit foundation whose team was instrumental in developing India’s Aadhaar project.

Additionally, AI4Bharat has initiated the recruitment process for its AI resident and associate program for the 2024-25 term. This year-long pre-doctoral program emphasizes intensive work in natural language processing (NLP), speech, and vision projects.


