Date: 2024-01-25
AI4Bharat Releases ‘Airavata’, a Custom LLM to Improve Hindi Language in AI Models

AI4Bharat, the AI research lab at IIT Madras, has released Airavata, an instruction-tuned model for Hindi. According to the announcement, the model was built by fine-tuning Sarvam AI’s OpenHathi on diverse Hindi datasets to make it better suited for assistive tasks.

Hindi is the most widely spoken language in India, with over 43% of the population counted as native speakers.

“Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages soon,” said the AI lab in a LinkedIn post. It is important to note that the performance of large language models (LLMs) relies on high-quality instruction tuning datasets. However, there is a scarcity of diverse datasets available for Hindi.

Major progress has been made in developing datasets for pre-training, such as RedPajama; for instruction tuning, such as Alpaca, UltraChat, Dolly, OpenAssistant and LMSYS-Chat; and for evaluation, with benchmarks such as AlpacaEval and MT-Bench. However, most of these advancements have centered on the English language.

“There is some limited support for Indian languages, which can be attributed to the incidental inclusion of some Indian language data that slipped through the data filters during the pre-training of these language models. However, the representation of data, the efficacy of tokenizers, and task performance for Indian languages are considerably behind that of English,” AI4Bharat Labs said in its statement.

“The performance in Indian languages, even on closed-source models such as ChatGPT, GPT-4 and others, is inferior compared to English,” it added.

AI4Bharat Releases Instruction Tuning Datasets

The AI4Bharat team also released the instruction-tuning datasets used for the model to enable further research for IndicLLMs.

Airavata relies on human-curated datasets with licenses that permit their use in developing instruction-tuned models. The team specifically avoids using data generated by proprietary models like GPT-4 because doing so would increase costs and, due to licensing restrictions, limit the free use of the resulting models in other applications.

Instead, the team believes human-curated datasets are a more sustainable approach for building models for most Indic languages.

However, Airavata, like other LLMs, faces typical challenges. These include the possibility of hallucination, which leads to fabricated information, and difficulty with accuracy on complex or specialized topics. There’s also a risk of producing objectionable or biased content.

The team clarified that the model is for research purposes and is not recommended for any production use cases.

Previously, the AI4Bharat lab launched an open-source video transcreation platform – Chitralekha – which includes a workforce management system facilitating the complete transcreation process of a video from one language to another, covering transcription, translation and voice-over for the translated language.

It was created in collaboration with EkStep – a not-for-profit foundation and the team that was instrumental in developing India’s Aadhaar project.

Additionally, AI4Bharat has initiated the recruitment process for its AI resident and associate program for the 2024-25 term. This year-long pre-doctoral program emphasizes intensive work in natural language processing (NLP), speech, and vision projects.

