Introducing SUTRA: The Next Frontier in Multilingual AI
Learn about our proprietary series of ultrafast and lightweight Generative AI models with multilingual conversation, search, and visual capabilities.
Abhijit Bendale, Michael Sapienza, Simon Gibbs, and Steven Ripplinger
We're excited to introduce SUTRA, a new multilingual large language model (LLM) architecture. SUTRA is designed to separate concept learning from language learning, enabling it to operate in over 50 languages and deliver state-of-the-art performance on multilingual benchmarks.
Why SUTRA?
Most models that power today's generative AI tools are trained predominantly on English data, leaving a massive gap for numerous other languages and potentially limiting access to this powerful technology for billions of non-English speakers around the world.
SUTRA was developed to address two main challenges of existing multilingual LLMs: the high computational and scaling costs of language-specific models, and the difficulties larger LLMs face with multilingual tasks, which often lead to language confusion.
While language-specific models perform admirably in their niches, they are inefficient to train and difficult to deploy at scale, since most applications require multilingual capabilities that would demand a separate model for each language. On the other end of the spectrum, more general large language models (LLMs) falter on multilingual tasks, struggling with language nuances such as grammar and formality.
SUTRA introduces a cutting-edge method that achieves unparalleled multilingual performance, setting a new standard for language model versatility.
The Innovation Behind SUTRA
Humans first understand the world through concepts and then gradually learn their native language. Once fluent in one language, we learn new languages without having to re-learn common core concepts. Similarly, central to our approach is the innovative strategy of separating concept learning from language learning. This enables the core LLM capabilities to operate within a conceptual or latent space, while the heavy lifting of tokenization and translation is handled by specialized encoders and decoders. This approach enhances the scalability of training LLMs and supports a greater number of languages.
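To make the separation concrete, here is a minimal, purely illustrative PyTorch sketch of such a pipeline: a language encoder maps tokens into a shared concept (latent) space, a core concept model reasons in that space, and a language decoder projects back into the target language's vocabulary. All module names, layer counts, and dimensions are hypothetical and are not taken from SUTRA's actual implementation.

```python
import torch
import torch.nn as nn


class ConceptPipeline(nn.Module):
    """Illustrative concept-centric pipeline; not SUTRA's implementation."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        # Language encoder: maps language-specific tokens into a shared
        # concept (latent) space.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Concept model: the core model that reasons in the latent space,
        # independent of any surface language.
        self.concept_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Language decoder: projects concept states back into the target
        # language's vocabulary.
        self.language_decoder = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        concepts = self.language_encoder(self.embed(token_ids))  # tokens -> concepts
        reasoned = self.concept_model(concepts)                  # reasoning in latent space
        return self.language_decoder(reasoned)                   # concepts -> target-language logits


# Example: a batch of 2 sequences of 16 token ids in, per-token logits out.
model = ConceptPipeline()
logits = model(torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```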
SUTRA follows a three-step training approach. First, during the concept-learning stage, the core concept model is trained to grasp concepts within a small set of languages, laying a solid foundation of concepts and skills. In parallel, specialized encoders and decoders, along with multilingual tokenizers designed for multi-language translation, are trained to ensure concept consistency across languages. Finally, we perform language-concept alignment, merging concept understanding with linguistic proficiency. SUTRA's methodology leverages commonalities between related languages such as Hindi, Gujarati, and Bengali, which share semantics and grammatical structure. This strategy significantly enhances linguistic proficiency and scalability, preparing the model to tackle the complexities of multilingual communication.
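As one illustration of the second stage, a concept-consistency objective can pull the concept embeddings of parallel sentences in different languages toward each other. The sketch below is hypothetical: the linear "encoders" and random features merely stand in for SUTRA's actual encoders and for tokenized parallel data, neither of which is described here.

```python
import torch
import torch.nn.functional as F

# Stand-in "encoders" for two languages; in reality these would be the
# specialized encoders described above, trained on parallel data.
d_model = 64
encoder_hi = torch.nn.Linear(128, d_model)   # hypothetical Hindi encoder
encoder_en = torch.nn.Linear(128, d_model)   # hypothetical English encoder

# Random features standing in for tokenized parallel sentences
# (8 Hindi sentences and their 8 English translations).
hindi_features = torch.randn(8, 128)
english_features = torch.randn(8, 128)

concepts_hi = encoder_hi(hindi_features)
concepts_en = encoder_en(english_features)

# Pull the concept embeddings of translations toward each other, so the same
# idea lands at the same point in concept space regardless of language.
consistency_loss = 1.0 - F.cosine_similarity(concepts_hi, concepts_en).mean()
consistency_loss.backward()
print(f"concept-consistency loss: {consistency_loss.item():.3f}")
```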
SUTRA uses a sparse Mixture of Experts (MoE) architecture to enhance model efficiency by distributing computation across specialized experts, enabling significant scaling without a linear increase in computational cost. The SUTRA dataset, with over 100 million data points, is one of the largest conversational datasets, with a key differentiator being long-form, multi-turn conversations. SUTRA models are trained on this dataset, complemented by publicly available datasets, creating a rich and diverse training environment.
Figure: Topic distribution across 1M samples from the SUTRA dataset, showing conversations spanning a wide range of topics.
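For readers unfamiliar with sparse Mixture of Experts layers, the sketch below shows the general idea: a router selects a small number of experts per token, so only a fraction of the parameters is active for any given input. The expert count, dimensions, and routing details here are generic placeholders, not SUTRA's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts layer, for illustration only."""

    def __init__(self, d_model: int = 256, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its top-k experts,
        # so only a fraction of the parameters is active per token.
        scores = self.router(x)                           # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        output = torch.zeros_like(x)
        for slot in range(self.top_k):
            for expert_id, expert in enumerate(self.experts):
                mask = indices[:, slot] == expert_id
                if mask.any():
                    output[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return output


# Example: 16 tokens pass through the layer, but each token activates
# only 2 of the 8 experts.
layer = SparseMoELayer()
tokens = torch.randn(16, 256)
print(layer(tokens).shape)  # torch.Size([16, 256])
```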
Benchmarking Excellence
In English, SUTRA is on par with GPT-3.5 and many other prominent models. In non-English languages like Korean, Hindi, and Gujarati, SUTRA truly shines, outperforming its peers by 20-30% on the MMLU benchmark. Notably, its purpose-built multilingual tokenizers reduce token usage by over 50% across languages, leading to significant savings during generation. These results underscore SUTRA's robust understanding and processing capabilities, making it an optimal choice for developers and businesses seeking advanced multilingual AI solutions with low latency and high throughput.
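One way to see why tokenizer efficiency matters is to count how many tokens a generic BPE tokenizer spends on non-English text. The snippet below uses OpenAI's open-source tiktoken library purely as a stand-in for a generic tokenizer; SUTRA's own tokenizers are not publicly available, so no direct comparison is made here.

```python
# Generic BPE tokenizers typically need far more tokens per character for
# Devanagari or Hangul text than for English, which inflates cost and latency.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "How is the weather today?",
    "Hindi": "आज मौसम कैसा है?",
    "Korean": "오늘 날씨 어때요?",
}
for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```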
Online and Up-To-Date
Traditional LLMs face a significant limitation: their data is time-locked, resulting in a knowledge cutoff that can quickly become outdated. In contrast, SUTRA models are connected to the internet and can reason from diverse data sources. This connectivity expands SUTRA's knowledge base to include the live internet, ensuring timely and accurate responses and making it especially valuable for applications that depend on current data.
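The post does not detail how this connectivity works internally, but one common pattern for grounding a model in live data is retrieval-augmented prompting: fetch fresh results, then let the model reason over them. The sketch below is generic, and web_search and llm_complete are hypothetical placeholders rather than SUTRA APIs.

```python
from datetime import date


def web_search(query: str) -> list[str]:
    """Placeholder for any live search or scraping backend."""
    return [f"[snippet about '{query}' retrieved on {date.today()}]"]


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an online LLM endpoint."""
    return f"(model answer grounded in: {prompt[:60]}...)"


def answer_with_live_context(question: str) -> str:
    # Retrieve up-to-date snippets, then ask the model to answer from them.
    snippets = web_search(question)
    prompt = (
        "Use the following up-to-date sources to answer.\n"
        + "\n".join(snippets)
        + f"\n\nQuestion: {question}"
    )
    return llm_complete(prompt)


print(answer_with_live_context("What is the latest SUTRA model release?"))
```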
Beyond Text with SUTRA-Avatar
Moving beyond text, SUTRA-Avatar is a visual generative AI model that creates photorealistic AI characters capable of realistic interactions. These characters can produce natural-sounding speech, display a range of emotions, and perform gestures in real time, adding new dimensions to AI communication.
Looking Ahead with SUTRA
Our journey with SUTRA is only beginning. As we pave the way for further advancements, including the development of phonetic models (SUTRA-Dhvanim), our aim remains steadfast: to transcend linguistic barriers and build AI models and experiences aimed at the global community. We are excited to see what is built with SUTRA and invite partners to try it out for themselves at playground.two.ai.