ByteDance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System



Simultaneous speech translation (SiST) is one of the most difficult challenges in translation: rendering spoken words into another language in real time, paving the way for instantaneous communication across language barriers. Machine-assisted autonomous interpretation has attracted considerable attention in natural language processing (NLP). Traditional simultaneous translation systems typically cascade streaming Automatic Speech Recognition (ASR), punctuation, and Machine Translation (MT) models. Unfortunately, in such cascaded systems the ASR module is a common source of latency and error propagation.

Academic SiST models and commercial SiST engines have come a long way, yet translation quality still needs to improve. Human studies have evaluated the currently available SiST systems, and from a user-centered standpoint these systems significantly limit the efficacy of communication, since they deliver less than 42% of the correct information to listeners. A human interpreter, by contrast, conveys at least 70% of the intended meaning, and a high-quality interpretation can exceed 95%. Accordingly, the researchers use 80% to denote highly qualified human interpreters in this work. Given their enormous success on machine and speech translation, LLMs are a natural candidate for the SiST task.

Integrating an LLM into SiST is nontrivial. First, the read-write policy (deciding when to keep reading input speech and when to write out a partial translation) is difficult for an LLM to learn. Second, LLMs struggle with rare terms and domain terminology absent from their training data, which makes human-equivalent performance hard to reach. Finally, performance on the SiST task is still hindered by the shortage of training data. In response to these challenges, researchers from ByteDance have introduced CLASI, a Cross-Lingual Agent that achieves Simultaneous Interpretation through the repeated execution of a set of operations.
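To make the read-write policy concrete, the loop below is a minimal sketch of the idea: keep reading input tokens until a semantic boundary is detected, then write out a translation of the buffered chunk. All names here (`detect_boundary`, `translate_chunk`, `read_write_loop`) are hypothetical placeholders for illustration, not CLASI's actual API, and the "translation" is a trivial stand-in.

```python
def detect_boundary(tokens):
    """Toy boundary detector: treat punctuation as a semantic chunk boundary."""
    return bool(tokens) and tokens[-1] in {",", ".", "?", "!"}

def translate_chunk(tokens):
    """Placeholder for the LLM translation call (uppercasing stands in for it)."""
    return " ".join(tokens).upper()

def read_write_loop(stream):
    """READ tokens from the stream; WRITE a translation at each boundary."""
    buffer, outputs = [], []
    for token in stream:                # READ action: consume more input
        buffer.append(token)
        if detect_boundary(buffer):     # policy decides it is safe to commit
            outputs.append(translate_chunk(buffer))  # WRITE action
            buffer = []
    if buffer:                          # flush any trailing partial chunk
        outputs.append(translate_chunk(buffer))
    return outputs
```

In CLASI the boundary decision is learned from data rather than rule-based, but the read/write alternation follows the same shape.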

CLASI overcomes the first obstacle by emulating how human interpreters segment full sentences into smaller, more manageable pieces based on syntactic markers and contextual meaning. This is achieved through a data-driven policy-learning method that lets CLASI learn and apply a robust read-write policy for SiST. To address the second obstacle, the CLASI agent is equipped with two additional modules: a memory that records speech context and an external knowledge database of terminology and matched translations. However, querying the external knowledge database can introduce noise and slow down inference. To mitigate this, the researchers propose Multi-Modal Retrieval-Augmented Generation (MM-RAG), which uses a multi-modal retriever to fetch only the most relevant entries from the external database, improving the efficiency of the CLASI agent.
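The retrieval step can be sketched as a nearest-neighbor lookup over an embedded terminology database. The toy embeddings and entries below are stand-ins for CLASI's multi-modal retriever, whose actual encoder and database are not public; only the top-ranked terms would be injected into the prompt.

```python
import math

# Hypothetical terminology database: term -> (translation, embedding).
# Romanized translations and 2-d embeddings are illustrative only.
TERM_DB = {
    "neural network": ("shen jing wang luo", [1.0, 0.0]),
    "transformer":    ("bian ya qi",         [0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_terms(query_vec, top_k=1):
    """Return the top-k (term, translation) pairs most similar to the query."""
    scored = sorted(
        TERM_DB.items(),
        key=lambda kv: cosine(query_vec, kv[1][1]),
        reverse=True,
    )
    return [(term, trans) for term, (trans, _) in scored[:top_k]]
```

Restricting the prompt to the top-k matches is what keeps the external database from adding noise and latency, which is the motivation the authors give for MM-RAG.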

The retrieved information and the memory context are added to the LLM agent's prompt, improving the translation through in-context learning. To tackle the data scarcity of the SiST task, the team uses a three-stage training methodology: pretraining, continual training, and fine-tuning. The LLM and the audio encoder are pre-trained separately on massive internal datasets. The model is then continually trained on billions of tokens of low-quality synthetic speech-translation data to achieve modal alignment between speech and text. To help the LLM make better use of contextual information from the retriever and from preceding translations, several auxiliary tasks are incorporated to improve its in-context learning capability. Finally, a small amount of human-annotated data is used to fine-tune the model, making it more robust and producing better translations by mimicking the behavior of professional human interpreters. Since SiST frequently involves compression, abstraction, and paraphrasing, traditional automatic evaluation metrics for simultaneous interpretation may not accurately reflect its performance.
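The prompt-assembly step described above can be sketched as follows. The template, field names, and wording are hypothetical (CLASI's actual prompt format is not public); the point is simply that retrieved glossary entries and the translation memory are concatenated into the in-context prompt ahead of the source chunk.

```python
def build_prompt(source_chunk, memory, glossary):
    """Assemble an in-context prompt from prior translations (memory) and
    retrieved terminology (glossary). Hypothetical format for illustration."""
    lines = ["You are a simultaneous interpreter."]
    if glossary:
        lines.append("Glossary:")
        lines += [f"- {src} -> {tgt}" for src, tgt in glossary]
    if memory:
        # Prior committed translations give the LLM discourse context.
        lines.append("Context so far: " + " ".join(memory))
    lines.append("Translate: " + source_chunk)
    return "\n".join(lines)
```

A call such as `build_prompt("the transformer model", ["Hello everyone."], [("transformer", "bian ya qi")])` yields a prompt with the glossary entry and prior context placed before the text to translate.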

The researchers therefore propose Valid Information Proportion (VIP), a new evaluation metric aligned with how human interpreters are judged. Since the primary goal of SiST is real-time communication, VIP measures the proportion of information that is transmitted precisely. In human evaluations on challenging real-world long-speech datasets, diverse and varied in topic, the proposed method significantly beats other available systems. For example, on Chinese-to-English translation, CLASI achieves an 81.3% VIP score, exceeding the 80% benchmark used for highly qualified human interpreters. This promising result indicates a bright future for SiST.
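As a rough illustration, VIP can be thought of as the fraction of reference information units that the translation conveys. The function below is a simplified sketch under that assumption; in the paper, VIP is judged by human annotators on meaning, not by string matching as done here.

```python
def vip_score(reference_units, conveyed_units):
    """Valid Information Proportion, simplified: the fraction of reference
    information units that appear among the units the translation conveyed.
    (The real metric relies on human judgment of meaning, not exact matching.)"""
    if not reference_units:
        return 0.0
    conveyed = set(conveyed_units)
    hits = sum(1 for unit in reference_units if unit in conveyed)
    return hits / len(reference_units)
```

Under this toy definition, conveying four of five reference units scores 0.8, i.e. the 80% level the article associates with highly qualified interpreters.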

The results on Chinese-to-English and English-to-Chinese tasks were much better than those of commercial systems, but the team notes that language coverage should be expanded in the future. In the presented implementation of CLASI, each translation round triggers a full action sequence. Since the model can translate accurately without external knowledge in simple scenarios, some actions are optional there, and the model could be trained to skip unnecessary steps in future work.

The Valid Information Proportion (VIP) metric is thus proposed to improve human evaluation, which underscores the need for more reliable automated quality and latency measurements in the future. The evidence also points to the potential of reinforcement learning from human feedback (RLHF) to enhance LLM performance. While CLASI outperforms prior state-of-the-art systems, there is a clear need for further research into multi-modal reward models and RL approaches for SiST. Promising directions include deeper multi-modal integration, such as end-to-end speech-to-speech or video-to-video translation.


Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.
