Tencent AI Lab has released Covo-Audio, a 7B-parameter end-to-end Large Audio Language Model (LALM). The model is designed to unify speech processing and language intelligence by directly processing continuous audio inputs and generating audio outputs within a single architecture.
System Architecture
The Covo-Audio framework consists of four primary components designed for seamless cross-modal int...
Tencent AI Lab's Covo-Audio is an end-to-end model that natively processes audio inputs and generates high-fidelity audio outputs. By removing the cascaded ASR-LLM-TTS pipeline, in which each stage passes its errors on to the next, the design could reduce error propagation and information loss. The Hierarchical Tri-modal Speech-Text Interleaving strategy aligns continuous acoustic features, discrete speech tokens, and natural language text for improved semantic integrit...
