
Ola: Pushing the Frontiers of Omni-Modal Language Model

Zuyan Liu¹·²·* Yuhao Dong²·³·* Jiahui Wang¹
Ziwei Liu³ Winston Hu² Jiwen Lu¹ Yongming Rao²·¹
¹Tsinghua University  ²Tencent Hunyuan Research  ³S-Lab, NTU  *Equal Contribution

Abstract

Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, substantially pushing the frontiers of omni-modal language models. We conduct a comprehensive exploration of the architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.

Roads to Ola


We illustrate our innovations in visual and audio understanding in subfigures (a) and (b). Benefiting from the improved architecture and data, Ola achieves performance comparable to specialized models and outperforms state-of-the-art omni-modal models. We carefully design the training strategy for omni-modal models based on cross-modal and progressive alignment, as illustrated in subfigure (c).

Ola Performance


Ola pushes the frontiers of the omni-modal language model across image, video, and audio understanding benchmarks. We compare Ola with existing state-of-the-art open-source multimodal models and GPT-4o on mainstream image, video, and audio benchmarks. For fair comparison, we select roughly 7B-parameter versions of existing MLLMs. Thanks to our progressive alignment strategy, Ola outperforms both omni-modal and specialized MLLMs across all modalities.

Ola Architecture

Ola Architecture. Ola supports omni-modal inputs including text, image, video, and audio, and can process them simultaneously with competitive performance on understanding tasks across all of these modalities. Meanwhile, Ola supports user-friendly, real-time streaming decoding for text and speech thanks to its text detokenizer and speech decoder.
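
To make the omni-modal input flow concrete, below is a minimal PyTorch-style sketch of how text, visual, and audio features could be merged into a single token sequence before the LLM decoder. All module names, vocabulary size, and feature dimensions here are illustrative assumptions, not the actual Ola implementation; the streaming text detokenizer and speech decoder are omitted.

```python
# Minimal sketch of an omni-modal input pipeline (hypothetical module names).
import torch
import torch.nn as nn


class OmniInputSketch(nn.Module):
    def __init__(self, llm_dim=4096, vision_dim=1024, audio_dim=768, vocab=32000):
        super().__init__()
        # Stand-ins for the actual vision/audio encoders and projectors.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)  # image & video frame features
        self.audio_proj = nn.Linear(audio_dim, llm_dim)    # speech / audio features
        self.text_embed = nn.Embedding(vocab, llm_dim)     # LLM token embeddings

    def forward(self, text_ids, vision_feats=None, audio_feats=None):
        # Project every modality into the shared LLM token space, then
        # concatenate along the sequence dimension for the LLM decoder (omitted).
        tokens = [self.text_embed(text_ids)]
        if vision_feats is not None:
            tokens.append(self.vision_proj(vision_feats))
        if audio_feats is not None:
            tokens.append(self.audio_proj(audio_feats))
        return torch.cat(tokens, dim=1)


# Usage: combine a short text prompt with dummy video-frame and audio features.
model = OmniInputSketch()
seq = model(
    text_ids=torch.randint(0, 32000, (1, 16)),
    vision_feats=torch.randn(1, 64, 1024),
    audio_feats=torch.randn(1, 32, 768),
)
print(seq.shape)  # torch.Size([1, 112, 4096])
```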

Ola Training Strategies

Illustrations of the Ola Progressive Modality Alignment. The left part visualizes the relationships among modalities: speech acts as the connection between language and audio knowledge, while video builds the bridge with highly relevant visual and audio information. Accordingly, we design a progressive alignment training strategy that starts from the primary modalities and extends to the peripheral ones. Furthermore, we construct cross-modal video-audio data to better capture the relationships among modalities.
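
As a rough illustration of the progressive schedule, the pipeline can be viewed as a sequence of stages that gradually enlarge the modality set. The stage boundaries and data mixtures below are assumptions for exposition, not the paper's exact recipe.

```python
# Illustrative sketch of a progressive modality-alignment schedule.
# Stage contents are assumptions for exposition only.
PROGRESSIVE_STAGES = [
    {"stage": 1, "modalities": ("text", "image"),
     "note": "align the most distinct modalities first"},
    {"stage": 2, "modalities": ("text", "image", "video"),
     "note": "video bridges visual and temporal information"},
    {"stage": 3, "modalities": ("text", "image", "video", "audio"),
     "note": "finish with cross-modal video-audio alignment data"},
]


def run_progressive_training(stages):
    for cfg in stages:
        # Placeholder for the per-stage training loop (data loading,
        # optimization, checkpointing) on the listed modality mixture.
        print(f"Stage {cfg['stage']}: {', '.join(cfg['modalities'])}"
              f" -- {cfg['note']}")


run_progressive_training(PROGRESSIVE_STAGES)
```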

Benchmark Performance

Main Results across Image, Video, and Audio Understanding Benchmarks. We select representative image, video, and audio benchmarks and compare against mainstream state-of-the-art open-source specialized models in each modality. We also include open-source omni-modal LLMs for comparison.

Examples of Ola


Citation (BibTeX)


@article{liu2025ola,
  title={Ola: Pushing the Frontiers of Omni-Modal Language Model},
  author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
  journal={arXiv preprint arXiv:2502.04328},
  year={2025}
}