multimodal FMs - video / audio
- Lead team: team-2
In this session, our readings cover:
Required Readings:
NVLM: Open Frontier-Class Multimodal LLMs
- [Submitted on 17 Sep 2024 (v1), last revised 22 Oct 2024 (this version, v2)]
- Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
-
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at this https URL and will open-source the training code for the community soon.
- There are two main approaches to building multimodal LLMs:
- Method A: Unified Embedding Decoder Architecture approach;
- Method B: Cross-modality Attention Architecture approach.
- Directly compared:
- Method A: The Unified Embedding Decoder Architecture (“decoder-only architecture,” NVLM-D),
- Method B: The Cross-Modality Attention Architecture (“cross-attention-based architecture,” NVLM-X),
- A hybrid approach (NVLM-H).
- Key findings are as follows:
- NVLM-X: Offers superior computational efficiency for high-resolution images.
- NVLM-D: Delivers higher accuracy for OCR-related tasks.
- NVLM-H: Combines the strengths of both approaches for optimal performance.
LLMs Meet Multimodal Generation and Editing: A Survey
- [Submitted on 29 May 2024 (v1), last revised 9 Jun 2024 (this version, v2)]
- Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
- With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL
A Survey on Speech Large Language Models
- Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
- [Submitted on 24 Oct 2024 (v1), last revised 25 Oct 2024 (this version, v2)]
- Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs in the broad sense of Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition(ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference(Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs’ advancements in Rich Audio Transcription and its potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
Video Understanding with Large Language Models: A Survey
- Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
- [Submitted on 29 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v4)]
- With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at this https URL.
Beta Release of Zonos-v0.1
FEBRUARY 10, 2025 PALO ALTO, CALIFORNIA We are excited to announce the release of Zonos-v0.1 beta, featuring two expressive and real-time text-to-speech (TTS) models with high-fidelity voice cloning. We are releasing our 1.6B transformer and 1.6B hybrid under an Apache 2.0 license.