Academic Paper

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

Source: arXiv · Published: 2026-01-29 · Authors: Meng Yang, Jon McCormack, Maria Teresa Llano, Wanchao Su, Chao Lei

Abstract

Recent advances in multimodal large language models (MLLMs) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT and Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that annotates GiantMIDI-Piano with fine-grained metadata, resulting in a MIDI-text dataset. Compared with a baseline trained on MIDI converted into ABC notation under the same instruction-tuning procedure, MIDI-LLaMA performs substantially better in captioning and in semantic alignment for question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.
