Engineers at the University of Pennsylvania School of Engineering and Applied Science have developed SmartDJ, an AI-powered editor that lets users modify immersive audio environments with instructions in everyday language, with potential applications in VR, AR and sound design.
Instead of requiring users to specify individual edits, SmartDJ can respond to high-level requests like “make this sound like a busy office,” then plan and carry out the steps needed to achieve that result.
The system addresses two major limitations of earlier AI audio-editing tools. First, most prior systems worked best with rigid, template-like commands, requiring users to identify sounds to add or remove. Second, those tools generally operated on single-channel, or “mono,” audio, losing the spatial cues that are necessary for an immersive audio experience.
One of the central challenges of AI audio editing is that understanding a user’s request and generating sounds are usually handled by different kinds of AI systems. “We use language models to deal with text,” says Zitong Lan, a doctoral student in Electrical and Systems Engineering (ESE) and the study’s first author. “We further use diffusion models to edit sounds.”
The difference comes down to what each system has been trained to do. Language models learn patterns in words, helping them to interpret what users mean and to generate text in response. Diffusion models, by contrast, are designed to create media by gradually shaping noise into a coherent signal.
To bridge the gap, the team introduced an audio language model, or ALM, into the editing loop. Trained on both sound and text, the ALM analyzes the original audio together with the user’s prompt, then breaks that prompt into a sequence of smaller editing actions, such as adding, removing or repositioning a sound. A diffusion model then carries out those actions step by step, allowing SmartDJ to both interpret language and edit audio.
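The plan-then-execute loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the team’s implementation: `plan_edits` stands in for the ALM (which in the real system also conditions on the input audio), `apply_step` stands in for the diffusion model, and the `EditStep` fields, example sounds, and string-based “audio scene” are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    action: str      # "add", "remove", or "reposition"
    sound: str       # e.g. "phone ringing"
    position: str    # spatial target in the stereo field, e.g. "right"
    gain_db: float   # level adjustment in decibels

def plan_edits(prompt: str) -> list[EditStep]:
    """Stand-in for the audio language model (ALM): break a
    high-level prompt into a sequence of concrete edit steps."""
    if "busy office" in prompt:
        return [
            EditStep("add", "keyboard typing", "left", 0.0),
            EditStep("add", "phone ringing", "right", 3.0),
            EditStep("remove", "birdsong", "center", 0.0),
        ]
    return []

def apply_step(scene: list[str], step: EditStep) -> list[str]:
    """Stand-in for the diffusion model: carry out one edit step.
    Here the 'scene' is just a list of labels, not real audio."""
    if step.action == "add":
        return scene + [f"{step.sound}@{step.position}"]
    if step.action == "remove":
        return [s for s in scene if not s.startswith(step.sound)]
    return scene

# Start from an outdoor scene and edit it step by step.
scene = ["birdsong@center"]
for step in plan_edits("make this sound like a busy office"):
    scene = apply_step(scene, step)
print(scene)  # → ['keyboard typing@left', 'phone ringing@right']
```

Because the plan is an explicit list of steps rather than a single opaque transformation, each step can be inspected, revised, or deleted before it is executed, which is what makes the system’s edits interpretable.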
Unlike those earlier tools, SmartDJ can interpret high-level instructions and is designed for stereo audio, allowing its edits to preserve, or deliberately reshape, the spatial structure of a scene.
What’s more, the system is interpretable: users can see each step SmartDJ takes. For example, a prompt like “make this sound like a busy office” might lead SmartDJ to generate an instruction like “Add the sound of phone ringing at right by 3dB.” Users can then revise, remove or add individual steps, providing more control over the final result.
“With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” says Mingmin Zhao, Assistant Professor in Computer and Information Science (CIS) and senior author of a study presented at the 2026 International Conference on Learning Representations (ICLR). “We show that AI can help people edit audio in intuitive ways using simple language.”
Image: Sylvia Zhang, Penn Engineering