WavJourney: Compositional Audio Creation with LLMs

July 28th, 2023

WavJourney is a system that uses Large Language Models (LLMs) to generate audio content that combines diverse elements such as speech, music, and sound effects. While LLMs have already demonstrated their potential in various language and vision tasks, their application to intelligent audio content creation remains relatively unexplored.

This article describes the WavJourney system, how it works, and how it can be used in different scenarios.

Connecting Audio Models for Content Generation

At the core of WavJourney is the use of LLMs to generate structured audio scripts. An audio script is a conceptual representation of the desired auditory scene: it lists the audio elements involved and organizes them by their spatio-temporal relationships. Because the script is interpretable and can be edited interactively, people can inspect it, adjust it, and keep creative control over the result.
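To make the idea of a structured audio script more concrete, here is a minimal sketch in Python. The `AudioElement` class, its field names, and the helper functions are hypothetical illustrations and are not WavJourney's actual script format. The sketch only shows how audio elements with start times and durations can be listed and checked for temporal overlap.

```python
from dataclasses import dataclass

# Hypothetical set of element kinds, taken from the elements the article names.
ALLOWED_KINDS = {"speech", "music", "sound_effect"}


@dataclass(frozen=True)
class AudioElement:
    """One entry of an illustrative audio script (not WavJourney's real schema)."""
    kind: str        # "speech", "music", or "sound_effect"
    text: str        # spoken line or text description of the sound
    start: float     # start time in seconds
    duration: float  # length in seconds

    def __post_init__(self):
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown audio kind: {self.kind!r}")
        if self.start < 0 or self.duration <= 0:
            raise ValueError("start must be >= 0 and duration must be > 0")

    @property
    def end(self):
        return self.start + self.duration


def overlapping(script):
    """Return all pairs of elements whose time ranges overlap."""
    pairs = []
    for i, a in enumerate(script):
        for b in script[i + 1:]:
            if a.start < b.end and b.start < a.end:
                pairs.append((a, b))
    return pairs


def total_duration(script):
    """Length of the whole scene: the latest end time of any element."""
    return max((e.end for e in script), default=0.0)


# A tiny example scene: background music with a spoken line and a sound effect.
script = [
    AudioElement("music", "calm newsroom theme", 0.0, 10.0),
    AudioElement("speech", "Good evening, this is the Mars news report.", 1.0, 4.0),
    AudioElement("sound_effect", "rocket launch rumble", 6.0, 3.0),
]

print(total_duration(script))
print(len(overlapping(script)))
```

Because the script is plain data, a person can read it, reorder elements, or change a description before any audio is generated.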

The audio script is then transformed into a computer program using a script compiler, where each line of the program corresponds to either a task-specific audio generation model or a computational operation function, such as concatenation or mixing.
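The compilation step can be sketched as follows. This is an illustrative toy compiler under stated assumptions, not WavJourney's actual implementation: the script fields (`kind`, `layer`, `text`), the backend names in `MODEL_FOR_KIND`, and the rule that foreground clips are concatenated while background clips are mixed underneath are all hypothetical choices made for the example.

```python
# Hypothetical mapping from element kind to a generation backend name.
MODEL_FOR_KIND = {
    "speech": "text_to_speech",
    "music": "text_to_music",
    "sound_effect": "text_to_audio",
}


def compile_script(script):
    """Compile a list of element dicts into a flat list of instructions.

    Each instruction is a tuple (output_name, op_name, args). An instruction
    is either a call to a generation model or a computational operation
    ("concat" or "mix") over the names of earlier outputs.
    """
    program = []
    foreground, background = [], []
    for index, element in enumerate(script):
        name = f"clip_{index}"
        program.append((name, MODEL_FOR_KIND[element["kind"]], (element["text"],)))
        if element["layer"] == "foreground":
            foreground.append(name)
        else:
            background.append(name)
    # Assumed layout rule: foreground clips play one after another,
    # background clips are layered underneath the foreground track.
    program.append(("fg", "concat", tuple(foreground)))
    program.append(("out", "mix", ("fg", *background)))
    return program


script = [
    {"kind": "music", "layer": "background", "text": "calm newsroom theme"},
    {"kind": "speech", "layer": "foreground", "text": "Good evening from Mars."},
    {"kind": "speech", "layer": "foreground", "text": "Over to our reporter."},
]

program = compile_script(script)
for line in program:
    print(line)
```

Each printed line is one step of the program, which is what makes the pipeline easy to inspect: every output can be traced back to a model call or an operation.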

Executing the program produces the final audio. Because every step is an explicit model call or operation, the result is explainable and can be customized by changing individual steps.
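As a rough illustration of the execution step, the sketch below runs a small hand-written program of the kind described above. Everything here is an assumption made for the example: the generators are stand-ins that return constant sample values rather than real audio models, the sample rate is tiny so the output is easy to inspect, and the instruction format is hypothetical.

```python
RATE = 4  # samples per second; deliberately tiny for illustration


def fake_speech(text, seconds):
    """Stand-in for a text-to-speech model: constant samples at 0.5."""
    return [0.5] * (seconds * RATE)


def fake_music(text, seconds):
    """Stand-in for a text-to-music model: constant samples at 0.25."""
    return [0.25] * (seconds * RATE)


def concat(*clips):
    """Play clips one after another."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(*clips):
    """Overlay clips sample by sample; shorter clips are treated as silence at the end."""
    length = max(len(c) for c in clips)
    return [sum(c[i] for c in clips if i < len(c)) for i in range(length)]


GENERATORS = {"text_to_speech": fake_speech, "text_to_music": fake_music}
OPERATIONS = {"concat": concat, "mix": mix}


def run(program):
    """Execute instructions in order, storing each result under its output name."""
    env = {}
    for name, op, args in program:
        if op in GENERATORS:
            # Generation steps take literal arguments (prompt text, length).
            env[name] = GENERATORS[op](*args)
        else:
            # Operations take the names of earlier results.
            env[name] = OPERATIONS[op](*(env[a] for a in args))
    return env


program = [
    ("music", "text_to_music", ("calm newsroom theme", 3)),
    ("anchor", "text_to_speech", ("Good evening from Mars.", 1)),
    ("reporter", "text_to_speech", ("Over to our reporter.", 1)),
    ("fg", "concat", ("anchor", "reporter")),
    ("out", "mix", ("fg", "music")),
]

env = run(program)
print(len(env["out"]), env["out"][0], env["out"][-1])
```

Keeping every intermediate result in `env` means a single step can be regenerated or swapped out without rerunning the rest, which is one way the customization described above could be supported.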

Use Cases of WavJourney

  • Science Fiction

In this scenario, WavJourney creates an audio clip of a Mars news report in which a light-speed probe is sent to Alpha Centauri. The clip opens with a news anchor, continues with a reporter interviewing a chief engineer from the organization that built the probe, and ends with the anchor wrapping up the report.

  • Education

WavJourney generates a one-minute introduction to quantum mechanics delivered by a professor, giving a succinct audio overview of this complex subject.

  • Radio Play

WavJourney creates a romantic comedy radio drama. The story follows a couple on a date at a fine restaurant, which is interrupted by an unexpected event that completely ruins the atmosphere.

Case Study on AudioCaps Benchmark

To evaluate its performance, WavJourney is compared with state-of-the-art methods on the AudioCaps benchmark. Several audio clips generated by WavJourney are presented alongside clips generated by AudioLDM and Tango, as well as the ground-truth target clips. These side-by-side comparisons illustrate how WavJourney's output relates to that of existing methods.


Conclusion

WavJourney opens up new possibilities for intelligent audio content creation. By leveraging Large Language Models, it enables the generation of diverse audio content guided by text instructions.

The explainable and interactive design of WavJourney supports human-machine co-creation and gives users more creative control and adaptability in audio production. With applications in scenarios such as science fiction, education, and radio plays, WavJourney points to a promising direction in multimedia content creation.