A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation

Abstract

Neural codec language models achieve impressive zero-shot Text-to-Speech (TTS) by fully imitating the acoustic characteristics of a short speech prompt, including timbre, prosody, and paralinguistic information. However, such holistic imitation limits their ability to isolate and control individual attributes. In this paper, we present SpeechEdit, a unified codec language model that extends zero-shot TTS with a selective control mechanism. By default, SpeechEdit reproduces the complete acoustic profile inferred from the speech prompt, and it overrides only the attributes specified by explicit control instructions. To enable controllable modeling, SpeechEdit is trained on our newly constructed LibriEdit dataset, which provides delta (difference-aware) training pairs derived from LibriHeavy. Experimental results show that our approach maintains naturalness and robustness while offering flexible and localized control over desired attributes.

Overview of the SpeechEdit framework. Instruction tokens, textual content, and acoustic prompts are unified into a single token sequence through an instruction-guided conditioning interface. The codec language model performs selective attribute editing through data-driven implicit disentanglement with delta pairs.
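As a rough illustration of the conditioning interface described above, the sketch below assembles instruction tokens, text tokens, and acoustic prompt codes into one flat sequence. It is not the SpeechEdit implementation: the marker tokens (`<inst>`, `<text>`, `<prompt>`), the `<name:value>` instruction format, the attribute list, the segment order, and the function names are all assumptions made for illustration. The one behavior taken from the description is that an attribute without an instruction contributes no token, so it would be inherited from the speech prompt.

```python
from typing import Dict, List, Sequence

# Assumed attribute set, mirroring the emotion/speed/pitch/energy tags used on this page.
EDITABLE_ATTRIBUTES = ("emotion", "speed", "pitch", "energy")


def instruction_tokens(overrides: Dict[str, str]) -> List[str]:
    """Turn explicit overrides into hypothetical instruction tokens.

    Attributes missing from `overrides` yield no token at all, which stands
    for the default of copying that attribute from the speech prompt.
    """
    unknown = set(overrides) - set(EDITABLE_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unsupported attributes: {sorted(unknown)}")
    # Fixed attribute order, so equal overrides always give equal tokens.
    return [f"<{name}:{overrides[name]}>"
            for name in EDITABLE_ATTRIBUTES if name in overrides]


def build_sequence(overrides: Dict[str, str],
                   text_tokens: Sequence[str],
                   prompt_codes: Sequence[int]) -> List[str]:
    """Concatenate the three segments into a single token sequence."""
    return (["<inst>"] + instruction_tokens(overrides)
            + ["<text>"] + list(text_tokens)
            + ["<prompt>"] + [f"<a{code}>" for code in prompt_codes])


# Override only the emotion; speed, pitch, and energy follow the prompt.
edited = build_sequence({"emotion": "happy"}, ["hel", "lo"], [12, 7])
# No overrides: plain zero-shot imitation of the prompt.
plain = build_sequence({}, ["hel", "lo"], [12, 7])
```

In this sketch the two sequences differ only in the instruction segment, which is the sense in which the control is selective: everything not named in an instruction is left to the prompt.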

Audio Samples of Emotion Editing Tasks

The Easy Task uses neutral prompts, which present no emotional conflict with the target emotion.

| Prompt Audio | Generate Text | Target Emotion | Step Audio EditX | SpeechEdit |
| --- | --- | --- | --- | --- |
| Neutral | This makes me so happy for you, I am glad you are getting the right kind of support you need! | Happy | iteration_2 | demo 1, demo 2 |
|  | Absolutely unbelievable! You people should be ashamed of yourselves. | Angry | iteration_2 | demo 1, demo 2 |
|  | I don't know. My life is a big mess. Everything is so complicated. | Sad | iteration_2 | demo 1, demo 2 |
|  | Is that true? He doesn't look like a guy who'd ever cheat on his wife, does he? | Surprise | iteration_2 | demo 1, demo 2 |
| Neutral | I'm a guy and it's the same for me! I'm glad I'm not the only one who feels this way. | Happy | iteration_2 | demo 1, demo 2 |
|  | Well could you do me the favor of making this quick? It's the third quarter and you've been blabbering on since the first! | Angry | iteration_2 | demo 1, demo 2 |
|  | But I... I still love him! And it's all my fault! I can't believe how immature and selfish I was being. I mean, he is a firefighter, it's not like he can just leave someone in a burning building and meet me for dinner. I've totally messed this up! | Sad | iteration_2 | demo 1, demo 2 |
|  | Oh, really? I know that telephone signal must have been shielded in the elevator shaft, so what did you do then? | Surprise | iteration_2 | demo 1, demo 2 |

The Hard Task edits toward the target emotion when the acoustic prompt carries a conflicting emotion.

| Prompt Audio | Generate Text | Target Emotion | CosyVoice 2 / IndexTTS 2 / SpeechEdit |
| --- | --- | --- | --- |
| Surprise | "You cads, you brutes!" he shouted, trampling on the fragments. | Angry | demo |
| Angry | "Your receptions are always delightful," she said to that lady. | Happy | demo |
| Angry | But the period of my dissipation would end, and I always felt very sick afterwards. | Sad | demo |
| Happy | He is very poor, you know, and cannot afford to go to college just yet. | Sad | demo |
| Happy | The law, what would the law do but protect him and make me an outcast? | Angry | demo |
| Sad | And it will be a great pleasure to us to pay Toby Clark's salary as your clerk until you become prosperous enough to pay it yourself. | Happy | demo |
| Angry | But we're sick of prayers and providence we're going to do without. | Sad | demo |
| Happy | Our trade rivals are getting ahead of us. The whisper goes round Rossiter and Smith are talking not working. | Angry | demo |
| Sad | "Oh!" cried Phoebe, scenting a clew at last. | Surprise | demo |

Audio Samples of Feature Editing Task

For each prosodic feature, we generate paired speech samples under low and high control instructions using the same transcription.

| Prompt Audio | Generate Text | Tag | SpeechEdit |
| --- | --- | --- | --- |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | speed-high |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | speed-low |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | pitch-high |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | pitch-low |  |
|  | Every subtle sound blended into a graceful rhythm that shaped the atmosphere with quiet elegance. | energy-high |  |
|  | Every subtle sound blended into a graceful rhythm that shaped the atmosphere with quiet elegance. | energy-low |  |
|  | The gentle motion of the day created a flowing balance that invited reflection and steady comfort. | speed-high |  |
|  | The gentle motion of the day created a flowing balance that invited reflection and steady comfort. | speed-low |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | pitch-high |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | pitch-low |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | energy-high |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | energy-low |  |
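The low/high pairing used for the feature editing samples can be sketched as below. This is only an illustration of how such request pairs might be enumerated: the tag strings follow the `feature-level` form shown in the Tag column, while the function name `paired_requests` and the tuple layout are hypothetical and not taken from the SpeechEdit code.

```python
from itertools import product
from typing import List, Sequence, Tuple

FEATURES = ("speed", "pitch", "energy")  # prosodic features shown on this page
LEVELS = ("high", "low")                 # the two control levels per feature


def paired_requests(text: str,
                    features: Sequence[str] = FEATURES) -> List[Tuple[str, str]]:
    """Return (text, tag) requests, one high/low pair per feature.

    Both members of a pair share the same transcription, so any audible
    difference between them can be attributed to the control tag.
    """
    return [(text, f"{feature}-{level}")
            for feature, level in product(features, LEVELS)]


requests = paired_requests("The world felt unusually still.", ("speed",))
```

Holding the transcription fixed within each pair is the point of the design: the two outputs are meant to differ only in the instructed feature.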

Audio Samples of Voice Conversion Task

The converted speech preserves the linguistic content of the reference speech while matching the speaker characteristics of the target speech.

| Prompt Audio | Target Speaker | SpeechEdit |
| --- | --- | --- |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |