A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation

Abstract

Neural codec language models achieve impressive zero-shot Text-to-Speech (TTS) by fully imitating the acoustic characteristics of a short speech prompt, including timbre, prosody, and paralinguistic information. However, such holistic imitation limits their ability to isolate and control individual attributes. In this paper, we present SpeechEdit, a unified codec language model that extends zero-shot TTS with a selective control mechanism. By default, SpeechEdit reproduces the complete acoustic profile inferred from the speech prompt; when explicit control instructions are given, it overrides only the attributes they specify. To enable controllable modeling, SpeechEdit is trained on our newly constructed LibriEdit dataset, which provides delta (difference-aware) training pairs derived from LibriHeavy. Experimental results show that our approach maintains naturalness and robustness while offering flexible and localized control over desired attributes.

SpeechEdit System Architecture

Overview of the SpeechEdit framework. Instruction tokens, textual content, and acoustic prompts are unified into a single token sequence through an instruction-guided conditioning interface. The codec language model performs selective attribute editing through data-driven implicit disentanglement with delta pairs.
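
As a rough illustration of the conditioning interface described above, the sketch below joins instruction tokens, text tokens, and acoustic prompt tokens into one flat sequence. The segment markers (`<inst>`, `<text>`, `<prompt>`, `<gen>`), the function name `build_sequence`, the instruction string `emotion=happy`, and the segment order are all assumptions made for illustration; this page does not specify the actual token layout.

```python
from typing import List, Optional

# Hypothetical segment markers; the real vocabulary is not given on this page.
INST, TEXT, PROMPT, GEN = "<inst>", "<text>", "<prompt>", "<gen>"

def build_sequence(text_tokens: List[str],
                   prompt_codes: List[int],
                   instructions: Optional[List[str]] = None) -> List[str]:
    """Concatenate the three conditioning streams into one token sequence.

    With no instructions, the sequence carries only the text and the
    acoustic prompt, which corresponds to plain imitation of the prompt.
    """
    seq: List[str] = []
    if instructions:
        seq.append(INST)
        seq.extend(instructions)
    seq.append(TEXT)
    seq.extend(text_tokens)
    seq.append(PROMPT)
    seq.extend(f"<c{code}>" for code in prompt_codes)  # codec codes as tokens
    seq.append(GEN)  # a model would continue with codec tokens from here
    return seq

# Default behaviour: no instruction segment at all.
default_seq = build_sequence(["hel", "lo"], [12, 7])
# Selective editing: the same inputs plus one (assumed) instruction token.
edited_seq = build_sequence(["hel", "lo"], [12, 7], ["emotion=happy"])
```

In this sketch the edited sequence differs from the default one only by the leading instruction segment, which mirrors the idea that attributes without an instruction are left to the prompt.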

Audio Samples of Emotion Editing Tasks

The Easy Task uses neutral prompts, which present no emotional conflict with the target emotion.
| Prompt Audio | Generate Text | Target Emotion | Step Audio EditX | SpeechEdit | SpeechEdit |
| --- | --- | --- | --- | --- | --- |
| Neutral | This makes me so happy for you, I am glad you are getting the right kind of support you need! | Happy | iteration_2 | demo 1 | demo 2 |
|  | Absolutely unbelievable! You people should be ashamed of yourselves. | Angry | iteration_2 | demo 1 | demo 2 |
|  | I don't know. My life is a big mess. Everything is so complicated. | Sad | iteration_2 | demo 1 | demo 2 |
|  | Is that true? He doesn't look like a guy who'd ever cheat on his wife, does he? | Surprise | iteration_2 | demo 1 | demo 2 |
| Neutral | I'm a guy and it's the same for me! I'm glad I'm not the only one who feels this way. | Happy | iteration_2 | demo 1 | demo 2 |
|  | Well could you do me the favor of making this quick? It's the third quarter and you've been blabbering on since the first! | Angry | iteration_2 | demo 1 | demo 2 |
|  | But I... I still love him! And it's all my fault! I can't believe how immature and selfish I was being. I mean, he is a firefighter, it's not like he can just leave someone in a burning building and meet me for dinner. I've totally messed this up! | Sad | iteration_2 | demo 1 | demo 2 |
|  | Oh, really? I know that telephone signal must have been shielded in the elevator shaft, so what did you do then? | Surprise | iteration_2 | demo 1 | demo 2 |
The Hard Task edits toward the target emotion when the acoustic prompt carries a conflicting emotion.
| Prompt Audio | Generate Text | Target Emotion | CosyVoice 2 | IndexTTS 2 | SpeechEdit |
| --- | --- | --- | --- | --- | --- |
| Surprise | "You cads, you brutes!" he shouted, trampling on the fragments. | Angry | demo | demo | demo |
| Angry | “Your receptions are always delightful,” she said to that lady. | Happy | demo | demo | demo |
| Angry | But the period of my dissipation would end, and I always felt very sick afterwards. | Sad | demo | demo | demo |
| Happy | He is very poor, you know, and cannot afford to go to college just yet. | Sad | demo | demo | demo |
| Happy | The law, what would the law do but protect him and make me an outcast? | Angry | demo | demo | demo |
| Sad | And it will be a great pleasure to us to pay Toby Clark's salary as your clerk until you become prosperous enough to pay it yourself. | Happy | demo | demo | demo |
| Angry | But we're sick of prayers and providence we're going to do without. | Sad | demo | demo | demo |
| Happy | Our trade rivals are getting ahead of us. The whisper goes round Rossiter and Smith are talking not working. | Angry | demo | demo | demo |
| Sad | "Oh!" cried Phoebe, scenting a clew at last. | Surprise | demo | demo | demo |

Audio Samples of Feature Editing Task

For each prosodic feature, we generate paired speech samples under low and high control instructions using the same transcription.

| Prompt Audio | Generate Text | Tag | SpeechEdit |
| --- | --- | --- | --- |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | speed-high |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | speed-low |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | pitch-high |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | pitch-low |  |
|  | Every subtle sound blended into a graceful rhythm that shaped the atmosphere with quiet elegance. | energy-high |  |
|  | Every subtle sound blended into a graceful rhythm that shaped the atmosphere with quiet elegance. | energy-low |  |
|  | The gentle motion of the day created a flowing balance that invited reflection and steady comfort. | speed-high |  |
|  | The gentle motion of the day created a flowing balance that invited reflection and steady comfort. | speed-low |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | pitch-high |  |
|  | The world felt unusually still, as if time paused to appreciate the delicate harmony surrounding us. | pitch-low |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | energy-high |  |
|  | A serene moment unfolded as the evening sky settled into warm colors that whispered peaceful stories. | energy-low |  |
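
The high/low pairing used in the samples above can be sketched as follows. The tag strings (`speed-high`, `pitch-low`, and so on) are taken from the Tag column; the request dictionary and the helper name `paired_requests` are hypothetical and only illustrate how the same transcription is reused for both control levels.

```python
from typing import Dict, List

FEATURES = ("speed", "pitch", "energy")  # features that appear in the Tag column
LEVELS = ("high", "low")                 # the two control levels per feature

def paired_requests(text: str, feature: str) -> List[Dict[str, str]]:
    """Return one request per level, sharing the same text and feature."""
    if feature not in FEATURES:
        raise ValueError(f"unknown feature: {feature}")
    # The request format is an assumption; only the tag strings come from the page.
    return [{"text": text, "tag": f"{feature}-{level}"} for level in LEVELS]

text = ("The world felt unusually still, as if time paused to appreciate "
        "the delicate harmony surrounding us.")
pair = paired_requests(text, "speed")
tags = [request["tag"] for request in pair]
```

Holding the text fixed within each pair means that any audible difference between the two samples can be attributed to the control tag rather than to the content.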

Audio Samples of Voice Conversion Task

The converted speech preserves the linguistic content of the reference speech while matching the speaker characteristics of the target speech.

| Prompt Audio | Target Speaker | SpeechEdit |
| --- | --- | --- |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |
| Reference | Target | Converted |