Abstract

One-shot voice conversion (VC) has attracted increasing attention because of its broad prospects for practical application. The task hinges on two factors: the representation ability of the speech features and the generalization of the model. This paper proposes CLESSR-VC, a one-shot VC model that enhances pre-trained self-supervised learning (SSL) representations through contrastive learning. First, to ensure the model's generalization, SSL features from the 23rd and 9th layers of the pre-trained WavLM are used to extract the content embedding and the SSL speaker embedding, respectively. Then, conventional mel-spectrogram acoustic features and contrastive learning are introduced to enhance the representation ability of the speech features. Specifically, contrastive learning combined with pitch-shift augmentation is applied to disentangle content information from the SSL features accurately, and mel-spectrograms are used to extract a mel speaker embedding. AM-Softmax and cross-architecture contrastive learning are applied between the SSL and mel speaker embeddings to obtain a fused speaker embedding that helps improve speech quality and speaker similarity. Both objective and subjective evaluation results on the VCTK corpus confirm that the proposed VC model achieves outstanding performance with few trainable parameters.
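
As a rough illustration of the contrastive objective mentioned in the abstract, the sketch below computes an InfoNCE-style loss between two views of the same batch, for example SSL and mel speaker embeddings of the same utterances. It is not the authors' implementation: the function names, the temperature of 0.1, and the toy two-dimensional embeddings are all assumptions made for the example, and the abstract does not state the exact loss used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] form the positive pair.

    Every other row of view_b acts as a negative for view_a[i].
    The temperature is an assumed value, not taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two utterances, two views each.
ssl_emb = [[1.0, 0.0], [0.0, 1.0]]
mel_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row matches its SSL partner
mel_swapped = [[0.1, 0.9], [0.9, 0.1]]   # rows paired with the wrong speaker

loss_aligned = info_nce(ssl_emb, mel_aligned)
loss_swapped = info_nce(ssl_emb, mel_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when matching rows of the two views agree and large when they are mismatched, which is the property a contrastive objective relies on to pull the two views of the same utterance together.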

Proposed Model

Fig.1: The overall architecture of the proposed model.

Converted samples

Male to Female conversion

Text: Some have accepted it as a miracle without physical explanation.
Audio samples: Source, Target, Again-VC, BetaVAE-VC, SRD-VC, TriAAN-VC, S2VC, LIMI-VC, Proposed

Male to Male conversion

Text: Some have accepted it as a miracle without physical explanation.
Audio samples: Source, Target, Again-VC, BetaVAE-VC, SRD-VC, TriAAN-VC, S2VC, LIMI-VC, Proposed

Female to Male conversion

Text: The tournament has now been cancelled.
Audio samples: Source, Target, Again-VC, BetaVAE-VC, SRD-VC, TriAAN-VC, S2VC, LIMI-VC, Proposed

Female to Female conversion

Text: Otherwise, it will cost a lot of money.
Audio samples: Source, Target, Again-VC, BetaVAE-VC, SRD-VC, TriAAN-VC, S2VC, LIMI-VC, Proposed

Ablation Study

Male to Female conversion

Text: There is, according to legend, a boiling pot of gold at one end.
Audio samples: Source, Target, and each method below
ASR results by method:
Proposed: there is according to legend a boiling pot of gold at one end
w/o content CL: there is according to them a boiling pot of gold and wine
w/o mel speaker CL: there is according to legend a boiling pot of gold at one end
w/o SSL speaker CL: there is according to legend a boiling pot of gold at one end

Male to Male conversion

Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
Audio samples: Source, Target, and each method below
ASR results by method:
Proposed: when the sunlight strikes rain drops in the air they act as a prism and form a rainbow
w/o content CL: when the sunlight strikes wing out in the air the act is imprisoned in the former wing
w/o mel speaker CL: when the sunlight strikes rain drops in the air they act as a prism and form a rainbow
w/o SSL speaker CL: when the sunlight strikes raindrops in the air they act as a prism and form a rainbow

Female to Male conversion

Text: We have always maintained that we would never abandon Scottish football.
Audio samples: Source, Target, and each method below
ASR results by method:
Proposed: we have always maintained that we would never abandon scottish football
w/o content CL: we have always maintained that we would never have handled such a football
w/o mel speaker CL: we have always maintained that we would never abandon scottish football
w/o SSL speaker CL: we have always maintained that we would never abandon scottish football

Female to Female conversion

Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
Audio samples: Source, Target, and each method below
ASR results by method:
Proposed: six boons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother bob
w/o content CL: six pounds of fresh lopes five thick slabs of blue cheese and a sack for her brother bob
w/o mel speaker CL: six boons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother bob
w/o SSL speaker CL: six beans of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother bob
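
The ASR transcripts in the ablation study can be compared with the reference text numerically, for instance with word error rate (WER). The sketch below is only an illustration: this page reports transcripts rather than scores, so the choice of WER, the text normalization (lowercasing and stripping punctuation), and the function names are assumptions. The two hypotheses are copied from the first ablation example above.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "There is, according to legend, a boiling pot of gold at one end."
proposed = "there is according to legend a boiling pot of gold at one end"
without_content_cl = "there is according to them a boiling pot of gold and wine"

print(word_error_rate(reference, proposed))
print(word_error_rate(reference, without_content_cl))
```

With this simple normalization, spelling variants such as "rain drops" versus "raindrops" would be counted as errors, so a real evaluation would likely need a more careful text normalizer.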