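I have recently been running the WaveRNN implementation on GitHub (fatchord/WaveRNN: WaveRNN Vocoder + TTS), and this post records what I learned along the way.
After downloading the project from GitHub, getting the model to run is straightforward, and I did not hit any problems.
Training your own model also works: following the README gets you through with essentially no trouble.
These notes mainly cover my understanding of the Tacotron model code in the project.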
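Put simply, Tacotron is an end-to-end TTS model: you supply the text, and the model produces speech. (In this project, Tacotron predicts a mel spectrogram, and the WaveRNN vocoder turns it into a waveform.)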
Next, let's walk through the model step by step, following the order of the figure above.
Encoder
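The first step is the character embedding: every character is turned into a vector, because a neural network cannot work with characters directly; it only operates on numeric arrays.
The corresponding code in the project is: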
    # Inside the Tacotron model's constructor
    self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
                           encoder_K, num_highways, dropout)

class Encoder(nn.Module):
    def __init__(self, embed_dims, num_chars, cbhg_channels, K, num_highways, dropout):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, embed_dims)
        self.pre_net = PreNet(embed_dims)
        self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
                         proj_channels=[cbhg_channels, cbhg_channels],
                         num_highways=num_highways)

    def forward(self, x):
        x = self.embedding(x)   # (batch, seq_len) -> (batch, seq_len, embed_dims)
        x = self.pre_net(x)     # (batch, seq_len, embed_dims) -> (batch, seq_len, 128)
        x.transpose_(1, 2)      # in place: (batch, 128, seq_len), the layout Conv1d expects
        x = self.cbhg(x)
        return x
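To make the character embedding step concrete, here is a minimal sketch of turning a string into indices and then into vectors. The symbol set, the `text_to_ids` helper, and the embedding size are assumptions made up for illustration; the project defines its own symbol list and text cleaning.

```python
import torch
import torch.nn as nn

# Hypothetical toy symbol set; the project uses its own symbol list.
symbols = "_abcdefghijklmnopqrstuvwxyz "
char_to_id = {ch: i for i, ch in enumerate(symbols)}

def text_to_ids(text):
    """Map each known character to its integer index, skipping unknown ones."""
    return [char_to_id[ch] for ch in text.lower() if ch in char_to_id]

embed_dims = 256  # assumed value, for illustration only
embedding = nn.Embedding(num_embeddings=len(symbols), embedding_dim=embed_dims)

ids = torch.tensor([text_to_ids("hello world")])  # shape: (batch=1, seq_len=11)
vectors = embedding(ids)                           # shape: (1, 11, embed_dims)
print(ids.shape, vectors.shape)
```

Each row of the embedding table is a trainable vector, so the representation of every character is learned together with the rest of the model.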
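Once we have the embedding vectors, they go through the pre-net module: a small three-layer network with two hidden layers that applies a series of non-linear transformations to its input, which helps the model converge and generalize better.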
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dims, fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.p = dropout

    def forward(self, x):
        # First hidden layer: linear -> ReLU -> dropout
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, self.p, training=self.training)
        # Second hidden layer: linear -> ReLU -> dropout
        x = self.fc2(x)
        x = F.relu(x)
        x = F.dropout(x, self.p, training=self.training)
        return x
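As you can see, the network uses ReLU activations and a dropout rate of 0.5; the first layer has 256 units and the second has 128. Because dropout is called with training=self.training, it is only active while the module is in training mode.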
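The effect of the training flag can be checked with a small stand-in. This sketch rebuilds the same layer sizes from standard `nn.Sequential` building blocks rather than reusing the project's class, and the input size of 256 is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the pre-net: in_dims -> 256 -> 128 with ReLU and dropout 0.5.
in_dims = 256  # assumed embedding size
pre_net = nn.Sequential(
    nn.Linear(in_dims, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),
)

x = torch.randn(2, 10, in_dims)  # (batch, seq_len, in_dims)

pre_net.train()
train_a, train_b = pre_net(x), pre_net(x)    # a fresh dropout mask on every call

pre_net.eval()
with torch.no_grad():
    eval_a, eval_b = pre_net(x), pre_net(x)  # dropout disabled, so deterministic

print(train_a.shape)  # torch.Size([2, 10, 128])
print("train outputs identical:", torch.equal(train_a, train_b))
print("eval outputs identical:", torch.equal(eval_a, eval_b))
```

The linear layers act on the last dimension only, so the same pre-net is applied independently to every character position.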
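The pre-net output is then fed into the CBHG module (a 1-D convolution bank, followed by highway layers and a bidirectional GRU), which is mainly there to improve the model's ability to generalize.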
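Each highway layer mixes a transformed version of its input with the input itself through a gate:
output = output1 * output2 + input * (1 - output2)
Here output1 is the transformed input and output2 is the gate, with values between 0 and 1.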
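The formula above can be sketched as a small PyTorch module. The class name `HighwaySketch`, the layer names, and the choice of ReLU for the transform branch are assumptions for illustration, not necessarily what the project's HighwayNetwork does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighwaySketch(nn.Module):
    """Illustrative highway layer (hypothetical name, not the project's class)."""

    def __init__(self, size):
        super().__init__()
        self.transform = nn.Linear(size, size)  # produces output1
        self.gate = nn.Linear(size, size)       # produces output2 (before the sigmoid)

    def forward(self, x):
        output1 = F.relu(self.transform(x))
        output2 = torch.sigmoid(self.gate(x))   # gate values lie in (0, 1)
        return output1 * output2 + x * (1.0 - output2)

layer = HighwaySketch(128)
x = torch.randn(2, 10, 128)
y = layer(x)
print(y.shape)  # torch.Size([2, 10, 128])
```

When the gate is close to 0 the layer passes its input through almost unchanged, and when it is close to 1 the layer behaves like an ordinary fully connected layer.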
After the highway layers comes a bidirectional GRU, and the GRU's output is the output of the encoder.
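As a quick check of what the bidirectional GRU returns, the sketch below uses the same constructor arguments as the project's CBHG; the channel count of 128 and the input sizes are assumed values.

```python
import torch
import torch.nn as nn

channels = 128  # assumed value, for illustration only
rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)

x = torch.randn(2, 10, channels)  # (batch, seq_len, channels)
outputs, h_n = rnn(x)

print(outputs.shape)  # torch.Size([2, 10, 256])
print(h_n.shape)      # torch.Size([2, 2, 128])
```

The forward and backward hidden states are concatenated at every time step, so the last dimension of the output is twice the number of channels.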
Code for the CBHG constructor:
class CBHG(nn.Module):
    def __init__(self, K, in_channels, channels, proj_channels, num_highways):
        super().__init__()

        # List of all rnns to call `flatten_parameters()` on
        self._to_flatten = []

        self.bank_kernels = [i for i in range(1, K + 1)]
        self.conv1d_bank = nn.ModuleList()
        for k in self.bank_kernels:
            conv = BatchNormConv(in_channels, channels, k)
            self.conv1d_bank.append(conv)

        self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

        self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
        self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)

        # Fix the highway input if necessary
        if proj_channels[-1] != channels:
            self.highway_mismatch = True
            self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
        else:
            self.highway_mismatch = False

        self.highways = nn.ModuleList()
        for i in range(num_highways):
            hn = HighwayNetwork(channels)
            self.highways.append(hn)

        self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
        self._to_flatten.append(self.rnn)

        # Avoid fragmentation of RNN parameters and associated warning
        self._flatten_parameters()
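The constructor only declares the layers, so here is a rough sketch of how a convolution bank with kernel sizes 1 to K and the max-pool could be applied to a sequence. It uses plain `nn.Conv1d` layers without batch normalization, and the `k // 2` padding and the trimming back to `seq_len` are assumptions; the project's BatchNormConv and CBHG forward pass may differ.

```python
import torch
import torch.nn as nn

K, channels, seq_len = 4, 8, 10  # small illustrative sizes

# Assumption: each bank convolution pads by k // 2, so even kernel sizes
# produce seq_len + 1 frames and are trimmed back to seq_len.
bank = nn.ModuleList(
    [nn.Conv1d(channels, channels, kernel_size=k, padding=k // 2) for k in range(1, K + 1)]
)
maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

x = torch.randn(2, channels, seq_len)  # (batch, channels, seq_len)

# Run every kernel size over the same input and stack the results along channels.
bank_out = torch.cat([conv(x)[:, :, :seq_len] for conv in bank], dim=1)

# With kernel_size=2, stride=1, padding=1 the pool returns seq_len + 1 frames; trim one.
pooled = maxpool(bank_out)[:, :, :seq_len]

print(bank_out.shape)  # torch.Size([2, 32, 10])
print(pooled.shape)    # torch.Size([2, 32, 10])
```

The concatenated bank has K * channels channels, which matches the input size of conv_project1 in the constructor above (len(self.bank_kernels) * channels).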