A PyTorch speech recognition framework for converting speech to text
patter
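A speech-to-text framework in PyTorch, with initial support for the DeepSpeech2 architecture (and variants of it).
Features
File-based configuration of corpus definitions, model architecture, and training settings, for reproducibility
Highly configurable DeepSpeech model
Various RNN types (RNN, LSTM, GRU) and sizes (layers/hidden units)
Various activation functions (Clipped ReLU, Swish)
Forward-only RNN with Lookahead (for streaming) or bidirectional RNN
Configurable CNN frontend
Optional batchnorm
Optional RNN weight noise
Beam decoder with KenLM support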
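Dataset augmentation, with support for:
Speed perturbation
Gain perturbation
Shift (in time) perturbation
Noise addition (at a random SNR)
Impulse response perturbation
Tensorboard integration
gRPC-based model server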
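Installation
Two dependencies must be installed manually:
SeanNaren/warp-ctc and the PyTorch bindings included in that repo
parlance/ctcdecode, a CTC beam decoder with language model support
Once these dependencies are installed, patter can be installed by running python setup.py install. For debugging and development, install patter with python setup.py develop instead.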
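Dataset Definition
Datasets for patter are defined with json-lines files, that is, files of newline-delimited json objects. Each line contains one json object that defines an utterance's audio path, transcription path, and duration in seconds.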
{"audio_filepath": "/path/to/utterance1.wav", "text_filepath": "/path/", "duration": 23.147}
{"audio_filepath": "/path/to/utterance2.wav", "text_filepath": "/path/", "duration": 18.251}
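A manifest of this shape can be written and read back with a few lines of Python. The following is a minimal sketch using only the standard library; the helper names `write_manifest` and `read_manifest`, and the example paths and durations, are hypothetical and are not part of patter's API.

```python
import json
import os
import tempfile


def write_manifest(path, utterances):
    """Write one JSON object per line (json-lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for utt in utterances:
            f.write(json.dumps(utt) + "\n")


def read_manifest(path):
    """Read a json-lines manifest, skipping blank lines."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


# Hypothetical paths and durations, for illustration only.
example_utterances = [
    {"audio_filepath": "/data/a.wav", "text_filepath": "/data/a.txt", "duration": 2.5},
    {"audio_filepath": "/data/b.wav", "text_filepath": "/data/b.txt", "duration": 4.0},
]

manifest_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_manifest(manifest_path, example_utterances)
loaded = read_manifest(manifest_path)
total_seconds = sum(entry["duration"] for entry in loaded)
```

Keeping one object per line means a large corpus can be streamed or filtered line by line without parsing the whole file at once.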
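Training
Patter includes a top-level trainer script that calls the underlying library methods for training. To use the built-in command-line trainer, three files must be defined: a corpus configuration, a model configuration, and a training configuration. An example of each is provided below.
Corpus Configuration
The corpus configuration file specifies the training and validation sets in the corpus, as well as any augmentation that should be applied to the audio. See the example configuration below for further documentation of these options.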
# Filter the audio configured in the `datasets` below to be within min and max duration. Remove min or max (or both) to
# do no filtering
min_duration = 1.0
max_duration = 17.0
# Link to manifest files (as described above) of the training and validation sets. A future release will allow multiple
# files to be specified for merging corpora on the fly. If `augment` is true, each audio will be passed through the
# augmentation pipeline specified below. Valid names for the datasets are in the set ["train", "val"]
[[dataset]]
name = "train"
manifest = "/path/to/corpora/train.json"
augment = true
[[dataset]]
name = "val"
manifest = "/path/to/corpora/val.json"
augment = false
# Optional augmentation pipeline. If specified, audio from a dataset with the augment flag set to true will be passed
# through each augmentation, in order. Each augmentation must minimally specify the type and a probability. The
# probability indicates that the augmentation will run on a given audio file with that probability
# The noise augmentation mixes audio from a dataset of noise files with a random SNR drawn from within the range specified.
[[augmentation]]
type = "noise"
prob = 0.0
[augmentation.config]
manifest = "/path/to/noise_manifest.json"
min_snr_db = 3
max_snr_db = 35
# The impulse augmentation applies a random impulse response drawn from the manifest to the audio
[[augmentation]]
type = "impulse"
prob = 0.0
[augmentation.config]
manifest = "/path/to/impulse_manifest.json"
# The speed augmentation applies a random speed perturbation without altering pitch
[[augmentation]]
type = "speed"
prob = 1.0
[augmentation.config]
min_speed_rate = 0.95
max_speed_rate = 1.05
# The shift augmentation simply adds a random amount of silence to the audio or removes some of the initial audio
[[augmentation]]
type = "shift"
prob = 1.0
[augmentation.config]
min_shift_ms = -5
max_shift_ms = 5
# The gain augmentation modifies the gain of the audio by a fixed amount randomly chosen within the specified range
[[augmentation]]
type = "gain"
prob = 1.0
[augmentation.config]
min_gain_dbfs = -10
max_gain_dbfs = 10
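The comments in the configuration above describe an ordered pipeline in which each augmentation runs with its own probability. A minimal sketch of that idea follows. It is not patter's implementation: the function names are hypothetical, audio is represented as a plain list of float samples, the gain range is interpreted as a relative gain in decibels, and a 16 kHz sample rate is assumed for the shift.

```python
import random


def gain(samples, rng, min_gain_dbfs, max_gain_dbfs):
    """Scale the signal by a random gain drawn in decibels (assumed relative gain)."""
    gain_db = rng.uniform(min_gain_dbfs, max_gain_dbfs)
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


def shift(samples, rng, min_shift_ms, max_shift_ms, sample_rate=16000):
    """Prepend silence (positive shift) or drop initial audio (negative shift)."""
    shift_ms = rng.uniform(min_shift_ms, max_shift_ms)
    n = int(abs(shift_ms) * sample_rate / 1000)
    if shift_ms >= 0:
        return [0.0] * n + samples
    return samples[n:]


# Hypothetical registry mapping the `type` field to a function.
AUGMENTATIONS = {"gain": gain, "shift": shift}


def run_pipeline(samples, pipeline, rng):
    """Apply each augmentation in order, each with its own probability."""
    for step in pipeline:
        if rng.random() < step["prob"]:
            samples = AUGMENTATIONS[step["type"]](samples, rng, **step["config"])
    return samples


pipeline = [
    {"type": "shift", "prob": 1.0, "config": {"min_shift_ms": -5, "max_shift_ms": 5}},
    {"type": "gain", "prob": 1.0, "config": {"min_gain_dbfs": -10, "max_gain_dbfs": 10}},
]
rng = random.Random(0)
augmented = run_pipeline([0.0] * 16000, pipeline, rng)
```

Because `rng.random()` returns a value in [0, 1), a step with `prob = 0.0` never runs and a step with `prob = 1.0` always runs, which matches how the example configuration disables the noise and impulse augmentations.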
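Model Configuration
At this time, patter supports only variants of the DeepSpeech 2 and DeepSpeech 3 architectures (DeepSpeech 3 being the same as DS2 without BatchNorm, plus weight noise). Additional model architectures, including novel ones, may be included in future releases. To configure the architecture and hyperparameters, define the model in a TOML configuration file. See the example: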
# model class - only DeepSpeechOptim currently
model = "DeepSpeechOptim"
# define input features/windowing. Currently only STFT is supported, but window is configurable.
[input]
type = "stft"
normalize = true
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hamming"
# Define layers of [2d CNN -> Activation -> Optional BatchNorm] as a frontend
[[cnn]]
filters = 32
kernel = [41, 11]
stride = [2, 2]
padding = [0, 10]
batch_norm = true
activation = "hardtanh"
activation_params = [0, 20]
[[cnn]]
filters = 32
kernel = [21, 11]
stride = [2, 1]
padding = [0, 2]
batch_norm = true
activation = "hardtanh"
activation_params = [0, 20]
# Configure the RNN. Currently LSTM, GRU, and RNN are supported. QRNN will be added for forward-only models in a future release
[rnn]
type = "lstm"
bidirectional = true
size = 512
layers = 4
batch_norm = true
# DS3 suggests using weight noise instead of batch norm, only set when rnn batch_norm = false
#[rnn.noise]
#mean=0.0
#std=0.001
# only used/necessary when rnn bidirectional = false
#[context]
#context = 20
#activation = "swish"
# Set of labels for model to predict. Specifying a label for the CTC 'blank' symbol is not required and handled automatically
[labels]
labels = [
"'", "A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M", "N", "O", "P", "Q",
"R", "S", "T", "U", "V", "W", "X", "Y", "Z", " ",
]
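To see how the `[input]` and `[[cnn]]` settings above interact, it can help to work out the tensor sizes by hand. The sketch below applies the standard 2D convolution output-size formula to the example values. It rests on several assumptions that the configuration does not state: the FFT size equals the window length, the STFT is one-sided, the first element of `kernel`, `stride`, and `padding` is the frequency axis and the second is the time axis, and the 100-frame input is an arbitrary example.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


sample_rate = 16000
window_size = 0.02    # seconds
window_stride = 0.01  # seconds

win_length = round(sample_rate * window_size)    # samples per window
hop_length = round(sample_rate * window_stride)  # samples between windows
freq_bins = win_length // 2 + 1                  # one-sided STFT, n_fft == win_length (assumed)

# Values copied from the example [[cnn]] tables; (frequency, time) order is assumed.
cnn_layers = [
    {"filters": 32, "kernel": (41, 11), "stride": (2, 2), "padding": (0, 10)},
    {"filters": 32, "kernel": (21, 11), "stride": (2, 1), "padding": (0, 2)},
]


def frontend_shape(freq, frames, layers):
    """Return (channels, freq, frames) after the CNN frontend."""
    channels = 1
    for layer in layers:
        freq = conv_out(freq, layer["kernel"][0], layer["stride"][0], layer["padding"][0])
        frames = conv_out(frames, layer["kernel"][1], layer["stride"][1], layer["padding"][1])
        channels = layer["filters"]
    return channels, freq, frames


channels, freq, frames = frontend_shape(freq_bins, 100, cnn_layers)
rnn_input_size = channels * freq  # features per time step fed to the first RNN layer
```

Under these assumptions, one second of audio (100 frames of 161 frequency bins) leaves the frontend as 32 channels of 21 bins over 49 time steps, so the first RNN layer would see 672 features per step.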
Testing
A patter test script is provided for evaluating a trained model. It takes a test configuration and a trained model as arguments.
cuda = true
batch_size = 10
num_workers = 4
[[dataset]]
name = "test"
manifest = "/path/to/manifests/test.jl"
augment = false
[decoder]
algorithm = "greedy" # or "beam"
workers = 4
# If `beam` is specified as the decoder type, the below is used to initialize the beam decoder
[decoder.beam]
beam_width = 30
cutoff_top_n = 40
cutoff_prob = 1.0
# If "beam" is specified and you want to use a language model, configure the ARPA or KenLM format LM and alpha/beta weights
[decoder.beam.lm]
lm_path = "/path/to/language/model.arpa"
alpha = 2.15
beta = 0.35
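For reference, the "greedy" option in the decoder configuration above corresponds to the usual CTC best-path decoding: take the highest-scoring label at every frame, merge consecutive repeats, then drop blanks. The following is a generic sketch of that technique, not patter's decoder. The function name and the choice to place the blank symbol after the last label are assumptions made for illustration.

```python
def greedy_ctc_decode(frame_scores, labels, blank_index):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


# Same label set as the example model configuration.
LABELS = ["'"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" "]
BLANK = len(LABELS)  # assumed position of the CTC blank


def one_hot(index, size):
    """Build a score vector with all mass on one index (toy model output)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-level path: H H E <blank> L L <blank> L O O
path = [LABELS.index(c) for c in "HHE"] + [BLANK]
path += [LABELS.index("L")] * 2 + [BLANK, LABELS.index("L")]
path += [LABELS.index("O")] * 2
frames = [one_hot(i, len(LABELS) + 1) for i in path]
text = greedy_ctc_decode(frames, LABELS, BLANK)
```

The blank between the two runs of "L" is what allows a doubled letter to survive the merge step. A beam decoder explores several candidate paths instead of only the best one, which is where the language model weights in `[decoder.beam.lm]` come into play.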
For more usage details, see the official documentation.
Open-source repository:
github/ryanleary/patter