Back in the Windows XP era, when setting up Windows' built-in speech recognition/dictation, I had to read a series of preset text samples aloud so the speech-to-text engine could build a personalized voice profile.
Today, with networked speech-to-text engines like Siri or Cortana, I can just start dictating.
The quality of the speech-to-text conversion seems equivalent, though my memory may be faulty on that point.
Have speech models advanced past the need for any personalization of the training data? Or do they just do the personalization under the covers now, without an explicit training wizard? Or do they skip training entirely, even though it would still be beneficial (e.g. because it's inconvenient for users)?