Questions tagged [speech-recognition]

12 questions
9
votes
2 answers

Why are HMMs appropriate for speech recognition when the problem doesn't seem to satisfy the Markov property?

I'm learning about HMMs and their applications and trying to understand their uses. My knowledge is a bit spotty, so please correct any incorrect assumptions I'm making. The specific example I'm wondering about is using HMMs for speech…
8
votes
1 answer

Why do mainstream speech models no longer require a personalized training step?

Back in the Windows XP era, when setting up the speech/dictation feature built into Windows, I had to read a set of preset text samples aloud to the speech-to-text engine to personalize my voice profile. Today, with networked speech-to-text engines…
tsutsu
3
votes
1 answer

Synchronizing speech and text

I have a text and a narration of the exact same text. What is the best way to synchronize them? By synchronizing I mean finding, for example, the location of each word in the audio. For example, if the sentence is "I took a cab", I want…
Ameer Jewdaki
2
votes
0 answers

State of the art in multi-modal command recognition

I'm currently researching various fusion methods in a multi-modal (video, audio, identity, user position and gesture) human-computer interaction environment (think in terms of a smart-home system). What is the current state of the art in this field…
Seanny123
2
votes
1 answer

What type of HMM-GMM do I need?

Context: I have 100 spoken sentences that I asked my friend to record. The vocabulary in the sentences is the same; only the order of the words changes. My friend says that he spoke exactly what was asked for each sentence. But I don't know whether…
Pupil
1
vote
0 answers

How much training data for speech recognition?

How much training data is needed to build a speech-to-text engine based on machine learning? (To within an order of magnitude or so.) Big companies like Google and Facebook have a massive amount of data. For ordinary people it's not possible to acquire…
1
vote
1 answer

How to use frame-based speech features for learning using a neural network classifier?

I am doing supervised learning on speech audio files using neural networks. For this purpose, I'll have to extract features from the audio file. But since an audio file is a time-varying signal, it is generally divided into multiple frames and then…
ksb
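The framing step mentioned in the excerpt above can be illustrated with a minimal sketch. It assumes the audio is already available as a plain sequence of samples; the function name `split_into_frames` and the parameters `frame_len` and `hop_len` are hypothetical names chosen for this illustration and are not taken from the question.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a 1-D sequence of samples into overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch; padding is another common choice).
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    # Advance by hop_len each time; stop once a full frame no longer fits.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# Toy signal: 10 samples, frames of 4 samples, advancing 2 samples at a time.
signal = list(range(10))
frames = split_into_frames(signal, frame_len=4, hop_len=2)
```

Each frame would then typically be turned into one feature vector, so a recording becomes a sequence of vectors rather than a single fixed-size input. How those per-frame vectors are fed to a classifier is the open part of the question and is not settled by this sketch.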
0
votes
0 answers

Examples for speech recognition systems and spoken dialogue systems

I am collecting material for a MOOC about speech technology. My aim is that students also have examples to try rather than just watching the lecture and some complementary YouTube videos. So the idea was that they could call up some spoken dialogue…
0
votes
1 answer

Is it computationally possible to voice-recognize and word-tag-time-align audiobooks to their actual text?

I would like to know whether it is computationally possible for a computer to take the words of an audiobook as input and output a file containing both the original audio and the text corresponding to each word (which could be reviewed by a…
0
votes
1 answer

Build Automatic Speech Recognition (ASR) from scratch

I want to build an Automatic Speech Recognition (ASR) engine for myself, but I have no idea where to start. I've read that most ASR systems are built upon Hidden Markov Models, but I've also read that HMMs are limited somehow and that a better approach is to…
0
votes
1 answer

Speaker-independent voice command recognition

I am looking for software, a library, or an algorithm that can be trained to recognize about a dozen speaker-independent voice commands. The commands will be very distinct phrases of 4-5 words each. They can be chosen to sound very different from…
Sigman
0
votes
1 answer

How to understand an equation related to speaker recognition?

This question refers to the following paper: Support Vector Machines for Speaker and Language Recognition, W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo, Computer Speech and Language 20 (2006) 210-229. I am…