
I'm reading “Speech segmentation without speech recognition” by Dong Wang, Lie Lu and Hong-Jiang Zhang. The algorithm I'm looking at is a V/C/P (Vowel/Consonant/Pause) classification algorithm on a digital speech signal. It is described as follows:

  1. Audio data is segmented into 20ms-long non-overlapping frames, from which features including ZCR, Energy and Pitch are extracted.

  2. The energy and pitch curves are smoothed.

  3. The Mean_En and Std_En of the energy curve are calculated to coarsely estimate the background noise energy level, as:

    NoiseLevel = Mean_En - 0.75 Std_En. 
    

    Similarly, the threshold of ZCR (ZCR_dyna) is defined as:

    ZCR_dyna = Mean_ZCR + 0.5 Std_ZCR
    
  4. Frames are coarsely classified as V/C/P using the following rules, where FrameType denotes the type of each frame.

    If ZCR > ZCR_dyna then FrameType = Consonant 
    Else if Energy < NoiseLevel, then  FrameType = Pause 
    Else FrameType = Vowel   
    
  5. Update the NoiseLevel as the weighted average energy of the frames at each vowel boundary and the background segments.

  6. Re-classify the frames using the algorithm in step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. Vowels are split at their energy valley if their duration is too long.
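For reference, here is my reading of steps 3 and 4 as a sketch (the `coarse_classify` function and its per-frame `energy`/`zcr` arrays are my own construction, not from the paper; I assume the features come from the smoothed curves of step 2):

```python
import numpy as np

def coarse_classify(energy, zcr):
    """Coarse V/C/P labelling per steps 3-4 (my reading of the paper).

    energy, zcr: per-frame feature arrays from 20 ms non-overlapping
    frames, already smoothed. Returns a list of 'C'/'P'/'V' labels.
    """
    # Step 3: global statistics of the curves give the two thresholds.
    noise_level = np.mean(energy) - 0.75 * np.std(energy)
    zcr_dyna = np.mean(zcr) + 0.5 * np.std(zcr)

    # Step 4: per-frame rules, applied in the order the paper gives them.
    labels = []
    for e, z in zip(energy, zcr):
        if z > zcr_dyna:
            labels.append('C')   # high ZCR -> consonant
        elif e < noise_level:
            labels.append('P')   # low energy -> pause
        else:
            labels.append('V')   # otherwise vowel
    return labels
```

Step 6 then re-runs this with a NoiseLevel updated in step 5, which is the part I'm stuck on below.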

I do not understand step #5. Like I don't know how to interpret the wording - is there another way to describe what they are doing? I get that we want to update the NoiseLevel variable and re-run step #4 for every frame, I just don't understand how exactly.

Gilles 'SO- stop being evil'
YoungMoney
