
I'm reading “Speech segmentation without speech recognition” by Dong Wang, Lie Lu and Hong-Jiang Zhang. The algorithm I'm looking at is a V/C/P (Vowel/Consonant/Pause) classification algorithm on a digital speech signal. It is described as follows:

  1. Audio data is segmented into 20ms-long non-overlapping frames, from which features including ZCR, Energy and Pitch are extracted.

  2. The energy and pitch curves are smoothed.

  3. The Mean_En and Std_En of the energy curve are calculated to coarsely estimate the background noise energy level, as:

    NoiseLevel = Mean_En - 0.75 Std_En. 
    

    Similarly, the threshold of ZCR (ZCR_dyna) is defined as:

    ZCR_dyna = Mean_ZCR + 0.5 Std_ZCR
    
  4. Frames are coarsely classified as V/C/P using the following rules, where FrameType denotes the type of each frame.

    If ZCR > ZCR_dyna then FrameType = Consonant 
    Else if Energy < NoiseLevel, then  FrameType = Pause 
    Else FrameType = Vowel   
    
  5. Update the NoiseLevel as the weighted average energy of the frames at each vowel boundary and the background segments.

  6. Re-classify the frames using the algorithm in step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. Vowels are split at their energy valley if their duration is too long.
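For reference, here is my reading of steps 3 and 4 as a sketch (the `coarse_classify` function and its per-frame `energy`/`zcr` arrays are my own construction, not from the paper; I assume the features come from the smoothed curves of step 2):

```python
import numpy as np

def coarse_classify(energy, zcr):
    """Coarse V/C/P labelling per steps 3-4 (my reading of the paper).

    energy, zcr: per-frame feature arrays from 20 ms non-overlapping
    frames, already smoothed. Returns a list of 'C'/'P'/'V' labels.
    """
    # Step 3: global statistics of the curves give the two thresholds.
    noise_level = np.mean(energy) - 0.75 * np.std(energy)
    zcr_dyna = np.mean(zcr) + 0.5 * np.std(zcr)

    # Step 4: per-frame rules, applied in the order the paper gives them.
    labels = []
    for e, z in zip(energy, zcr):
        if z > zcr_dyna:
            labels.append('C')   # high ZCR -> consonant
        elif e < noise_level:
            labels.append('P')   # low energy -> pause
        else:
            labels.append('V')   # otherwise vowel
    return labels
```

Step 6 then re-runs this with a NoiseLevel updated in step 5, which is the part I'm stuck on below.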

I do not understand step #5. Like I don't know how to interpret the wording - is there another way to describe what they are doing? I get that we want to update the NoiseLevel variable and re-run step #4 for every frame, I just don't understand how exactly.

Gilles 'SO- stop being evil'
YoungMoney
