How to understand an equation related to speaker recognition?

Question

This question refers to the following paper:

Support Vector Machines for Speaker and Language Recognition, W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo, Computer Speech and Language 20 (2006) 210-229.

I am trying to implement the algorithms in Table 1 and Table 2 on page 18. In step 6 of Table 1, they calculate $b_z^i$ as a mean (or sum) of $b(z_i)$, where the number of entries is $N_z$, which they claim to be the number of features.

The question is what $N_z$ is here. As I understand it, each feature set, which has dimension $N_z$, has been used to create $b(z_i)$, so what does this summation mean? One can only sum over the time dimension, which has nothing to do with $N_z$. $N_z$ is a kind of spatial dimension, since one time frame of data is converted to features.

score 0 · Accepted Answer · answered Apr 18 '19 at 23:09


$N_z$ is the number of frames in the utterance, so it is exactly the time dimension. Instead of "number of features", they should say "number of feature vectors".
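To make the averaging concrete, here is a minimal sketch in plain Python. It assumes that $b(\cdot)$ is a polynomial expansion of one frame (all monomials up to degree 2 here) and that the utterance is a list of $N_z$ feature vectors. The function names `expand` and `mean_expansion` and the toy data are hypothetical, not taken from the paper.

```python
from itertools import combinations_with_replacement


def expand(frame, degree=2):
    """Assumed form of b(z_i): all monomials of one frame up to `degree`."""
    terms = []
    for d in range(degree + 1):
        # d = 0 yields the single empty combination, i.e. the constant term 1.
        for combo in combinations_with_replacement(range(len(frame)), d):
            value = 1.0
            for idx in combo:
                value *= frame[idx]
            terms.append(value)
    return terms


def mean_expansion(frames, degree=2):
    """Average the expanded vectors over the N_z frames (the time axis)."""
    n_z = len(frames)  # number of feature vectors, not the feature dimension
    expanded = [expand(frame, degree) for frame in frames]
    # zip(*expanded) groups the same expansion term across all frames.
    return [sum(column) / n_z for column in zip(*expanded)]


# Toy utterance: N_z = 3 frames, each a 2-dimensional feature vector.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b_bar = mean_expansion(frames)
```

The result has one entry per expansion term (6 for two-dimensional frames at degree 2), and its length does not depend on how many frames the utterance contains. The sum runs over frames only.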


Nikolay Shmyrev

