I'm currently researching fusion methods for a multi-modal (video, audio, identity, user position, and gesture) human-computer interaction environment (think of a smart-home system). What is the current state of the art in this field, and which fusion methods do those systems use?
The most recent publication I found was a PhD thesis that relied on lip reading, but I don't think that methodology is reasonable in the kind of wide environment I'm considering.
Additionally, I found this publication, which fuses the various channels into "acts", but it appears to focus on semantic-level fusion, which strikes me as simplistic.
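
To make clear what I mean by semantic-level (decision-level) fusion, here is a minimal sketch of my own; the modality names, intents, and weighting scheme are invented for illustration and are not taken from the publication above:

```python
# Minimal illustration of decision-level ("semantic") fusion: each modality is
# classified independently, and only the symbolic outputs are combined, in
# contrast to feature-level fusion, where low-level features would be merged
# before a single classifier. Names and weights here are hypothetical.
from dataclasses import dataclass

@dataclass
class ModalityHypothesis:
    """A per-modality interpretation of the user's intent plus a confidence."""
    modality: str      # e.g. "speech", "gesture", "position"
    intent: str        # e.g. "turn_on_lights"
    confidence: float  # in [0, 1]

def late_fusion(hypotheses: list[ModalityHypothesis],
                weights: dict[str, float]) -> str:
    """Combine independent per-modality decisions by weighted voting."""
    scores: dict[str, float] = {}
    for h in hypotheses:
        w = weights.get(h.modality, 1.0)
        scores[h.intent] = scores.get(h.intent, 0.0) + w * h.confidence
    return max(scores, key=scores.get)

if __name__ == "__main__":
    hyps = [
        ModalityHypothesis("speech", "turn_on_lights", 0.8),
        ModalityHypothesis("gesture", "turn_on_lights", 0.6),
        ModalityHypothesis("position", "turn_on_tv", 0.4),
    ]
    # Weighted vote favours the intent supported by the most reliable channels.
    print(late_fusion(hyps, {"speech": 1.0, "gesture": 0.7, "position": 0.5}))
    # -> "turn_on_lights"
```

My concern is that this kind of fusion discards the low-level temporal and cross-modal correlations between channels, which is why I'm looking for approaches that go beyond it.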