
I've just realized my prediction approach for LSTM might not be correct.

I am trying to predict text character by character, reading through a book. The way I've approached the problem is as follows:

   b                                    c                d                e
   ^     carry cell state forward       ^                ^                ^
LSTM_t0  ------------------------->  LSTM_t1  ----->  LSTM_t2  ----->  LSTM_t3
   ^                                    ^                ^                ^
   a                                    b                c                d

This means I have 4 timesteps, and at each one I feed the next letter into the LSTM, expecting it to immediately predict the letter that follows.
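
For concreteness, here is a minimal sketch of this first setup (PyTorch here, and the CharLSTM class and sizes such as vocab_size are just illustrative placeholders, not part of the actual task):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, embed_dim, hidden_dim = 65, 32, 128   # arbitrary example sizes

    class CharLSTM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, state=None):
            # x: (batch, seq_len) of character indices, e.g. [a, b, c, d]
            emb = self.embed(x)
            out, state = self.lstm(emb, state)   # out: (batch, seq_len, hidden_dim)
            return self.head(out), state         # logits at every timestep

    model = CharLSTM()
    inputs  = torch.randint(0, vocab_size, (1, 4))   # stands in for [a, b, c, d]
    targets = torch.randint(0, vocab_size, (1, 4))   # stands in for [b, c, d, e]
    logits, _ = model(inputs)

    # one cross-entropy term per timestep -> 4 loss values averaged together
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()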

Should I instead do this:

ignore          ignore           ignore             e
  ^               ^                ^                ^
LSTM_t0  ---->  LSTM_t1  ----->  LSTM_t2  ----->  LSTM_t3
  ^               ^                ^                ^
  a               b                c                d

In the first case I get 4 loss values, but in the second example I only have 1 source of gradient, at t3.
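
In code, the second setup differs only in which logits feed the loss; a sketch reusing the hypothetical CharLSTM above:

    # Second setup: only the final timestep's prediction contributes to the loss;
    # the outputs at t0..t2 are still computed but carry no gradient signal.
    logits, _ = model(inputs)            # (batch, 4, vocab_size)
    last_logits = logits[:, -1, :]       # prediction at t3 only
    last_target = targets[:, -1]         # the single label, 'e'
    loss = F.cross_entropy(last_logits, last_target)
    loss.backward()                      # one source of gradient, at t3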

My main concern with the first example is that I demand the LSTM predict 'b' and 'c' without supplying it enough previous context. It's fine for 'd' and 'e', but asking for an answer at timesteps 0 and 1 seems a bit unfair?

What would be best for this particular example?

Kari

1 Answer


Your first example is basically not a sequential model. You have an input and an output and that's it. You don't need a recurrent layer for that...

What I would suggest you do is:

    e               f               g                         
    ^               ^               ^ 
 LSTM_t1  ---->  LSTM_t2  -----> LSTM_t3  ----->  ...
    ^               ^               ^  
  abcd            bcde            cdef               

Obviously, you will lose the ability to predict the first characters, but hopefully your dataset will include many more examples using those specific characters, so the model will learn how to use them eventually. I would say this is the most general way of producing char-to-char text generation models. A couple of examples, including implementations: here and here.
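
To make this concrete, here is a sketch of how those sliding windows could be built and fed through the hypothetical CharLSTM from the question's sketch (the window length of 4, the toy corpus, and the PyTorch details are assumptions for illustration):

    import torch
    import torch.nn.functional as F

    # Sliding-window pairs: each input is a window of 4 characters, the target is
    # the character that follows ("abcd" -> "e"). Toy corpus for illustration only.
    text = "hello world, this is a tiny corpus"
    stoi = {c: i for i, c in enumerate(sorted(set(text)))}

    window = 4
    pairs = [(text[i:i + window], text[i + window])
             for i in range(len(text) - window)]          # ("hell", "o"), ...

    xs = torch.tensor([[stoi[c] for c in w] for w, _ in pairs])   # (N, 4)
    ys = torch.tensor([stoi[t] for _, t in pairs])                # (N,)

    # Feed the windows through the model and train on the last timestep only,
    # i.e. predict character 5 from characters 1-4 (model as sketched earlier).
    logits, _ = model(xs)
    loss = F.cross_entropy(logits[:, -1, :], ys)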

TitoOrt