
I have a list of stock price sequences with 20 timesteps each. That's a 2D array of shape (total_seq, 20). I can reshape it into (total_seq, 20, 1) for concatenation with other features.
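
For example, with made-up data (total_seq is just a placeholder here):

import numpy as np

total_seq = 100
prices = np.random.rand(total_seq, 20)       # (total_seq, 20)
prices = prices.reshape(total_seq, 20, 1)    # (total_seq, 20, 1)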

I also have a news title of 10 words for each timestep, so I have a 3D array of shape (total_seq, 20, 10) holding the news tokens from Tokenizer.texts_to_sequences() and sequence.pad_sequences().
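
Something along these lines produces that array (a sketch; titles stands for the flat list of one title per timestep):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokenizer = Tokenizer()
tokenizer.fit_on_texts(titles)
tokens = tokenizer.texts_to_sequences(titles)         # list of word-index lists
tokens = sequence.pad_sequences(tokens, maxlen=10)    # (total_seq * 20, 10)
news = tokens.reshape(total_seq, 20, 10)              # (total_seq, 20, 10)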

I want to concatenate the news embedding to the stock price and make predictions.

My idea is that the news embedding should return a tensor of shape (total_seq, 20, embed_size), so that I can concatenate it with the stock prices of shape (total_seq, 20, 1) and then connect it to LSTM layers.

To do that, I would have to convert the news tokens of shape (total_seq, 20, 10) into embeddings of shape (total_seq, 20, 10, embed_size) using the Embedding() layer.

But in Keras, the Embedding() layer takes a 2D tensor, not a 3D tensor. How do I get around this problem?

Even if Embedding() accepted a 3D tensor and gave me a 4D tensor as output, I would then remove the third dimension by using an LSTM to return only the last word's embedding, so the output of shape (total_seq, 20, 10, embed_size) would be converted to (total_seq, 20, embed_size).

But then I would run into another problem: LSTM accepts a 3D tensor, not a 4D one.

How do I get around Embedding and LSTM not accepting my inputs?


1 Answer


I'm not entirely sure if this is the cleanest solution, but I stitched everything together. Each of the 10 word positions gets its own input, but that shouldn't be too much of a problem. The idea is to make one Embedding layer and use it multiple times. First we will generate some data:

import numpy as np

n_samples = 1000
time_series_length = 50
news_words = 10
news_embedding_dim = 16
word_cardinality = 50

## Stock price time series: (n_samples, time_series_length, 1)
x_time_series = np.random.rand(n_samples, time_series_length, 1)

## News tokens: (n_samples, time_series_length, news_words), split into one array per word position
x_news_words = np.random.choice(np.arange(word_cardinality), replace=True, size=(n_samples, time_series_length, news_words))
x_news_words = [x_news_words[:, :, i] for i in range(news_words)]

## Binary target
y = np.random.randint(2, size=(n_samples))
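
A quick shape check on the generated data:

print(x_time_series.shape)                        # (1000, 50, 1)
print(len(x_news_words), x_news_words[0].shape)   # 10 (1000, 50)
print(y.shape)                                    # (1000,)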

Now we will define the layers:

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, TimeDistributed, concatenate

## Input of the normal time series
time_series_input = Input(shape=(time_series_length, 1), name='time_series')

## Every word position gets its own input
news_word_inputs = [Input(shape=(time_series_length, ), name='news_word_' + str(i + 1)) for i in range(news_words)]

## Shared embedding layer
news_word_embedding = Embedding(word_cardinality, news_embedding_dim, input_length=time_series_length)

## Apply it to every word position
news_words_embeddings = [news_word_embedding(inp) for inp in news_word_inputs]

## Concatenate the time series input and the embedding outputs
concatenated_inputs = concatenate([time_series_input] + news_words_embeddings, axis=-1)

## Feed into an LSTM
lstm = LSTM(16)(concatenated_inputs)

## Output, in this case a single classification
output = Dense(1, activation='sigmoid')(lstm)
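
To wire this up into a model and compile it (a minimal sketch, assuming binary cross-entropy and the Adam optimizer since y here is binary):

model = Model(inputs=[time_series_input] + news_word_inputs, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])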

After compiling the model we can just fit it like this:

model.fit([x_time_series] + x_news_words, y)

EDIT:

Following up on what you mentioned in the comments, you can add a dense layer that summarizes the news and concatenate that with your time series (stock prices):

## Size of the summarized news representation (example value)
combined_news_embedding = 16

## Summarize the news: concatenate the word embeddings and project them down per timestep
news_words_concat = concatenate(news_words_embeddings, axis=-1)
news_words_transformation = TimeDistributed(Dense(combined_news_embedding))(news_words_concat)

## New concatenation of the time series with the summarized news
concatenated_inputs = concatenate([time_series_input, news_words_transformation], axis=-1)
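
The rest of the model stays the same; the new concatenated_inputs simply replaces the old one before the LSTM (a sketch reusing the definitions from above):

lstm = LSTM(16)(concatenated_inputs)
output = Dense(1, activation='sigmoid')(lstm)
model = Model(inputs=[time_series_input] + news_word_inputs, outputs=output)
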
Jan van der Vegt