
I am attempting to improve my RNN model by making my dependent variable, a stock price, stationary. I am aiming to do this by removing the trend with a log transformation and then performing moving-average differencing to remove noise.

I have a function that first logs the series, to penalise the larger values, and then performs rolling-mean differencing on the values.

import numpy as np
import pandas as pd

def moving_avg_differencing(col, n_roll=30, drop=False):
    log_values = np.log(col)                        # log-transform to damp the larger values
    moving_avg = log_values.rolling(n_roll).mean()  # rolling mean over the last n_roll points
    ma_diff = log_values - moving_avg               # subtract the rolling mean to de-trend
    return ma_diff.dropna() if drop else ma_diff    # optionally drop the leading NaN rows

My conundrum is this: if I perform this differencing before my train-val-test split, my validation and test sets will be informed by mean values that precede their respective rows.

If I perform the differencing after my train-val-test split, and apply the transformation to each split individually, I will have 30 NaN values at the start of my validation and test sets. This seems messy.
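For concreteness, here is a minimal sketch of that second option, assuming the function above and a synthetic price series (the names and split points are purely illustrative):

import numpy as np
import pandas as pd

# Synthetic prices, purely for illustration
prices = pd.Series(100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 500))))

train, val, test = prices[:300], prices[300:400], prices[400:]

# Transforming each split on its own means the rolling window has no history,
# so the first n_roll - 1 rows of each split come out as NaN
val_diff = moving_avg_differencing(val, n_roll=30)
print(val_diff.isna().sum())  # 29 with pandas' default min_periods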

Is there a better approach to differencing?

1 Answer


According to the econometrics literature, the standard approach is to convert your data into log returns as follows: $r_t = \log(P_t / P_{t-1})$, where $P_t$ is the price at timestep $t$. This improves results because it de-trends the input and is relatively stationary compared to raw prices.
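As a minimal sketch (with made-up prices), the log returns can be computed in pandas in one line:

import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.5, 100.8, 102.3, 103.0])  # made-up prices
log_returns = np.log(prices / prices.shift(1))           # r_t = log(P_t / P_{t-1})
# Equivalently: np.log(prices).diff()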

There is little difference whether this is performed before or after the train-test split, because the log return of each row relies only on the previous row. Specifically, if you do it after the split you just lose one row of data.
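A quick sketch of that claim, reusing the synthetic prices series from the question's sketch and an arbitrary split point at row 300:

returns_before = np.log(prices / prices.shift(1))[300:]  # returns computed on the full series, then split
val_prices = prices[300:]
returns_after = np.log(val_prices / val_prices.shift(1)) # returns computed on the split alone

# The two agree on every row except the first, which is NaN when computed after the split
print(returns_before.iloc[1:].equals(returns_after.iloc[1:]))  # True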
