I am working on a regression problem, trying to predict a target variable from seven predictor variables. I have a tabular dataset of 1400 rows. Before building a machine learning model, I did an exploratory data analysis (EDA) and computed the Pearson correlation coefficients (r) shown below. Note that the heatmap includes both the numerical predictor variables and the target variable.
[![Correlation between the predictor variables(including the target)][1]][1]
I am wondering about the following questions:
- We see that `pv3` is highly correlated with `pv6`, `pv7`, `pv4` and `pv5`. Would it be a good strategy to leave out `pv6`?
- Can we make any other obvious inferences from this heatmap?
- Another piece of domain information I have is that `pv7` is a renormalization of the target. But its correlation is only `0.42`. Why is this the case? I have not scaled or normalised any of the data columns. I do see that the scale of `pv7` and the scale of `target` are very different. Perhaps I should be scaling all the numerical columns before I compute the correlations?

  [1]: https://i.sstatic.net/4tBkI.png
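
To make the last question concrete, here is a minimal sketch of the check I have in mind. It runs on synthetic data, not my real columns: `pv7_like` and `factor` are hypothetical stand-ins, and dividing the target by a per-row factor is only an assumption about what a "renormalization" might look like. It compares Pearson r on the raw columns, on standardized columns, and on a purely linear rescaling of the target.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / std for a in x]


random.seed(0)
n_rows = 1400  # same size as my dataset; the values themselves are synthetic

# Hypothetical stand-ins, NOT my real columns.
target = [random.gauss(50.0, 10.0) for _ in range(n_rows)]
# Assumption: the normalising factor differs from row to row.
factor = [random.uniform(1.0, 100.0) for _ in range(n_rows)]
pv7_like = [t / f for t, f in zip(target, factor)]

# Correlation on the raw columns (very different scales).
r_raw = pearson_r(target, pv7_like)
# Correlation after standardizing both columns.
r_scaled = pearson_r(standardize(target), standardize(pv7_like))
# Correlation with a purely linear rescaling of the target.
r_linear = pearson_r(target, [0.001 * t + 3.0 for t in target])

print(f"raw columns:          r = {r_raw:.4f}")
print(f"standardized columns: r = {r_scaled:.4f}")
print(f"pure linear rescale:  r = {r_linear:.4f}")
```

In this toy setup, standardizing leaves r unchanged, since Pearson r is invariant under positive linear rescaling of either variable, and a purely linear rescaling of the target gives r equal to 1 up to floating-point error. If the same holds for my data, the difference in scale alone would not explain the `0.42`, and the renormalization would have to be non-linear or depend on something that varies per row.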