Scaling and handling highly correlated features in tabular data for regression

Question

I am working on a regression problem, trying to predict a target variable from seven predictor variables. I have a tabular dataset of 1400 rows. Before building a machine learning model, I did an exploratory data analysis (EDA) and obtained the Pearson correlation coefficients (r) shown below. Note that I have included both the numerical predictor variables and the target variable.

[![Correlation between the predictor variables(including the target)][1]][1]

I am wondering about the following questions:

1. We see that pv3 is highly correlated with pv6, pv7, pv4 and pv5. Would it be a good strategy to leave out pv6?
2. Can we make any other obvious inferences from this heatmap?
3. Another piece of domain information I have is that pv7 is a renormalization of the target, yet its correlation with the target is only 0.42. Why is this the case? I have not scaled or normalised any of the data columns, and I do see that the scale of pv7 and the scale of the target are very different. Perhaps I should scale all the numerical columns before I compute the correlations?

[1]: https://i.sstatic.net/4tBkI.png

Subhash C. Davar · Answer 1 · 2023-03-13T20:00:00.620

pv3 is highly correlated with a number of other variables, so it can be dropped to avoid multiple interactions. pv6 has both positive and negative correlations with the other variables; in light of these mixed positive and negative effects on several variables, it is desirable to remove it as well. Whether to scale or normalize should be evaluated while preparing the data for analysis. Scatter plots of the probable relationships in the observed data, or descriptive statistics, could be useful in deciding whether the data need to be transformed.
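
On the question of scaling before computing correlations, a small experiment may help. The sketch below uses made-up synthetic arrays (`target` and `feature` are hypothetical stand-ins, not the columns from the question) and only NumPy. It illustrates two general properties of Pearson r: it does not change when a column is multiplied by a positive constant, shifted, or standardized, and it can fall well below 1 when one column is a nonlinear function of the other. Whether the second property explains the 0.42 seen for pv7 depends on how pv7 was actually derived, which the question does not say.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-ins (assumption): these are NOT the asker's data.
target = rng.normal(loc=50.0, scale=10.0, size=1400)
feature = 0.5 * target + 5.0 * rng.normal(size=1400)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def standardize(x):
    """Rescale to zero mean and unit variance (z-scores)."""
    return (x - x.mean()) / x.std()


# Correlation on the raw columns.
r_raw = pearson_r(feature, target)

# Same correlation after a positive linear rescaling of one column.
r_rescaled = pearson_r(1000.0 * feature + 3.0, target)

# Same correlation after standardizing both columns.
r_standardized = pearson_r(standardize(feature), standardize(target))

# A monotone but nonlinear function of the target: fully determined by
# the target, yet its Pearson r with the target is below 1.
r_nonlinear = pearson_r(np.exp(target / 10.0), target)

print(f"raw:          {r_raw:.4f}")
print(f"rescaled:     {r_rescaled:.4f}")
print(f"standardized: {r_standardized:.4f}")
print(f"nonlinear:    {r_nonlinear:.4f}")
```

The first three printed values should agree to floating-point precision, so scaling the columns would not be expected to change the heatmap. Scaling can still matter for the regression model itself, depending on the estimator used.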
