I'm currently working on applying data science to a High Performance Computing cluster by analyzing the log files it generates, trying to find patterns that lead to a system failure (specifically STALE FILE HANDLE errors in the GPFS file system, for now). I am categorizing the log messages and clustering on their instance counts per time interval. Since some message categories are far more frequent than others in any given time frame, I don't want the clustering to be biased toward the variable with the maximum variance.
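To make the setup concrete, here is a minimal sketch in R (the data and category names are made up) of what I mean: per-interval message counts, with each column standardized so that the high-variance categories don't dominate the distance metric before clustering:

```r
# Hypothetical toy data: rows = time intervals, columns = message categories.
set.seed(1)
counts <- cbind(
  stale_fh  = rpois(100, lambda = 2),   # rare messages
  disk_warn = rpois(100, lambda = 50),  # frequent messages, much larger variance
  net_err   = rpois(100, lambda = 10)
)

# Standardize each column (mean 0, sd 1) so no single category
# dominates the Euclidean distances, then cluster.
km <- kmeans(scale(counts), centers = 3, nstart = 25)
table(km$cluster)
```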
It's unclear exactly what the OP is asking (so this response is somewhat general), but the table below lists common contexts and the transformations that are typical for each:
| Context | Typical transformation |
| --- | --- |
| sales, revenue, income, price | log(x) |
| distance | 1/x, 1/x^2, log(x) |
| market share, preference share | e^x / (1 + e^x) |
| right-tailed distribution | sqrt(x), log(x) (caution: log(x) is undefined for x <= 0) |
| left-tailed distribution | x^2 |
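For example, a right-tailed variable such as a per-interval message count often becomes roughly symmetric under a log transform. A quick sketch with hypothetical log-normal data; `log1p()` sidesteps the x <= 0 caution when zero counts are possible:

```r
# Hypothetical right-tailed data; log1p(x) computes log(1 + x),
# which is defined at x = 0, unlike log(x).
set.seed(7)
x <- rlnorm(1000, meanlog = 2, sdlog = 1)

par(mfrow = c(1, 2))
hist(x,        main = "raw (right-tailed)", xlab = "x")
hist(log1p(x), main = "log(1 + x)",         xlab = "log1p(x)")
```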
You can also use John Tukey's three-point method as discussed in this post. When specific transformations don't work, use the Box-Cox transformation. In R, use the car package: compute lambda with lambda <- coef(powerTransform(x)), then call bcPower(x, lambda) to transform. Consider Box-Cox transformations for all variables with skewed distributions before computing correlations or creating scatterplots.
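A minimal sketch of that workflow, assuming a strictly positive numeric vector of counts (toy data here; the default Box-Cox family in powerTransform() requires positive values):

```r
library(car)

# Toy skewed, strictly positive data standing in for per-interval counts.
set.seed(42)
counts <- rexp(200, rate = 0.1) + 1

# Estimate the Box-Cox lambda, then apply the transformation.
lambda <- coef(powerTransform(counts))
transformed <- bcPower(counts, lambda)

hist(transformed, main = "Box-Cox transformed", xlab = "bcPower(counts, lambda)")
```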
Brandon Loudermilk