
I have data that map several nominal variables and one real parameter into a real value. For example:

('A', 'left', 'male', 'dog', 1.3459) -> 3.453
('A', 'top', 'male', 'dog', 6.3459) -> 6.137
...
('C', 'right', 'female', 'cat', 4.726) -> 1.456

I need to use these data to fit a function so that I can predict values for new input. For example:

('C', 'top', 'female', 'cat', 0.3459) -> ?

Fortunately, to a good approximation the output is a linear function of the real argument: y = c + k*x. However, our hypothesis is that we can improve the quality of the fit if we assume that the parameters of the linear fit ("c" and "k") depend on the nominal variables.

My question is: how can I find the dependency of the fit parameters on the nominal variables?

The first idea is to consider all possible combinations of the values of the nominal variables and to perform a separate fit for each combination. Say, we take all the data for ('A', 'left', 'male', 'dog') and perform the linear fit; then we do the same for ('A', 'left', 'male', 'cat'), and so on. However, this approach will not work, since many combinations have no data points at all or only a small number of them.
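
In code, this per-combination approach would look something like the sketch below (the pandas layout and column names are just an illustration):

    import numpy as np
    import pandas as pd

    # Hypothetical layout: one row per observation, v1..v4 nominal.
    df = pd.DataFrame(
        [("A", "left", "male", "dog", 1.3459, 3.453),
         ("A", "top", "male", "dog", 6.3459, 6.137),
         ("C", "right", "female", "cat", 4.726, 1.456)],
        columns=["v1", "v2", "v3", "v4", "x", "y"],
    )

    # One linear fit y = c + k*x per combination of nominal values.
    fits = {}
    for key, group in df.groupby(["v1", "v2", "v3", "v4"]):
        if len(group) >= 2:  # a fit needs at least two points...
            k, c = np.polyfit(group["x"], group["y"], deg=1)
            fits[key] = (c, k)
        # ...and many combinations have fewer, which is the problem.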

Alternatively, I can fix one nominal variable and make independent fits for each of its values, ignoring the other variables. Then I can do the same for the second nominal variable, and so on. But then the question is how to combine these independent fits.

So, what would be your approach?

Roman
  • One thing you can do is use linear regression by introducing dummy variables that take discrete values for the nominal variables (see the sketch after these comments). But this question will definitely get a better treatment on stats.SE – Raskolnikov Jul 05 '13 at 13:11
  • I believe that one can do more. For example, one can group by one of the many nominal variables (or a few of them). One can independently consider different nominal variables and then somehow combine the different dependencies... By the way, thanks for the link. – Roman Jul 05 '13 at 14:07
  • This does not seem to be possible without additional assumptions or information. Say the function is the weight of an animal vs. time, and there are two nominal (non-numeric) parameters, Dog/Cat and Male/Female. For each combination of nominal params we have W(t)=W0+c*t, where W is the weight and W0 is the birth weight. Suppose we know W0(D,M), W0(D,F), W0(C,M). Is there any way to extrapolate this to W0(C,F)? Clearly we cannot do this without making assumptions, something like: for all considered species the birth weight of a male is larger than that of a female by the same percentage. – Maxim Umansky Jul 06 '13 at 20:19
  • @Maxim Umansky, aren't extrapolation and interpolation always about an assumption and a reasonable guess? All we have is data for some cases, and we would like to predict the value for a case we have never observed. We always make the most reasonable, natural and simple guess and hope that our guess is close to reality. For example, we have the following (x,y) pairs: (1,1), (3,9), (4,16), (5,25), (6,36). We guess that for x=2 we have y=4, but it might be that for x=2 we have y=100. – Roman Jul 08 '13 at 08:02
  • There is a proposal for a new Stack Exchange website for mathematical modeling that would deal with questions like this: http://area51.stackexchange.com/proposals/57490/mathematical-modeling?referrer=B5TWQ70QNWlDGxDR-o9vQA2. If you're interested, you can post more sample questions there. – Brian Rushton Jul 09 '13 at 03:33
  • @Roman If the first argument is numeric, then it is fine to interpolate or extrapolate. But if the samples are like this: ("plate",1), ("car",3), ("fork",16), ("snake",25), ("ocean",36) - how would you interpolate to a first argument equal to "trombone"? That's the issue here. – Maxim Umansky Jul 14 '13 at 14:16
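
A minimal sketch of the dummy-variable suggestion from the first comment (assuming a recent scikit-learn; the dummy-by-x interaction columns are one way to let both c and k depend on the nominal values):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical training data: nominal tuples plus real parameter x.
    nominal = [("A", "left", "male", "dog"),
               ("A", "top", "male", "dog"),
               ("C", "right", "female", "cat")]
    x = np.array([1.3459, 6.3459, 4.726])
    y = np.array([3.453, 6.137, 1.456])

    enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    D = enc.fit_transform(nominal)  # one dummy column per nominal level

    # Features: dummies (shift the intercept c per level), x itself,
    # and dummy*x interactions (shift the slope k per level).
    X = np.hstack([D, x[:, None], D * x[:, None]])
    model = LinearRegression().fit(X, y)

    # Predict for a previously unseen combination.
    q = enc.transform([("C", "top", "female", "cat")])
    q_x = np.array([0.3459])
    print(model.predict(np.hstack([q, q_x[:, None], q * q_x[:, None]])))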

2 Answers


Here is one approach. First, for each training example, calculate the expected output using just the real value and the learnt linear relation. Next, subtract the real output from this estimate to obtain your prediction error. You can now rewrite your training set as:

('A', 'left', 'male', 'dog') -> -0.2
('A', 'top', 'male', 'dog') -> 0.41
...
('C', 'right', 'female', 'cat') -> 0.251

i.e., with just the nominal variables as inputs and the errors resulting from the learnt linear approximation as outputs.
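
In code, this first step might look like the following rough sketch (a single global fit over all the data is one way to obtain the initial linear relation):

    import numpy as np

    # x: real parameters, y: observed outputs, over the whole training set.
    x = np.array([1.3459, 6.3459, 4.726])
    y = np.array([3.453, 6.137, 1.456])

    # Initial linear relation y ~ c + k*x, learnt while ignoring the
    # nominal variables entirely.
    k, c = np.polyfit(x, y, deg=1)

    # Prediction error (estimate minus real output): these become the
    # targets for a model over the nominal variables only.
    errors = (c + k * x) - y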

You now want to learn the mapping between the nominal variables and the error of the linear prediction.

My initial suggestion would be to learn a decision tree [1] to estimate just c (using your original k). Then, using these modified c values, learn a new decision tree to estimate the gradient k. You might want to repeat these two steps a few times, i.e., re-estimate the c values using the latest k values, and so on. I suspect (/hope!) this process would converge, and the final two decision trees (one for c and one for k) can be used to predict c and k for a given input set of nominal variables.
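
A rough sketch of this alternating scheme (scikit-learn decision trees; the update equations are one reading of the steps above, and convergence is not guaranteed):

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical training data.
    nominal = [("A", "left", "male", "dog"),
               ("A", "top", "male", "dog"),
               ("C", "right", "female", "cat")]
    x = np.array([1.3459, 6.3459, 4.726])  # assumed nonzero below
    y = np.array([3.453, 6.137, 1.456])

    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    N = enc.fit_transform(nominal)  # trees need numeric inputs

    # Start from a single global fit and use its slope everywhere.
    k_global, _ = np.polyfit(x, y, deg=1)
    k_hat = np.full_like(x, k_global)

    for _ in range(5):  # a few alternating passes
        # With slopes fixed, each example implies an intercept
        # c_i = y_i - k_i*x_i; learn a tree from nominals to c.
        tree_c = DecisionTreeRegressor(max_depth=3).fit(N, y - k_hat * x)
        c_hat = tree_c.predict(N)
        # With intercepts fixed, implied slopes are k_i = (y_i - c_i)/x_i.
        tree_k = DecisionTreeRegressor(max_depth=3).fit(N, (y - c_hat) / x)
        k_hat = tree_k.predict(N)

    # Predict for a new input: c and k from the trees, then c + k*x.
    q = enc.transform([("C", "top", "female", "cat")])
    print(tree_c.predict(q) + tree_k.predict(q) * 0.3459)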

I hope that helps - or at least gives you an approach to consider.

user83704

I would try some locality-sensitive hashing (LSH) function to map the nominal parameters (or entire tuples) into numbers. The general idea behind this solution is dimensionality reduction, i.e. reducing over-complex data to a smaller domain.

An arbitrary hash function would also map nominal parameters (or tuples) to numbers, but the interpolation probably wouldn't work, because most hashing functions are designed to avoid collisions between similar objects (they are used for distinguishing objects), in contrast to LSH functions.
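
For illustration, here is a toy MinHash-style sketch (MinHash is one LSH family for Jaccard similarity; this only demonstrates the "similar inputs tend to collide" property, not a tuned implementation):

    import random

    def tokens(tup):
        # Represent a nominal tuple as a set of position=value tokens.
        return {f"{i}={v}" for i, v in enumerate(tup)}

    def minhash(tup, seeds):
        # One min-hash component per seed; Python's built-in hash is
        # salted per process, which is fine within a single run.
        return [min(hash((s, t)) for t in tokens(tup)) for s in seeds]

    random.seed(0)
    seeds = [random.getrandbits(32) for _ in range(16)]
    a = minhash(("A", "left", "male", "dog"), seeds)
    b = minhash(("A", "left", "male", "cat"), seeds)
    c = minhash(("C", "right", "female", "cat"), seeds)

    def agreement(u, v):
        # The fraction of matching components estimates Jaccard similarity.
        return sum(p == q for p, q in zip(u, v)) / len(u)

    print(agreement(a, b), agreement(a, c))  # the similar pair scores higher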

Another solution is, given a similarity function (which returns the similarity of two tuples), to compare the new tuple with each one in the learning set and to compute the value by weighting their respective values with the similarities. The Jaccard index should work well as a similarity function, but you may design something more sophisticated.
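
A minimal sketch of this similarity-weighting idea, using the nominal parts only (folding the real argument into the similarity is left out for brevity):

    def jaccard(t1, t2):
        # Jaccard index over position-aware nominal values.
        s1 = {(i, v) for i, v in enumerate(t1)}
        s2 = {(i, v) for i, v in enumerate(t2)}
        return len(s1 & s2) / len(s1 | s2)

    def predict(query, training):
        # Similarity-weighted average over (nominal_tuple, output) pairs.
        pairs = [(jaccard(query, t), out) for t, out in training]
        total = sum(w for w, _ in pairs)
        return sum(w * out for w, out in pairs) / total if total else None

    training = [(("A", "left", "male", "dog"), 3.453),
                (("A", "top", "male", "dog"), 6.137),
                (("C", "right", "female", "cat"), 1.456)]
    print(predict(("C", "top", "female", "cat"), training))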

Adam
  • 140