For a recommendation system I'm using cosine similarity to compute similarities between items. However, for items with only small amounts of data I'd like to bin them under a general "average" category (in the general, not the mathematical, sense). To accomplish this I'm currently trying to create a synthetic observation that represents that middle-of-the-road point.

So for example if these were my observations (rows are observations, cols are features):

[[0, 0, 0, 1, 1, 1, 0, 1, 0],
 [1, 0, 1, 0, 0, 0, 1, 0, 0],
 [1, 1, 1, 1, 0, 1, 0, 1, 1],
 [0, 0, 1, 0, 0, 1, 0, 1, 0]]

A strategy where I simply take the average of each feature across all observations would generate a synthetic data point like the one below, which I'd then append to the matrix before doing the similarity calculation.

[ 0.5 ,  0.25,  0.75,  0.5 ,  0.25,  0.75,  0.25,  0.75,  0.25]
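
In code, the strategy I'm describing would look roughly like this (a sketch using NumPy and scikit-learn's cosine_similarity; the variable names are just for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Observations: rows are items, columns are binary features.
X = np.array([[0, 0, 0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 0, 1, 0, 1, 1],
              [0, 0, 1, 0, 0, 1, 0, 1, 0]])

# Synthetic "average" observation: the per-feature mean.
synthetic = X.mean(axis=0)  # -> [0.5, 0.25, 0.75, 0.5, 0.25, 0.75, 0.25, 0.75, 0.25]

# Append it as an extra row before computing pairwise cosine similarities.
X_aug = np.vstack([X, synthetic])
sims = cosine_similarity(X_aug)

# The last row/column of sims holds each item's similarity to the synthetic point.
print(sims[-1, :-1])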

While this might work well with certain similarity metrics (e.g. L1 distance), I'm sure there are much better approaches for cosine similarity. At the moment, though, I'm having trouble reasoning my way through angles between vectors in high-dimensional space.

Any ideas?

eric chiang

3 Answers

You are doing the correct thing. Technically, this averaging computes the centroid, in Euclidean space, of the set of N points. The centroid works well with cosine similarity (the cosine of the angle between normalized vectors); see, for example, the Rocchio algorithm.
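
For illustration, here is a small sketch of the Rocchio-style idea with made-up groups and a made-up query vector: build one centroid per group and assign a new vector to the most cosine-similar centroid.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up groups of observations (rows are observations, columns are features).
group_a = np.array([[1, 0, 1, 0],
                    [1, 1, 1, 0],
                    [1, 0, 1, 1]])
group_b = np.array([[0, 1, 0, 1],
                    [0, 1, 1, 1],
                    [0, 0, 0, 1]])

# Rocchio-style prototypes: one centroid per group.
proto_a = group_a.mean(axis=0)
proto_b = group_b.mean(axis=0)

# A new observation is assigned to whichever prototype it is most similar to.
x = np.array([1, 0, 1, 0])
print(cosine(x, proto_a), cosine(x, proto_b))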

Debasis

As @Debasis writes, what you are doing is correct. Do not worry about the relative values, as cosine similarity focuses on the angle, not the length, of the vectors.

Note that the averages you compute are actually the proportions of observations in your subset that have each feature. If the subset is sufficiently large, you can interpret them as probabilities.
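
To see this with the matrix from the question (a quick sketch):

import numpy as np

X = np.array([[0, 0, 0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1, 0, 0],
              [1, 1, 1, 1, 0, 1, 0, 1, 1],
              [0, 0, 1, 0, 0, 1, 0, 1, 0]])

# For 0/1 features, the column mean is the fraction of observations
# that have each feature, i.e. an empirical probability for that feature.
print(X.sum(axis=0) / X.shape[0])  # identical to X.mean(axis=0)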

damienfrancois

Depending on the task, there may be a better strategy than simply averaging the vectors.

Imagine the following scenario: we have two point clouds of a particular shape in a multidimensional space, each representing the presence of particular orthogonal features. These could be, for example, particular colours sampled from the surface of a flower.

If you average complementary colours, you get something grey in between, even though both samples originally had very vivid, saturated colours. As a result, the centroid of a perceptibly similar object can end up farther from the centroid of the search target than the centroid of a quite different object.

So if, for each point in the target's features, you find the maximum similarity among the sample's features and then average those maxima, this problem is avoided.

I found that I get better results in similarity search this way. The formula is as follows:

average over i ( maximum over j ( cosine_similarity(a[i], b[j]) ) )
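
A rough sketch of that formula, assuming a and b are matrices whose rows are the feature vectors of the two objects (the test data below is made up):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def avg_max_similarity(a, b):
    # Pairwise cosine similarities between rows of a and rows of b,
    # then take the best match for each row of a and average those maxima.
    sims = cosine_similarity(a, b)  # shape: (len(a), len(b))
    return sims.max(axis=1).mean()

# Made-up point clouds standing in for two objects' sampled features.
target = np.array([[1.0, 0.0, 0.2],
                   [0.1, 1.0, 0.0]])
sample = np.array([[0.9, 0.1, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.5, 0.5, 0.5]])
print(avg_max_similarity(target, sample))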

There could be an even better strategy. For example, you might want to apply a more sophisticated activation function to the pairs of points instead of max, and a better aggregation function than the plain average, too.

[Figure: samples of objects with multiple features]