See any problems with this search-specific homomorphic encoding strategy?

Question

I'm imagining this for cloud-stored, client-encrypted email, where you want to run a string search across messages without having to download and decrypt every stored message.

[Note that inspiration for this approach is taken from Jones & Mewhort, 2007; I otherwise assert no formal rights to the algorithm, but appreciate props if it's a good idea : O) ]

Initialization steps:

1. For each Unicode character, the client generates a D-dimensional vector (where D is large, >1024) in which each element is a 64-bit float drawn from a Gaussian with mean zero and standard deviation 1/sqrt(D). Call these the "character vectors".
2. The client also generates a "permutation vector" representing a random permutation of the indices 1 through D.
3. All vectors are encrypted and stored by the client for later decryption and use while encrypting messages as described below.
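
As a rough sketch of the initialization steps above, here is how it might look in plain Python. The function names (`make_character_vectors`, `make_permutation`) and the three-letter alphabet are made up for illustration; a real client would cover whatever character set it needs, and would still have to encrypt and store the result. `random.SystemRandom` draws from the OS entropy source rather than a single seed.

```python
import math
import random

def make_character_vectors(alphabet, dim, rng):
    """One Gaussian vector per character: mean 0, std 1/sqrt(dim)."""
    sigma = 1.0 / math.sqrt(dim)
    return {ch: [rng.gauss(0.0, sigma) for _ in range(dim)] for ch in alphabet}

def make_permutation(dim, rng):
    """A random permutation of the indices 0..dim-1 (0-based here)."""
    perm = list(range(dim))
    rng.shuffle(perm)
    return perm

rng = random.SystemRandom()   # OS-backed randomness, not a single seed
D = 2048                      # illustrative; the post only asks for D > 1024
char_vectors = make_character_vectors("abc", D, rng)  # toy alphabet (assumption)
permutation = make_permutation(D, rng)
```

With this standard deviation each character vector has an expected squared length of 1, which keeps the cosine similarities used later well scaled.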

When encrypting a message, a standard strong encryption (RSA, etc.) of the message is computed, as well as its associated "hologram", computed as follows:

1. For every bigram of characters, the associated character vectors are combined via circular convolution, made non-commutative (and therefore encoding the bigram's character order) by first shuffling the second vector's elements according to the permutation vector generated during initialization. Thus, if & denotes this non-commutative circular convolution, then a&b is a D-dimensional vector that is very different from (has low cosine similarity with) b&a.
2. Since all vectors generated in step 1 are D-dimensional, they can be summed into a single vector H that holographically encodes the full set of vectors. That is, if a&b contributed to H but c&d did not, then the cosine similarity between H and a&b will be much greater than the cosine similarity between H and c&d. This isn't lossless encoding (circular convolution causes data loss, as does the superposition of bigram vectors into a hologram), but it'll be good enough...
3. Now repeat step 1 for all n-grams (up to a reasonable maximum search string length) in the text (i.e. if the text is abc, do a&b, b&c, (a&b)&c, a&(b&c)), continuing to sum each into H. At the end, H holographically represents the content of the message.
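
The hologram steps above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the names `bind`, `hologram` and `cosine` are made up, D is 512 (smaller than the >1024 suggested above) because this naive convolution is O(D^2), the seed is fixed purely so the demo is repeatable, and longer n-grams are folded left to right only, i.e. (a&b)&c but not a&(b&c).

```python
import math
import random

def bind(a, b, perm):
    """The & operation: circular convolution of a with a permuted copy of b."""
    dim = len(a)
    shuffled = [b[p] for p in perm]          # reorder b by the permutation vector
    out = [0.0] * dim
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(shuffled):
            out[(i + j) % dim] += a_i * b_j  # indices wrap around (circular)
    return out

def hologram(text, vectors, perm, max_n=3):
    """Sum the bound vectors of every n-gram with 2 <= n <= max_n."""
    total = [0.0] * len(perm)
    for start in range(len(text) - 1):
        bound = vectors[text[start]]
        for end in range(start + 1, min(start + max_n, len(text))):
            bound = bind(bound, vectors[text[end]], perm)  # left fold only
            total = [t + x for t, x in zip(total, bound)]
    return total

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

rng = random.Random(0)   # fixed seed for a repeatable demo only
D = 512                  # reduced from >1024 to keep pure Python fast
sigma = 1.0 / math.sqrt(D)
vectors = {ch: [rng.gauss(0.0, sigma) for _ in range(D)] for ch in "abcd"}
perm = list(range(D))
rng.shuffle(perm)

H = hologram("abc", vectors, perm, max_n=2)   # H = a&b + b&c
present = cosine(H, bind(vectors["a"], vectors["b"], perm))
absent = cosine(H, bind(vectors["c"], vectors["d"], perm))
flipped = cosine(H, bind(vectors["b"], vectors["a"], perm))
```

Here `present` should come out well above both `absent` and `flipped`, which is the property the search relies on. An FFT-based convolution would be the obvious replacement for the double loop at realistic D.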

When done, the client sends the encrypted message as well as its associated hologram to the cloud for storage.

Steps for searching for a given search string:

1. Client computes a hologram for the search string exactly as they would for a regular message.
2. Client sends the resulting search string hologram to the server.
3. Server computes the cosine similarity between the search string hologram and each message hologram, yielding a rank-ordered set of messages, the top X of which are sent back to the client.
4. For each message returned from the server, the client decrypts the message and performs a standard plaintext string search.
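
The search steps above might look roughly like this. It is a sketch only: `rank_messages` and `confirm_matches` are hypothetical names, the three-element "holograms" are hand-made toy vectors rather than real ones, and decryption is assumed to have already happened before the client-side check.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_messages(query_hologram, stored_holograms, top_x):
    """Server side: score every stored hologram, keep the top_x best."""
    scored = ((cosine(query_hologram, h), msg_id)
              for msg_id, h in stored_holograms.items())
    return [(msg_id, score) for score, msg_id in heapq.nlargest(top_x, scored)]

def confirm_matches(decrypted_messages, search_string):
    """Client side: exact plaintext search over the decrypted candidates."""
    return [msg_id for msg_id, text in decrypted_messages.items()
            if search_string in text]

# Toy stand-ins for message holograms held by the server (assumption).
stored = {
    "msg-1": [1.0, 0.0, 0.0],
    "msg-2": [0.6, 0.8, 0.0],
    "msg-3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
candidates = rank_messages(query, stored, top_x=2)

# Pretend the client has decrypted the two returned candidates.
decrypted = {"msg-1": "meeting at noon", "msg-2": "lunch meeting moved"}
hits = confirm_matches(decrypted, "noon")
```

The server only ever sees vectors and similarity scores; the final plaintext check on the client filters out false positives from the lossy ranking.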

It's possible that this scheme would get a worthwhile security benefit from having two permutation vectors, one for the left-hand vector and one for the right-hand vector, but I'm not clear yet on whether this would be necessary. I described it using one above for simplicity.

Certainly the initialization would need to be high-entropy, probably avoiding a single random seed, but since it's done only once, this shouldn't be too much of a hardship.

Thoughts?
