I'm imagining this for use in the scenario of cloud-stored client-encrypted email, where, when seeking to do a string search across messages, you don't want to have to download every stored message in order to decrypt and search.
[Note that inspiration for this approach is taken from Jones & Mewhort, 2007; I otherwise assert no formal rights to the algorithm, but appreciate props if it's a good idea : O) ]
Initialization steps:
- For each Unicode character, client generates a D-dimensional vector (where D is large, >1024) where each element is a 64-bit float drawn from a random Gaussian with mean of zero and a standard deviation of 1/sqrt(D). Call these the "character vectors".
- Client also generates a "permutation vector" representing a random permutation of the order of elements 1 through D.
- All vectors are encrypted and stored by the client for later decryption and use while encrypting messages as described below.
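To make the initialization concrete, here is a minimal sketch using only the Python standard library. The `Codebook` class, its method names, and the choice of D = 2048 are my own illustrative assumptions, as is creating character vectors lazily on first use rather than for every Unicode character up front. Encrypting and storing the codebook is left out.

```python
import math
import random

D = 2048  # assumption: the scheme only asks for D > 1024


class Codebook:
    """Hypothetical container for the character vectors and the permutation vector."""

    def __init__(self, d=D, rng=None):
        # SystemRandom draws from the OS entropy source rather than a single seed.
        self.rng = rng if rng is not None else random.SystemRandom()
        self.d = d
        self.sigma = 1.0 / math.sqrt(d)  # standard deviation 1/sqrt(D)
        # Random permutation of the element indices (0-based here, 1..D in the text).
        self.permutation = list(range(d))
        self.rng.shuffle(self.permutation)
        self.vectors = {}

    def vector_for(self, ch):
        """Return the character vector for ch, creating it on first use (assumption)."""
        if ch not in self.vectors:
            # Python floats are 64-bit doubles.
            self.vectors[ch] = [self.rng.gauss(0.0, self.sigma) for _ in range(self.d)]
        return self.vectors[ch]


codebook = Codebook()
vec_a = codebook.vector_for("a")
norm_a = math.sqrt(sum(x * x for x in vec_a))  # close to 1 for large D
```

With a standard deviation of 1/sqrt(D), each character vector has an expected squared length of 1, which keeps the later cosine comparisons well scaled.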
When encrypting a message, a standard strong encryption (RSA, etc.) of the message is computed, as well as its associated "hologram", computed as follows:
- For every bigram of characters, the associated character vectors are combined via circular convolution, made non-commutative (and therefore encoding the bigram's character order) by first shuffling the second vector's elements according to the order prescribed by the permutation vector generated during initialization. Thus, if & is used to denote this non-commutative circular convolution, then a&b is a D-dimensional vector that is very different from (has low cosine similarity with) b&a.
- Since all vectors generated in the first step are D-dimensional, they can be summed into a single vector H that holographically encodes the full set of vectors. That is, if a&b contributed to H but c&d did not, then the cosine similarity between H and a&b will be much greater than the cosine similarity between H and c&d. This isn't lossless encoding (circular convolution causes data loss, as does the superposition of bigram vectors into a hologram), but it'll be good enough...
- Now repeat the first step for all n-grams (up to a reasonable maximum search string length) in the text (i.e. if the text is abc, do a&b, b&c, (a&b)&c, a&(b&c)), continuing to sum each into H. At the end, H holographically represents the content of the message.
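The hologram steps above can be sketched as follows, again with only the standard library. The function names are hypothetical, the convolution is the plain O(D^2) form (an FFT-based version would be much faster), and the handling of n-grams longer than three characters (one fully left-nested and one fully right-nested vector per n-gram) is my assumption, since the text only spells out the trigram case. The demo uses D = 512 and a fixed seed purely to keep it quick and repeatable.

```python
import math
import random


def circular_convolution(x, y):
    """z[k] = sum over j of x[j] * y[(k - j) mod D]."""
    d = len(x)
    return [sum(x[j] * y[(k - j) % d] for j in range(d)) for k in range(d)]


def bind(x, y, perm):
    """The '&' operation: shuffle the right-hand vector, then convolve."""
    return circular_convolution(x, [y[p] for p in perm])


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def add_into(total, v):
    for i, value in enumerate(v):
        total[i] += value


def hologram(text, char_vectors, perm, max_n=3):
    """Sum the bound vectors of every n-gram (2 <= n <= max_n) in text."""
    h = [0.0] * len(perm)
    chars = [char_vectors[c] for c in text]
    for start in range(len(chars)):
        # Left-nested n-grams: a&b, (a&b)&c, ...
        v = chars[start]
        for nxt in chars[start + 1:start + max_n]:
            v = bind(v, nxt, perm)
            add_into(h, v)
    for end in range(len(chars)):
        # Right-nested n-grams with n >= 3: a&(b&c), ...
        # Bigrams are skipped here because the loop above already added them.
        v = chars[end]
        first = max(end - max_n + 1, 0)
        for k, idx in enumerate(range(end - 1, first - 1, -1)):
            v = bind(chars[idx], v, perm)
            if k >= 1:
                add_into(h, v)
    return h


# Small demo; D = 512 is below the suggested D > 1024 only for speed.
rng = random.Random(7)  # fixed seed for a repeatable demo, not for real use
d = 512
vecs = {c: [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for c in "abcd"}
perm = list(range(d))
rng.shuffle(perm)

ab = bind(vecs["a"], vecs["b"], perm)
sim_ab_ba = cosine_similarity(ab, bind(vecs["b"], vecs["a"], perm))  # expected near 0
H = hologram("abc", vecs, perm)
sim_in = cosine_similarity(H, ab)  # a&b contributed to H
sim_out = cosine_similarity(H, bind(vecs["c"], vecs["d"], perm))  # c&d did not
```

For the text abc, `hologram` sums exactly the four vectors listed above, so `sim_in` should land around 0.5 while `sim_out` stays near zero.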
When done, client sends the encrypted message as well as its associated hologram to the cloud for storage.
Steps for searching for a given search string:
- Client computes a hologram for the search string exactly as they would for a regular message
- Client sends the resulting search string hologram to the server
- Server computes the cosine similarity between the search string hologram and each message hologram, yielding a rank-ordered list of messages, the top X of which are sent back to the client
- For each message returned from the server, the client decrypts the message and performs a standard plaintext string search.
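The last two search steps might look roughly like this. `top_matches` stands in for the server side and `confirm_matches` for the client side; both names, the message IDs, and the toy 3-dimensional "holograms" are hypothetical, and decryption is assumed to have already happened before the plaintext check.

```python
import heapq
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero hologram as matching nothing
    return dot / (norm_u * norm_v)


def top_matches(query_hologram, stored_holograms, x):
    """Server side: IDs of the x stored holograms most similar to the query."""
    scored = (
        (cosine_similarity(query_hologram, h), message_id)
        for message_id, h in stored_holograms.items()
    )
    return [message_id for _, message_id in heapq.nlargest(x, scored)]


def confirm_matches(plaintexts, needle):
    """Client side: ordinary substring search over the decrypted candidates."""
    return [message_id for message_id, text in plaintexts.items() if needle in text]


# Toy data: real holograms would be D-dimensional.
stored = {"m1": [1.0, 0.0, 0.0], "m2": [0.0, 1.0, 0.0], "m3": [1.0, 1.0, 0.0]}
candidates = top_matches([1.0, 0.1, 0.0], stored, 2)  # ['m1', 'm3']

# Assume the client has decrypted the returned candidates into these plaintexts.
decrypted = {"m1": "lunch at noon", "m3": "launch delayed"}
hits = confirm_matches(decrypted, "lunch")  # ['m1']
```

The final plaintext check is what removes false positives: the cosine ranking is only approximate, so a high-ranked message is a candidate, not a confirmed match.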
It's possible that this scheme would get a worthwhile security benefit from having two permutation vectors, one for the left-hand vector and one for the right-hand vector, but I'm not clear yet on whether this would be necessary. I described it using one above for simplicity.
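For comparison, a two-permutation variant would only change the binding step. This is a sketch under my own naming (`bind_two`, `perm_left`, `perm_right`), with a small D and a fixed seed for a quick, repeatable demo; it says nothing about whether the variant is actually more secure.

```python
import math
import random


def circular_convolution(x, y):
    """z[k] = sum over j of x[j] * y[(k - j) mod D]."""
    d = len(x)
    return [sum(x[j] * y[(k - j) % d] for j in range(d)) for k in range(d)]


def bind_two(x, y, perm_left, perm_right):
    """Shuffle each operand with its own permutation, then convolve."""
    left = [x[p] for p in perm_left]
    right = [y[p] for p in perm_right]
    return circular_convolution(left, right)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


rng = random.Random(0)  # fixed seed for a repeatable demo, not for real use
d = 256
a = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
b = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
perm_left = list(range(d))
perm_right = list(range(d))
rng.shuffle(perm_left)
rng.shuffle(perm_right)

ab = bind_two(a, b, perm_left, perm_right)
ba = bind_two(b, a, perm_left, perm_right)
order_similarity = cosine_similarity(ab, ba)  # expected near 0: order still matters
```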
Certainly the initialization would need to be high-entropy, probably avoiding a single random seed, but since it's done only once, this shouldn't be too much of a hardship.
Thoughts?