
I came across a marketing video here. The vendor claims to perform AES encryption and tokenization of sensitive data at the corporate gateway, before it leaves the company firewall destined for the public cloud. The keys and the token <=> plaintext mapping tables therefore remain at the gateway, while the ciphertext/tokens live in the public cloud.

The purpose of this question is to understand the real security of the product, rather than just what is claimed. The video link is there, but I've put screenshots of the relevant portions below. I did some basic analysis simply by eyeballing the video, and the security seems rather poor. You can clearly see, just from their public demo video at the 2:19 mark, that

  • all strings are dropped to lower case and then tokenized
  • letter-casing "flips" are then encoded (a bitfield?) and appended to the token (e.g. "flip case of 1st character")
  • the token is wrapped in pre/post-ambles or delimiters (a rough sketch of this follows below)
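
Putting those three observations together, here is a minimal sketch (in Python) of what such a tokenizer might look like. The delimiter characters, the case-bitfield layout, and the token naming are all my guesses for illustration; nothing here is confirmed by the vendor.

```python
# Hypothetical reconstruction of the observed tokenization; delimiters,
# bitfield layout, and token naming are guesses, not the vendor's scheme.
def tokenize(word: str, dictionary: dict) -> str:
    lower = word.lower()
    # Record which letters had their case "flipped" (i.e., were upper case).
    case_bits = sum(1 << i for i, c in enumerate(word) if c.isupper())
    # Look up (or create) a token for the lower-cased word.
    token = dictionary.setdefault(lower, f"T{len(dictionary):04d}")
    # Wrap token plus case information in pre/post-ambles.
    return f"[[{token}:{case_bits:x}]]"

d = {}
print(tokenize("Assets", d))   # [[T0000:1]]  ("flip case of 1st character")
print(tokenize("assets", d))   # [[T0000:0]]  (same token, different case bits)
```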

[Screenshot from the demo video at the 2:19 mark]

Full resolution: https://i.sstatic.net/dRnCI.jpg

NOTE: The image is being referenced lawfully. Considerations include, but are not limited to, sections 107 through 118 of the copyright law, Title 17, U.S. Code. In particular, section 107 provides for lawful reproduction for the purposes of criticism, research, and commentary.

The supposed security of such a system is that, despite access to the "data in the cloud", no information is revealed. However, even without access to the clear-text counterparts, it should be fairly trivial to sort the "secured" tokens, drop the case-encoding information, perform a frequency analysis, and map the result onto regular English word-frequency histograms. Presumably this is how they implement their "search and sort, even inside encrypted data" feature.
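
To make that concrete, here is a minimal sketch of the frequency attack, assuming the hypothetical token format from the sketch above; the English word-frequency list is truncated for brevity.

```python
# Frequency attack sketch: strip delimiters and case bits, count token
# occurrences, and align the ranking against an English word-frequency list.
from collections import Counter

ENGLISH_BY_FREQUENCY = ["the", "of", "and", "to", "in", "a"]  # ...and so on

def strip_case_bits(token: str) -> str:
    # Drop the pre/post-ambles and the appended case encoding.
    return token.strip("[]").split(":")[0]

def guess_mapping(observed_tokens):
    counts = Counter(strip_case_bits(t) for t in observed_tokens)
    ranked = [tok for tok, _ in counts.most_common()]
    # The most frequent token is probably "the", the next "of", and so on.
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

sample = ["[[T0007:0]]", "[[T0001:0]]", "[[T0007:1]]",
          "[[T0003:0]]", "[[T0007:0]]", "[[T0001:0]]"]
print(guess_mapping(sample))   # {'T0007': 'the', 'T0001': 'of', 'T0003': 'and'}
```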

My questions are:

  1. How serious do you consider such a security system?
  2. How many words/tokens of such "secured tokenized" data are needed to statistically decipher everything (with say, 90-95% confidence) assuming access to only tokenized data?
  3. Is there any particular reason to encode the tokens in Chinese? I would do that to visually consume the space of one character (in browsers, for humans) while still encoding 2-3 bytes per UTF-8 character.
SEJPM
DeepSpace101

2 Answers


Chinese "looks" encrypted. I believe it to be a marketing gimmick; they want to be able to show "encrypted text" that people will unconsciously associate with notions of unreadability and utter obscurity -- and nothing beats Chinese for that (except in China, of course; if they want to sell their product to China they will have to use Greek or Sumerian scripts).

There are more than 70000 Chinese characters in Unicode, but only 20000 or so in the "first plane" (the Basic Multilingual Plane, whose code points fit in 16 bits), and those are the most likely to be supported by existing software. That still allows for about 14 bits' worth of data in each glyph (2^14 = 16384 values), which is not bad.
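
As an illustration, assuming the CJK Unified Ideographs block (U+4E00 upward) is used as the alphabet, packing 14 bits per glyph is straightforward; the product's actual encoding is of course unknown.

```python
# Sketch: carrying 14 bits of payload per BMP glyph, using the
# CJK Unified Ideographs block (U+4E00-U+9FFF, roughly 20,000 code points).
BASE = 0x4E00
BITS = 14                      # 2**14 = 16384 values fit inside the block

def encode(value: int) -> str:
    """Map a 14-bit value onto one Chinese-looking glyph."""
    assert 0 <= value < (1 << BITS)
    return chr(BASE + value)

def decode(glyph: str) -> int:
    return ord(glyph) - BASE

v = 0b10101010101010           # 14 bits of payload
g = encode(v)
print(g, hex(ord(g)), decode(g) == v)   # one glyph (U+78AA), True
```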

As your analysis shows, this appears to be a simple word-based substitution cipher, which will be broken by statistical analysis. People use perhaps 2000 common words, so I suppose that observing a corpus of, say, 30000 tokens ought to be enough to decrypt the whole thing, especially since there is an awful lot of context (on your screenshot we even have the pictures of the individuals involved!). The Minoan "Linear B" script was unravelled with less data than that.

No, it is not secure, and hardly "serious".

Tom Leek

It is possible that the tokens are actually encrypted, and then remapped onto the Chinese character set so that websites handle them correctly. The length of each Chinese block does indeed seem compatible with that of an AES block.

The problem is that, as the test you made shows, the encryption method is equivalent to ECB: identical plaintexts get encrypted to identical ciphertexts. This makes it not much better than a substitution cipher, and leads to the vulnerability Tom Leek was speaking of, even if technically there is AES crypto involved.
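
For anyone who wants to see the effect, here is a minimal sketch (using the third-party `cryptography` package) of deterministic, per-word AES encryption; the zero-padding and key handling are illustrative only, not the product's actual scheme.

```python
# ECB-style determinism: the same word always encrypts to the same block,
# so patterns in the plaintext survive into the "encrypted" output.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(16)

def encrypt_word(word: str) -> bytes:
    padded = word.encode().ljust(16, b"\x00")   # naive zero-padding to one block
    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return enc.update(padded) + enc.finalize()

print(encrypt_word("assets").hex())
print(encrypt_word("assets").hex())   # identical to the line above
print(encrypt_word("merger").hex())   # different word, different block
```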

"AES ECB" gets referred to in this image of this funny and informative page.

Notice that "acquisition(s).", the longest whitespace-delimited token (the last of John Nelson's sentence), looks like it is mapped to a glyph group twice as long as all the others. And (making allowance for my complete ignorance of Chinese script) I'd say that the number of distinct glyphs dovetails with the "20,000 or so" code points mentioned by Tom Leek. If it were a dictionary substitution, it is not believable that they would need more than one glyph group (at least 10^30 possible values) to encode the dictionary.

The anomaly might be that while "Commodities", at 11 bytes and therefore 88 bits, fits in a single AES block, "acquisition(s)." is fifteen bytes and should also fit in a single AES block (even allowing for a zero terminator). It could be interesting to check whether different strings of the same, suitable length turn out to have an identical second glyph group or not (this would indicate a further vulnerability, possibly even leading to secret-key disclosure).
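
The block arithmetic is easy to check; the sketch below assumes AES's 16-byte block size and PKCS#7-style padding (which always adds at least one byte), since the product's real padding scheme is unknown.

```python
# Rough block-count arithmetic for the two tokens discussed above.
BLOCK = 16                               # AES block size in bytes

def padded_blocks(data: bytes) -> int:
    # PKCS#7 always adds at least one padding byte, so a 16-byte input
    # spills over into a second block.
    return len(data) // BLOCK + 1

for token in (b"Commodities", b"acquisition(s).", b"acquisition(s).\x00"):
    print(f"{token!r:22} {len(token):2d} bytes -> {padded_blocks(token)} AES block(s)")

# 'Commodities'        (11 bytes) -> 1 block
# 'acquisition(s).'    (15 bytes) -> 1 block
# with a trailing NUL  (16 bytes) -> 2 blocks, which would match the
#                                    doubled glyph group in the screenshot
```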

On the other hand, if one wanted to be able to run a "blind" word-based search across an encrypting firewall, I'm not sure there would be many alternatives: one would have to choose between safety and searchability. Suppose a user inside the cryptowall enters 'assets' and Google receives 'zxXXXXXxz'. Google can find that string on a Facebook page, because someone inside the cryptowall once wrote 'assets' on Facebook and Facebook received 'zxXXXXXxz'; the Google results page displaying 'zxXXXXXxz' would then be back-translated to 'assets' as it re-enters the cryptowall. For this to work, the mapping between 'assets' and 'zxXXXXXxz' has to be either permanent, and therefore vulnerable to statistical analysis (searchable, but not safe), or randomized (safe, but totally unsearchable).
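
Here is a minimal sketch of that trade-off, using HMAC-based tokens purely for illustration (this is not the vendor's scheme): the deterministic variant can be matched by a remote search engine but also by a statistical attacker, while the randomized variant leaks nothing and matches nothing.

```python
# Deterministic vs. randomized tokenization: searchable-but-linkable
# versus unlinkable-but-unsearchable.
import hmac, hashlib, os

KEY = os.urandom(32)            # stays inside the "cryptowall"

def deterministic_token(word: str) -> str:
    # Same input -> same output: a remote service can match it, but so can
    # anyone doing frequency analysis on enough tokens.
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).hexdigest()[:16]

def randomized_token(word: str) -> str:
    # Fresh random nonce per occurrence: nothing to correlate, but also
    # nothing a remote search engine can ever match against.
    nonce = os.urandom(8)
    mac = hmac.new(KEY, nonce + word.lower().encode(), hashlib.sha256).digest()
    return (nonce + mac[:8]).hex()

print(deterministic_token("assets"), deterministic_token("assets"))  # identical
print(randomized_token("assets"), randomized_token("assets"))        # differ
```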

LSerni