Implementing symmetric encryption algorithms with whole words

Question

Scenario

I have some unique requirements to encrypt text:

1. symmetric encryption;
2. the encrypted text must be valid text (meaning that humans can read it, not just “gibberish”);
3. I cannot replace the key from time to time (meaning no temporary / one-time key algorithms);
4. each word must always be encrypted to the same word (and only to one word).

These needs are important to support the following use cases:

a third party will get the encrypted text and will be able to index it for searches (effective indexing needs basic morphology, this is why need number 2 is important);
when someone sends this third party one or more encrypted words, it will be able to find the texts that are associated with these words (this is why needs number 3 and 4 are important).

But I also have some flexibility in the system:

I can always add more words to a dictionary (if needed in the encryption / decryption process);
both sender and receiver have access to this dictionary (always synced).

Example scheme

So basically I can use some sort of dictionary to map words, and replace each word with its encrypted word (like a Caesar cipher, but for words).

Every time I encounter a word I can add it to the dictionary as both original word and encrypted word.

For example:

Word: “I” Encrypted word: “Cat”
Word: “Love” Encrypted word: “Dog”
Word: “You” Encrypted word: “Home”

Now the text “I Love You” will be encrypted to “Cat Dog Home”. This fits all the needs that I have. But as you probably know, this is very weak encryption.

To make it stronger I was thinking about randomly inserting words that I know are fake (no mapping to real words); this (maybe?) prevents statistical analysis of the language (common words etc.).

For example: - Encrypted word: “Boy” is fake. - Encrypted word: “Test’s” is fake.

Now the text “I Love You” will be encrypted to “Cat Test’s Boy Dog Test’s Home Boy”. This fits all the needs that I have. But I believe this is still weak, so I’m looking for something better.
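A minimal Python sketch of the scheme described above, only to make the mechanics concrete. The function names and the decoy_rate parameter are illustrative assumptions; the word map and decoys are taken from the example, and nothing here is meant to be secure.

```python
import random

# Word map from the example above; DECOYS are words with no plaintext counterpart.
WORD_MAP = {"I": "Cat", "Love": "Dog", "You": "Home"}
DECOYS = ["Boy", "Test's"]
REVERSE_MAP = {enc: plain for plain, enc in WORD_MAP.items()}


def encrypt(text: str, decoy_rate: float = 0.5) -> str:
    """Replace each word via WORD_MAP and randomly sprinkle decoy words."""
    out = []
    for word in text.split():
        out.append(WORD_MAP[word])  # KeyError if the word is not in the dictionary
        if random.random() < decoy_rate:
            out.append(random.choice(DECOYS))
    return " ".join(out)


def decrypt(text: str) -> str:
    """Decoys are skipped because they have no reverse mapping."""
    return " ".join(REVERSE_MAP[w] for w in text.split() if w in REVERSE_MAP)
```

Note that real words still map one-to-one, so their relative frequencies survive the decoys.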

Symmetric ciphers

I’ve read about symmetric encryption algorithms like Blowfish, Twofish and AES-256, and from my (limited) understanding they are all algorithms that replace one byte with another, and they are considered to be strong encryption methods.

Is it possible to implement a symmetric cipher, but instead of operating on bytes it will operate on words?

NOTE

After further research I've come to the understanding that this question is more misleading than helpful.

The question asks about encryption but directs to obfuscation - and so may some of the answers.

I believe that noobs (like me) will be more distracted and misled than gaining anything from this.

score 8 · Accepted Answer · answered Nov 10 '16 at 13:36

Your best bet is, I think, to do it the other way round.

1. Encrypt securely using a proven algorithm.
2. Make the encrypted text human readable.

For example: AES-256 encode "Squeamish Ossifrage" using some key. Get, say, 512 bits of absolute gibberish. That is 64 bytes of 8 bits each; each byte can have one of 256 values.

Get a dictionary of 256 words (possibly, short words?). Read the 64 bytes, and for each value transform it into the appropriate word, giving you a "sentence" of 64 words:

Cat Hue Way Ten One Bat Bob Nod Ben Elk Joe Sex Van Eke ...

To decode, read in each word, and look for it in the dictionary. It is the 42nd word ==> read it as byte with a value of 42. Put all the bytes one after the other and get back the encrypted text, AES-256 decode it and Bob's your uncle.
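The byte-to-word step above could look roughly like this in Python. This is a sketch under assumptions: the 256-entry WORDS list is generated synthetically (consonant-vowel-consonant syllables) as a stand-in for a curated list of real short words, and os.urandom stands in for the AES-256 ciphertext, which would come from a vetted crypto library.

```python
import os
from itertools import product

# Hypothetical 256-word list: 8 consonants x 4 vowels x 8 consonants = 256 entries.
# A real deployment would use a curated list of actual short words.
WORDS = ["".join(p).capitalize() for p in product("bdfgklmn", "aeio", "bdfgklmn")]
WORD_TO_BYTE = {word: value for value, word in enumerate(WORDS)}


def bytes_to_words(data: bytes) -> str:
    """Render each byte as the word at that index in the list."""
    return " ".join(WORDS[b] for b in data)


def words_to_bytes(sentence: str) -> bytes:
    """Look each word up and recover its byte value."""
    return bytes(WORD_TO_BYTE[w] for w in sentence.split())


ciphertext = os.urandom(64)           # placeholder for real AES-256 output
sentence = bytes_to_words(ciphertext)  # 64 three-letter words
assert words_to_bytes(sentence) == ciphertext
```

The encoding adds no security of its own; all of it comes from the cipher applied beforehand.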

This way, approximately 100 bytes of text become 60 bytes of encrypted binary code and those become 240 bytes of rehumanized code (each word is 3 letters plus one space). Or 180 bytes if you camelcase it: BobTenHueNitPinZoo... . You've got an overall expansion of about 1:2 or 1:3, which is quite reasonable.

And at the core you've got AES-256. Nobody wants to mess with AES :-)

Word becomes word

You want 'Cat' to always be encrypted as 'Dog'. This presents two problems. In the first place you either need to specify all the possible input words, or accept that some words will become gibberish; possibly pronounceable gibberish (there are algorithms for that) but still gibberish.

For example, you can use a dictionary together with a simple scrambling algorithm. When you find word #1138 in the cleartext, encrypt the number 1138 so that it becomes another number within the same dictionary, say 31794. This can be done with bit swapping if the dictionary size is an integer power of two, for example 65536 words. Word 1138 ("Horse") becomes word 31794 ("Staple").
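One way to read "bit swapping" is a key-dependent permutation of the 16 bit positions of the word index. The sketch below assumes that reading; the function names and the integer key are illustrative, and Python's random module is used only to derive a repeatable permutation, not as a secure generator.

```python
import random

BITS = 16  # dictionary of 2**16 = 65536 words


def make_bit_permutation(key: int) -> list[int]:
    """Derive a repeatable permutation of bit positions from the key."""
    positions = list(range(BITS))
    random.Random(key).shuffle(positions)
    return positions


def scramble(index: int, perm: list[int]) -> int:
    """Move bit number src of the index to position perm[src]."""
    out = 0
    for src, dst in enumerate(perm):
        if (index >> src) & 1:
            out |= 1 << dst
    return out


def unscramble(index: int, perm: list[int]) -> int:
    """Inverse of scramble: move bit perm[src] back to position src."""
    out = 0
    for src, dst in enumerate(perm):
        if (index >> dst) & 1:
            out |= 1 << src
    return out


perm = make_bit_permutation(key=2016)
assert unscramble(scramble(1138, perm), perm) == 1138
```

A pure bit permutation always maps index 0 to 0 and preserves the number of set bits, which is one more reason this counts as scrambling rather than encryption.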

What if a word is not in the dictionary? Then you need to encrypt it as a different word that is surely not in the dictionary (or you wouldn't be able to decrypt it). So Cat becomes Dog, Horse becomes Staple, Battery becomes Correct... and rumplestiltskin becomes xyzbatelkjoesexvanekewayhuejimsixnoddandinsupfoxfenjoeputits - where the xyz prefix warns the decoder that what follows is an encrypted sequence.

Or, if you can store the new word in the dictionary, you can simply index it by the first 32 bits of its MD5 (the full digest is 128 bits), that is 4 bytes, or 4 three-letter words, giving you xykjimsuptitbat. The xyk prefix now tells the decoder that what follows is a 32-bit truncated MD5 hash. Sender and receiver must agree on the dictionaries, and now they also need to keep them synced.
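A rough sketch of the hash-based token, assuming the index is the first 4 bytes of the MD5 digest and that each byte is rendered with a 256-entry list of three-letter words. The WORDS list here is synthetic and the escape_token name is made up for illustration.

```python
import hashlib
from itertools import product

# Hypothetical 256-entry list of three-letter "words" (one per byte value).
WORDS = ["".join(p) for p in product("bdfgklmn", "aeio", "bdfgklmn")]


def escape_token(word: str, prefix: str = "xyk") -> str:
    """Build prefix + 4 three-letter words from the first 32 bits of MD5(word)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()[:4]
    return prefix + "".join(WORDS[b] for b in digest)


token = escape_token("rumplestiltskin")
# The hash cannot be reversed, so both sides must store token -> word.
shared_dictionary = {token: "rumplestiltskin"}
assert shared_dictionary[token] == "rumplestiltskin"
```

With only 32 bits, two different words can collide, so the shared dictionary would also need a rule for that case.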

Your encrypted sentence would become something like

Fence, xykjimsuptitbat!
Normal bog nuclear cascade iron irksome
Welcome japanese brillig sympathetic
Anathematize copper bleat irksome kill airplane
Severn'emporium zip cascade enormous

And here's the second (and possibly third) problem: on long texts, some words will repeat and lend themselves to analysis. Here, irksome and cascade are present twice each, which makes one suspect that they (or at least one of them) might be a common English word such as "it", "the", "of", or something like that. Just as database applications need to be wary of little Bobby Tables, this kind of encryption scheme needs to fear young Etaoin.
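The repetition is easy to spot mechanically. A small Python sketch that counts word frequencies in the sample ciphertext above (the tokenization rule, lowercase runs of letters, is an assumption):

```python
import re
from collections import Counter

CIPHERTEXT = """Fence, xykjimsuptitbat!
Normal bog nuclear cascade iron irksome
Welcome japanese brillig sympathetic
Anathematize copper bleat irksome kill airplane
Severn'emporium zip cascade enormous"""

# Split on anything that is not a letter and ignore case.
tokens = re.findall(r"[a-z]+", CIPHERTEXT.lower())
counts = Counter(tokens)

# Words seen more than once are candidates for common plaintext words.
repeated = {word for word, n in counts.items() if n > 1}
print(sorted(repeated))
```

On a longer text the same count, compared against a table of common English word frequencies, starts to reveal the mapping.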

Also, you know which words are not common (by their prefix), so that you can tell between a normal sentence and 'Twas brillig, and the slithy toves... at a glance.

Notice the punctuation is kept, therefore it is quite easy to surmise that emporium is either 'll, 'm, 's, 't or 've; not too many English words contain a quote sign (a dozen? Twenty? Still too few).

Unless you encrypt only very short texts, and only rarely, an attacker will (relatively) quickly discover that xykjimsuptitbat actually means supercalifragilisticexpialidocious. In other words, this is not encryption -- it is "mere" obfuscation.

So... why do you have such a "dangerous" requirement?

I'll go out on a limb and guess that you need this to be able to run searches on the encrypted texts. There are techniques that allow you to do this securely, but they all exploit some features of the search being performed; in other words, there is no one-size-fits-all solution that is also proved and secure (i.e. demonstrably at least as secure as some proven, or at least widely adopted, solution). But if you supply some more information we might be able to help.

Or, of course, you may need it for some other reason altogether - but without knowing the why, one is hard put to come out with a how.

Some search schemes

There are several possible "searches". First of all, let us consider that all searches by necessity disclose some information, and if you allow powerful enough searches, this is exactly equivalent to having the text unencrypted (you basically run a (sort of) Hangman game with the ciphertext, which is called an oracle attack).

Now a "search" could be the operation of querying "What records contain a particular keyword among a given set?", or "What records contain this arbitrary keyword?", or "What records contain this arbitrary user supplied text?", or even "What records match this regex?". There has been quite some work on the matter; try googling for 'PEKS'.

The easiest (and most secure) case is that where you have a fixed set of keywords. So in the records

 1: Shall I compare thee to a summer's day?
 2: Thou art more lovely and more temperate.

you can search by thee, day, art or thou, but not by summer.

You can implement this by simply appending to the encrypted text a sequence with the keywords - which could as well be in cleartext:

 1: blindalasbirigudacomefosseantani|Thee|Day
 2: tarapiatapiocalasupercazzolaprematurata|Thou|Art

After a search for Thou we know that it is contained in record 2, and nothing else. And as you see, the more keywords we store, the less security we have. Store all keywords, and you might as well not encrypt at all.
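A minimal sketch of the fixed-keyword scheme in Python. The record blobs are the opaque placeholders from the example above, and the data layout and search function are assumptions made for illustration.

```python
from collections import defaultdict

# record id -> (opaque encrypted blob, searchable keywords stored alongside it)
records = {
    1: ("blindalasbirigudacomefosseantani", {"thee", "day"}),
    2: ("tarapiatapiocalasupercazzolaprematurata", {"thou", "art"}),
}

# Inverted index: keyword -> ids of the records tagged with it.
index = defaultdict(set)
for record_id, (_blob, keywords) in records.items():
    for keyword in keywords:
        index[keyword].add(record_id)


def search(keyword: str) -> set:
    """Return the ids of records tagged with the keyword (case-insensitive)."""
    return index.get(keyword.lower(), set())


print(search("Thou"))    # record 2 only
print(search("summer"))  # empty: "summer" was never stored as a keyword
```

The search never touches the blob, so whatever the index reveals is exactly what the stored keywords reveal.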

The above scheme can be made more space efficient by noticing that the encrypted text is actually never used, so you can just store the keywords.

Otherwise, we're just obfuscating; and therefore we might just as well go for performance.

For example you might XOR all records with a so-called "worm" (we're talking Vigenère cipher here) of, say, 16 bytes, and expand all letters to their hexadecimal equivalents. The resulting text is vulnerable to several possible attacks, but could still be thought of as "encrypted". To search, you XOR the search string with the 16 possible rotations of the worm - SAYHELLOWORLDNOW, AYHELLOWORLDNOWS, YHELLOWORLDNOWSA..., hex-encode it, and use the result as a normal text search. It will be anywhere from 1.x to 16 times slower depending on storage retrieval performance, but it will work.
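The rotation trick can be sketched as follows in Python. This assumes a repeating-key XOR over the UTF-8 bytes of each record and a plain substring search on the hex form; the function names are illustrative. One detail the sketch has to handle: a hit only counts if it starts on a byte boundary and at the position matching the rotation used.

```python
WORM = b"SAYHELLOWORLDNOW"  # the 16-byte "worm" (repeating key)


def xor_with_worm(data: bytes, offset: int = 0) -> bytes:
    """XOR data with the worm, starting at the given rotation of the worm."""
    return bytes(b ^ WORM[(i + offset) % len(WORM)] for i, b in enumerate(data))


def store(text: str) -> str:
    """Obfuscate a record and hex-encode it for storage."""
    return xor_with_worm(text.encode("utf-8")).hex()


def search(stored_hex: str, needle: str) -> bool:
    """Try all 16 rotations of the worm against the stored hex text."""
    raw = needle.encode("utf-8")
    for offset in range(len(WORM)):
        probe = xor_with_worm(raw, offset).hex()
        pos = stored_hex.find(probe)
        while pos != -1:
            # Accept only byte-aligned hits at a position matching this rotation.
            if pos % 2 == 0 and (pos // 2) % len(WORM) == offset:
                return True
            pos = stored_hex.find(probe, pos + 1)
    return False


record = store("Shall I compare thee to a summer's day?")
assert search(record, "thee")
assert not search(record, "winter")
```

As the answer says, this is obfuscation: anyone who can run searches, or who sees enough text, can recover the worm.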

The most promising possibilities exist wherever the actual search is not under control of the Searcher - he may submit texts to the search engine, but can not act on behalf of it, and for instance, is limited in the submission and retrieval speed (in other words, there is no real feasibility of a large scale dictionary/oracle attack). For instance, you could have a MySQL engine with a search function implemented as a closed source, loadable UDF module.

score 3 · Answer 2 · answered Nov 10 '16 at 15:02

First of all, I'd like to point out that, because you are using no chaining algorithm, this cannot be secure encryption (if you reuse the key for a long time, someone with good math skills will be able to decrypt the messages).

Secondly, since the resulting words need to be real English language words, you have to have a dictionary available as part of the encryption program.

You'll need to have the same dictionary sorted in two different ways. One of them has to be random. The other can be alphabetical.

Use a (preferably Cryptographically Secure FWIW) Random Number Generator to put a random value on each word, then you can sort the dictionary by its random value.

To increase security, (against brute force attacks) you could add to the processing power needed to produce a dictionary. I would suggest generating a large pre-defined number of throw-away random numbers. Perhaps 10s, 100s or 1000s; depending on how long you want it to take to produce the random dictionary. Even better than 'throw away numbers', would be to fill in an actual buffer, as that is more efficient. The buffer again can be either 1KB or more or less depending on desired processing time.

Your encryption Key would be the Seed of the random number generator. You could also swap out which dictionary you use for better results.

For each word to be encrypted,
Locate its position in the alphabetical dictionary (i.e. 100 would be the 100th word alphabetically)
Check that position in the randomly sorted dictionary. (i.e. the 100th word in the already-randomly-sorted dictionary)
There is your resulting word.

Reverse the process to decrypt.
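The procedure above could be sketched like this in Python. Assumptions: a tiny example word list, an integer seed as the key, and Python's random module, which is a Mersenne Twister and not cryptographically secure, so it only illustrates the mechanics.

```python
import random


def build_tables(words, seed):
    """Return (encrypt_map, decrypt_map) for a seed-shuffled dictionary."""
    alphabetical = sorted(words)
    shuffled = alphabetical[:]
    random.Random(seed).shuffle(shuffled)  # the seed plays the role of the key
    # The word at alphabetical position i maps to the word at shuffled position i.
    encrypt_map = dict(zip(alphabetical, shuffled))
    decrypt_map = dict(zip(shuffled, alphabetical))
    return encrypt_map, decrypt_map


def translate(text, table):
    """Substitute every whitespace-separated word using the table."""
    return " ".join(table[word] for word in text.split())


WORDS = ["cat", "dog", "home", "i", "interstellar", "love", "quadruple", "you"]
enc, dec = build_tables(WORDS, seed=12345)

secret = translate("i love you", enc)
assert translate(secret, dec) == "i love you"
```

Because each word always maps to the same word, the frequency weaknesses discussed in the other answers apply unchanged.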

One inconvenience you may find is that some short words will become very long. For example I might become interstellar, or Love might become quadruple. You could use separate dictionaries for long words and short words, but this would weaken your encryption a bit.

Also, sometimes there will be words outside of the dictionary. You will have to have some alternate solution to handle this, for example, a special word signalling that you are going to phonetically pronounce each letter, and then another special word to revert to the dictionary-based method.

Finally, you'll need to determine how to format numbers, punctuation, and capitalization. All of these, if preserved in format would weaken the encryption. Phonetic pronunciation would be the most secure, but result in a longer encrypted message.

score 3 · Answer 3 · edited Apr 13 '17 at 12:48

This is just a polygraphic substitution cipher, where you have a fixed substitution over the set of words in the English language. As your input text has a non-uniform distribution, this will show in the ciphertext.

As you stated, you want to add fake words to make frequency analysis more difficult, however, it's not that simple:

Just adding fake words does not reduce the frequency relation between other words.
Words like articles (the, a, an), pronouns (I,you,he,she,we,...) and some other words (go, say, do, house, car, ...) are much more common than others. Of course this also depends somewhat on the context - but it's usually a bad idea to make your key dependent on the message.
You could use homophonic substitution to alleviate this problem. However if you get the frequency for single words almost uniform, this will show up in the frequencies of bigrams for words (two words following each other). There is an example of this effect for just usual homophonic substitution in this question.

In general, any kind of substitution cipher can be defeated by frequency analysis somehow. More broadly, any kind of classical cipher is considered entirely insecure from the point of view of modern cryptography. Most of them already fail under known-plaintext attacks, while encryption schemes today have to withstand at least chosen-plaintext attacks.

Even trying to strengthen this with a proper symmetric cipher seems unreasonable, because the requirements stated in the question make it impossible to fulfill any kind of proper security definition:

1. That's fine.
2. This points towards steganography; maybe this question is interesting in this regard.
3. That's a no-go. The entire reason for the existence of cryptographic keys is that you can easily replace them. If this is fixed, we should consider the dictionary as part of the algorithm and not a key, and that clashes with Kerckhoffs' principle. Actually it means you have no key, this has $0$ bit security, and is really just an encoding and not a cipher.
4. This is broken with known-plaintext attacks, given enough pairs to cover the words used. With a decryption oracle (chosen-ciphertext attack) this is done in a single try: you can query the oracle with any ciphertext you like except the challenge ciphertext. So if you just make a slight modification (repeat the last word, switch the order of words, ...), the result can be queried and the oracle then gives you the plaintext.
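To illustrate the known-plaintext point, here is a small Python sketch. It assumes a plain word-for-word substitution without decoy words, so plaintext and ciphertext align position by position; the function names and sample pairs are made up.

```python
def learn_mapping(pairs):
    """Recover ciphertext-word -> plaintext-word from known (plain, cipher) pairs."""
    mapping = {}
    for plaintext, ciphertext in pairs:
        for plain_word, cipher_word in zip(plaintext.split(), ciphertext.split()):
            mapping[cipher_word] = plain_word
    return mapping


def attack(ciphertext, mapping):
    """Decrypt with the recovered mapping; unknown words are shown as '?'."""
    return " ".join(mapping.get(word, "?") for word in ciphertext.split())


known_pairs = [("I Love You", "Cat Dog Home")]
mapping = learn_mapping(known_pairs)

print(attack("Home Dog Cat", mapping))   # every word already seen in a known pair
print(attack("Home Dog Fish", mapping))  # "Fish" was never observed
```

Each known pair permanently reveals every word it contains, since the mapping never changes.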

score 2 · Answer 4 · answered Nov 10 '16 at 13:58

You can't just replace bytes with words in modern encryption methods.

They are not just simple byte substitutions. They actually work at the bit level, with S-boxes (substitutions using a lookup table), permutations of bits, XORing of bits and so on. You can't use those with words.

So no, there is no way to have a secure encryption system with your requirements. In fact, some of them (e.g. each word always translating to the same word) mean that any algorithm that implements them will be insecure.