Can you break a multi-language code using frequency analysis?

Question

Let's say that I wrote a 26-letter alphabet, where each letter of my alphabet represents a letter from the Latin alphabet.

I'm writing in 3 languages, and only I know which ones. The grammar is that of my native language.

Each time I need to write a word, I randomly choose which language to write it in.

Can you actually break this code using frequency analysis?

EDIT

Example:

abcdefghijklmnopqrstuvwxyz  
xzcdnefqghijbkslmaoprtuvwy

The sentence:

this is a cherry tree

becomes:

ceci is a sakura

[fr] [en] [en] [jp]

and once encoded:

cncg go x oxirax

score 6 · Accepted Answer · answered Jul 22 '13 at 05:36

Yes, we can almost certainly break this, given enough ciphertext.

One approach would be to use a dictionary and use word patterns. For instance, if the ciphertext word is qddxfozogf, then the plaintext word was probably ammunition. Notice how the 2nd and 3rd letters are the same; and the 5th and 10th letters, and the 6th and 8th letters? The word ammunition is essentially the only word that has this special pattern. One could expect that some fraction of words will be essentially unique, in their pattern of repetition. With enough ciphertext, you'll be able to uniquely recover one or more of the words, and then the rest of the cipher will be easy to break.
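
The pattern-matching idea can be sketched in a few lines of Python. This is a minimal illustration, not a full attack: the `WORDS` list is a hypothetical toy dictionary, and the function names are made up for this example. In practice you would merge word lists for every language the writer might be using.

```python
def word_pattern(word):
    """Replace each letter by the index of its first appearance in the word."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)


def candidates(cipher_word, dictionary):
    """Return dictionary words whose repetition pattern matches the cipher word."""
    target = word_pattern(cipher_word)
    return [
        w for w in dictionary
        if len(w) == len(cipher_word) and word_pattern(w) == target
    ]


# Hypothetical toy dictionary; a real attack would use large word lists.
WORDS = ["ammunition", "dictionary", "substitute", "javascript", "background"]

print(candidates("qddxfozogf", WORDS))
```

Each uniquely matched word fixes several letters of the key at once, which in turn narrows the candidates for every other ciphertext word.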

Frequency analysis will also almost certainly be possible. For instance, the letter q occurs with frequency 0.1% in English, 0.02% in German, and 0.9% in Spanish. In contrast, the letter e occurs with frequency 14.7%, 17.4%, and 13.7% in English, German, and Spanish, respectively. Therefore, if you choose uniformly at random between English, German, and Spanish, the letter q will occur in the plaintext with frequency about 0.3%, while the letter e will occur with frequency about 15.3%. Notice how e is still much more common than q? This indicates that frequency analysis remains possible. (Wikipedia has data on the frequency statistics of different languages.)
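
The arithmetic above can be reproduced with a short sketch. It uses the percentages exactly as quoted in this answer and assumes, as a simplification, that each language contributes an equal share of the letters; since the choice is made per word and word lengths differ between languages, the real mixture would be weighted slightly differently.

```python
# Letter frequencies in percent, copied from the figures quoted in the answer.
FREQ = {
    "q": {"English": 0.1, "German": 0.02, "Spanish": 0.9},
    "e": {"English": 14.7, "German": 17.4, "Spanish": 13.7},
}


def mixed_frequency(per_language):
    """Average frequency, assuming every language is equally represented."""
    return sum(per_language.values()) / len(per_language)


for letter, table in FREQ.items():
    print(letter, round(mixed_frequency(table), 1))
```

The gap between the two averages is what matters: mixing languages flattens the distribution a little, but nowhere near enough to hide which ciphertext letters stand for common plaintext letters.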

Digraph analysis (the frequency of pairs of letters, like the pair th or qu) will also almost certainly be possible.
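
As a rough sketch of what collecting digraph statistics looks like, the following counts adjacent letter pairs inside each word of a ciphertext. The function name is an assumption for illustration; the counts would then be compared against digraph tables for the suspected languages.

```python
from collections import Counter


def digraph_counts(text):
    """Count adjacent letter pairs within each word (never across spaces)."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


# With real ciphertext, the most frequent pairs are the first ones to guess at.
print(digraph_counts("oxirax oxirax").most_common(3))
```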

Finally, there are more sophisticated methods for cryptanalysis of simple substitution ciphers, such as hill-climbing. These methods will almost certainly be effective at ciphertext-only cryptanalysis.
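
For readers unfamiliar with hill-climbing, here is a minimal sketch of the general idea: start from a random key, swap two letters, and keep the swap only if the decryption scores better. Everything here is illustrative. In particular, `toy_score` is a hypothetical stand-in; a serious attack would score candidates with n-gram log-probabilities built from text in the mixed languages, and would restart several times to escape local maxima.

```python
import random
import string

ALPHABET = string.ascii_lowercase
COMMON_DIGRAPHS = ("th", "he", "in", "er", "an")  # toy choice, English only


def decrypt(ciphertext, key):
    """key[i] is the guessed plaintext letter for cipher letter ALPHABET[i]."""
    return ciphertext.translate(str.maketrans(ALPHABET, key))


def toy_score(text):
    """Hypothetical fitness function: count a few common digraphs."""
    return sum(text.count(d) for d in COMMON_DIGRAPHS)


def hill_climb(ciphertext, score=toy_score, steps=2000, seed=0):
    rng = random.Random(seed)
    key = list(ALPHABET)
    rng.shuffle(key)
    best = score(decrypt(ciphertext, "".join(key)))
    for _ in range(steps):
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]          # try swapping two letters
        candidate = score(decrypt(ciphertext, "".join(key)))
        if candidate > best:
            best = candidate                      # keep the improvement
        else:
            key[i], key[j] = key[j], key[i]      # undo the swap
    return "".join(key), best
```

Nothing in this procedure depends on the plaintext being in a single language; only the scoring function needs to reflect the mixture.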

And the cipher will completely fall apart in the presence of known plaintext (let alone chosen ciphertext).
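
To see how little known plaintext is needed, consider this sketch, which uses the pair "a sakura" / "x oxirax" from the question's example. The helper names are made up for illustration.

```python
def recover_key(plaintext, ciphertext):
    """Build a cipher-letter -> plain-letter map from an aligned known pair."""
    if len(plaintext) != len(ciphertext):
        raise ValueError("plaintext and ciphertext must be aligned")
    key = {}
    for p, c in zip(plaintext, ciphertext):
        if p.isalpha() and key.setdefault(c, p) != p:
            raise ValueError("pair is not consistent with a substitution cipher")
    return key


def partial_decrypt(ciphertext, key):
    """Decrypt known letters; mark still-unknown letters with '?'."""
    return "".join(key.get(c, "?") if c.isalpha() else c for c in ciphertext)


key = recover_key("a sakura", "x oxirax")
print(sorted(key.items()))
print(partial_decrypt("x oxirax", key))
```

A single known word already fixes five of the 26 letters, and every further known word fills in more of the table.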

TL;DR: Your proposal is highly insecure. Don't use it.

score 1 · Answer 2 · edited Apr 13 '17 at 12:48

All natural languages use some symbols more than they use others (or maybe not? Is there a linguistics expert among us to correct me?).

Therefore, if your cipher doesn't mask the frequency of the characters (e.g., a simple permutation or substitution), then you are vulnerable to frequency analysis. Perhaps Chinese is somewhat resistant to this, but it is still vulnerable.

However, since you are using the Latin alphabet for all languages, I don't see how you gain any advantage by mixing them up. The only one put at a disadvantage is the receiver, who (potentially) has to use a dictionary to see what you mean. The attacker can use a dictionary too.

Your system does provide some degree of obfuscation but there are only so many languages using the Latin alphabet. If you use a single language for the entire message, the cipher can easily be broken. If you use a different translation for each word in the message you may complicate things a bit but you don't gain any real advantage.
