
I'm doing the Matasano crypto challenges. After implementing a naive character-frequency scoring (which worked), I was looking for something mathematically more precise and found this thread: Developing algorithm for detecting plain text via frequency analysis

While the chi-squared analysis does work for me, I ran into a problem. My current code simply sets the score to infinity as soon as it encounters a non-printable character. I want to avoid that, since a possible plaintext could contain a few non-printable ASCII bytes (for example, if it is actually Unicode and contains a Greek word). But when I ignore every non-printable character for scoring, as I already do with characters like ".,!\n\t", I run into the problem that strings with a few high-scoring characters and lots of junk score better than actual English plaintext.

My current code is as follows (frequency.frequency is a dict that maps ord(character) to its frequency, for a–z and space):

import frequency
import copy
from math import inf
from string import printable

def score_string(inp):
    # Expected counts: scale the reference letter frequencies (keyed by
    # ord(character) for a-z and space) up to the length of the input.
    exp_dist = copy.deepcopy(frequency.frequency)
    length = len(inp)
    for key in exp_dist:
        exp_dist[key] *= length
    # Observed counts for the same keys.
    dist = dict.fromkeys(exp_dist.keys(), 0)
    for char in inp.lower():
        if chr(char) not in printable:  # This is the part I'd like to avoid
            return inf
        if char in dist:
            dist[char] += 1
    # Chi-squared statistic: sum of (observed - expected)^2 / expected.
    score = 0
    for key in exp_dist:
        score += (dist[key] - exp_dist[key])**2 / exp_dist[key]
    return score

If I understand what is going on correctly, my issue is that for shorter strings (the example I worked with was 34 characters long), the negative impact of even a single low-scoring printable character far outweighs the negative impact of a character that is ignored by the scoring.

I'm looking for an elegant way to guarantee that non-printable characters hurt the score more than any printable character does, without dismissing them completely.

nahema

1 Answer


One approach you might try is applying additive smoothing via pseudocounts. At its most basic, this can be as simple as adding 1 to the number of times each byte (or $n$-tuple, or whatever) occurs in your corpus before calculating the frequencies. This allows you to assign a non-zero (although possibly very small) likelihood to any decrypted byte string, even one containing bytes that never appear in your original corpus.

So, for example, if you started with a 1,000,000-byte corpus of English text, where the byte e occurs 124,319 times and the byte ` occurs 0 times, you'd increment those counts to 124,320 and 1 respectively, and the total number of bytes observed to 1,000,256, before dividing the counts by this total to obtain the smoothed frequency of each byte in the corpus (e.g. 1 / 1,000,256 ≈ 0.0000009997 for the byte `).
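In Python, a minimal sketch of that counting step might look something like this (the function name and the toy corpus are placeholders I've made up for illustration, not anything from the challenges):

from collections import Counter

def smoothed_byte_frequencies(corpus: bytes) -> dict:
    # Count each of the 256 possible byte values in the corpus, add a
    # pseudocount of 1 to every byte, and divide by the adjusted total.
    counts = Counter(corpus)
    total = len(corpus) + 256  # e.g. 1,000,000 observed bytes -> 1,000,256
    return {b: (counts.get(b, 0) + 1) / total for b in range(256)}

# Toy usage: even a byte that never occurs in the corpus gets a small but
# non-zero frequency, so on its own it can no longer drive a candidate's
# likelihood all the way to zero.
freqs = smoothed_byte_frequencies(b"the quick brown fox jumps over the lazy dog")
print(freqs[ord("`")])  # roughly 1 / (len(corpus) + 256), small but non-zero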

The mathematical theory behind why this seemingly arbitrary adjustment makes sense is kind of tricky, but basically it turns out to be a sensible way to adjust the probabilities obtained from a series of observations to account for the possibility that some rare but possible outcomes might simply not have appeared by chance during the finite number of observations made.

(In particular, analyzed in the framework of Bayesian statistics, this smoothing rule turns out to yield the same expected posterior probabilities as we'd get by starting from a "flat prior", i.e. the assumption that any distribution of byte frequencies is equally likely, and repeatedly applying Bayes' theorem to update this belief as we observe new bytes in the corpus. Furthermore, if we did have reason to believe a priori that some bytes should be more likely than others even before looking at the corpus, we could include this prior information by assigning the more likely bytes pseudocounts higher than 1.)
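To make that last point concrete, here's a tiny sketch of the same smoothing step with per-byte pseudocounts; the function name and signature are my own illustration, not a standard API:

def smoothed_frequencies_with_prior(counts, pseudocounts):
    # counts: observed occurrences per byte value (0-255).
    # pseudocounts: prior weight per byte; 1 for every byte reproduces plain
    # add-one smoothing, while a larger value encodes a stronger prior belief
    # that that byte is common.
    smoothed = {b: counts.get(b, 0) + pseudocounts.get(b, 1) for b in range(256)}
    total = sum(smoothed.values())
    return {b: c / total for b, c in smoothed.items()}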

Of course, if you're using a given table of byte / letter frequencies instead of compiling your own, you'd have to guesstimate the amount of smoothing needed yourself. But you can still do it: just multiply each frequency by some assumed corpus size (e.g. 1,000), increment the results by one, and (if needed) rescale them back down so that they sum to one again.
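For instance, a rough sketch of that recipe in Python, assuming the same ord()-keyed dict as in the question and the example corpus size of 1,000 (both just illustrative assumptions):

def smooth_frequency_table(freq, assumed_corpus_size=1000):
    # freq maps ord(character) -> relative frequency, like frequency.frequency
    # in the question. Turn the frequencies back into approximate counts,
    # add a pseudocount of 1 for every possible byte value, then renormalize
    # so the smoothed frequencies sum to 1 again.
    pseudo = {b: freq.get(b, 0.0) * assumed_corpus_size + 1 for b in range(256)}
    total = sum(pseudo.values())  # = assumed_corpus_size + 256 if freq sums to 1
    return {b: c / total for b, c in pseudo.items()}

The smoothed table then gives every byte, printable or not, some small expected frequency, so a stray non-printable character hurts the score without instantly disqualifying the string.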

Ilmari Karonen