Efficient encoding of sudoku puzzles

Question

Specifying an arbitrary 9x9 grid requires giving the position and value of each square. A naïve encoding might give 81 (x, y, value) triplets, requiring 4 bits each for x, y, and value (1-9 = 9 values = 4 bits), for a total of 81x4x3 = 972 bits. By numbering each square, one can reduce the positional information to 7 bits, dropping one bit per square for a total of 891 bits. By specifying a predetermined order, one can reduce this more drastically to just the 4 bits for each value, for a total of 324 bits.

However, a sudoku can have missing numbers. This provides the potential for reducing the number of numbers that have to be specified, but may require additional bits for indicating positions. Using our 11-bit encoding of (position, value), we can specify a puzzle with $n$ clues with $11n$ bits, e.g. a minimal (17-clue) puzzle requires 187 bits. The best encoding I've thought of so far is to use one bit for each space to indicate whether it's filled and, if so, the following 4 bits encode the number. This requires $81+4n$ bits, 149 for a minimal puzzle ($n=17$).

Is there a more efficient encoding, preferably without a database of each valid sudoku setup? (Bonus points for addressing a general $n$-clue $N \times N$ puzzle.)
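As a quick check of the bit counts quoted above, here is a minimal Python sketch. The function names are made up for illustration; only the arithmetic comes from the question.

```python
def bits_needed(num_values: int) -> int:
    """Smallest number of bits that can distinguish num_values alternatives."""
    return (num_values - 1).bit_length()

VALUE_BITS = bits_needed(9)    # digits 1-9 -> 4 bits
COORD_BITS = bits_needed(9)    # one x or y coordinate -> 4 bits
INDEX_BITS = bits_needed(81)   # a numbered square -> 7 bits

def triplet_bits() -> int:
    # 81 (x, y, value) triplets
    return 81 * (2 * COORD_BITS + VALUE_BITS)

def indexed_bits() -> int:
    # 81 (square number, value) pairs
    return 81 * (INDEX_BITS + VALUE_BITS)

def ordered_bits() -> int:
    # values only, in a predetermined order
    return 81 * VALUE_BITS

def sparse_bits(n: int) -> int:
    # only the n clues, each as (square number, value)
    return n * (INDEX_BITS + VALUE_BITS)

def flagged_bits(n: int) -> int:
    # one "filled?" flag per square plus 4 bits per clue
    return 81 + VALUE_BITS * n

if __name__ == "__main__":
    print(triplet_bits(), indexed_bits(), ordered_bits())
    print(sparse_bits(17), flagged_bits(17))
```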

It just occurred to me that many puzzles will be a rotation of another, or have a simple permutation of digits. Perhaps that could help reduce the bits required.

According to Wikipedia,

The number of classic 9×9 Sudoku solution grids is 6,670,903,752,021,072,936,960 (sequence A107739 in OEIS), or approximately $6.67\times 10^{21}$.

If I did my math right ($\frac{\ln(6{,}670{,}903{,}752{,}021{,}072{,}936{,}960)}{\ln 2}$), that comes out to 73 (72.498) bits of information for a lookup table.

But:

The number of essentially different solutions, when symmetries such as rotation, reflection, permutation and relabelling are taken into account, was shown to be just 5,472,730,538[15] (sequence A109741 in OEIS).

That gives 33 (32.35) bits, so it's possible that a clever method of indicating which permutation to use could get below the full 73 bits.
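Both figures are base-2 logarithms of the quoted counts, rounded up to whole bits. A small Python sketch to reproduce them (the constant names are just labels chosen here):

```python
import math

# Counts quoted from Wikipedia / OEIS in the text above.
ALL_GRIDS = 6_670_903_752_021_072_936_960   # A107739
ESSENTIALLY_DIFFERENT = 5_472_730_538       # A109741

def lookup_bits(count: int) -> float:
    """Information needed to index one item out of `count` items."""
    return math.log2(count)

if __name__ == "__main__":
    for label, count in [("all grids", ALL_GRIDS),
                         ("essentially different", ESSENTIALLY_DIFFERENT)]:
        exact = lookup_bits(count)
        print(f"{label}: {exact:.3f} bits, i.e. {math.ceil(exact)} whole bits")
```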

Janoma · Answer 1 · 2012-03-10T11:44:40.293

Is there a more efficient encoding, preferably without a database of each valid sudoku setup?

Yes. I can think of an encoding that improves on your 149-bit encoding of a minimal $9\times 9$ puzzle by 6 or 9 bits, depending on a condition. This is without a database or any register of other solutions or partial boards. Here it goes:

First, you use $4$ bits to encode a number $m$ with a minimal number of appearances in the board. The next $4$ bits encode the actual number $\ell$ of times $m$ appears. The next $7\ell$ bits encode each of the positions in which $m$ appears.

The following $81-\ell$ bits are flags indicating whether the remaining positions have a number or not (you just skip the positions in which $m$ is). Whenever one of these bits is 1, the next 3 bits indicate which number it is (in the ordered set $\{1,\ldots,9\}$ without $m$). For example, if $m=4$ and the 3 bits are 101, then the number in the corresponding position on the board is the 5th (counting from 0) in the set $\{1,2,3,5,6,7,8,9\}$, so it is $7$. Numbers $j<m$ will be encoded in binary as $j-1$, while numbers $j>m$ will be encoded as $j-2$. Since we had already written $\ell$ positions, only $3(n-\ell)$ bits will be added to encode the rest of the board in this step.

Thus, the total number of bits required to encode a board using this procedure is $$B=4+4+7\ell+(81-\ell)+3(n-\ell)=89+3\ell+3n.$$

For $n=17$, we note that $\ell$ can be 0 or 1 (in general, $\ell\leq\lfloor n/9\rfloor$). Thus, $B$ can be 140 or 143 depending on whether there's a number not appearing on the board.

It's worth pointing out that Kevin's solution is way better in the general case. This encoding stays within 149 bits only when $n+\ell\leq 20$: for $n\in\{17,18\}$, for $n=19$ provided that $\ell\leq 1$, or for $n=20$ provided that $\ell=0$. At least it shows a general idea on how to take advantage of the fact that $N=9$ is very close to $2^{\lfloor\log_2N\rfloor}$ (which means we tend to "lose memory" by using 4 bits per value, since 4 bits allow us to express $N=16$ numbers as well).
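To see where the formula for $B$ stops beating the 149-bit baseline, one can simply tabulate it. This is a minimal sketch with hypothetical function names; it only evaluates the formula given above and the question's $81+4n$.

```python
def flag_encoding_bits(n: int) -> int:
    """The question's encoding: 81 flags plus 4 bits per clue."""
    return 81 + 4 * n

def min_digit_encoding_bits(n: int, l: int) -> int:
    """B = 4 + 4 + 7l + (81 - l) + 3(n - l) = 89 + 3l + 3n."""
    return 4 + 4 + 7 * l + (81 - l) + 3 * (n - l)

if __name__ == "__main__":
    baseline = flag_encoding_bits(17)  # the 149-bit figure for a minimal puzzle
    for n in range(17, 22):
        # l, the count of the rarest digit, can be at most floor(n / 9)
        for l in range(n // 9 + 1):
            b = min_digit_encoding_bits(n, l)
            verdict = "within" if b <= baseline else "above"
            print(f"n={n} l={l}: B={b} ({verdict} {baseline}), "
                  f"flag encoding for the same n: {flag_encoding_bits(n)}")
```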

Example. Consider the following board with $n=17$ clues.

.  .  .   .  .  .   .  1  .
4  .  .   .  .  .   .  .  .
.  2  .   .  .  .   .  .  .

.  .  .   .  5  .   4  .  7
.  .  8   .  .  .   3  .  .
.  .  1   .  9  .   .  .  .

3  .  .   4  .  .   2  .  .
.  5  .   1  .  .   .  .  .
.  .  .   8  .  6   .  .  .

Here, every digit appears on the board at least once, and numbers 6, 7 and 9 appear only once. We take $m=7$ (0111) and $\ell=1$ (0001). Reading the positions from left to right and then from top to bottom (counting from 1), $m$ appears in position $36$ (0100100). Thus, our encoding begins with 011100010100100.

Next, we need seven 0s, one 1 and the 3-bit encoding of the number $1$, then a 0 followed by a 1 and the 3-bit encoding of $4$, etc. (0000000100001011). Eventually, we will skip the position where $m=7$ is, and we will encode 8 as 110 (the 6th number counting from 0 on the list $\{1,2,3,4,5,6,8,9\}$) and 9 as 111. The full encoding goes as follows:

// m=7, l=1 and its position on the board.
011100010100100
// Numbers 1 and 4 at the beginning. Note that 1 is encoded 000, and 4 is 011.
0000000100001011
// Numbers 2 and 5.
0000000001001000000000001100
// Numbers 4 and 8. We skip the appearance of 7 and encode 8 as 110.
010110001110
// 3, 1 and 9. 9 is encoded as 111.
00010100000100001111
// 3, 4, 2, 5, 1, 8, 6 and the last empty cells.
0000101000101100100100011000100000000000111001101000

The complete encoding is 01110001010010000000001000010110000000001001000000000001100010110001110000101000001000011110000101000101100100100011000100000000000111001101000, and the reader can check the length of that string is indeed 143 :-)
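For readers who would rather not check the string by hand, here is a minimal Python sketch of the scheme described above, with a matching decoder. The function names and the optional `m` argument are illustrative assumptions; positions are written 1-based to match the worked example, and ties for the rarest digit are broken towards the smallest digit unless `m` is given.

```python
BOARD = """
. . .  . . .  . 1 .
4 . .  . . .  . . .
. 2 .  . . .  . . .
. . .  . 5 .  4 . 7
. . 8  . . .  3 . .
. . 1  . 9 .  . . .
3 . .  4 . .  2 . .
. 5 .  1 . .  . . .
. . .  8 . 6  . . .
"""

def parse(text):
    """81 cells in reading order, 0 for an empty cell."""
    return [0 if tok == "." else int(tok) for tok in text.split()]

def encode(cells, m=None):
    if m is None:
        m = min(range(1, 10), key=cells.count)       # a rarest digit
    positions = [i + 1 for i, v in enumerate(cells) if v == m]
    bits = format(m, "04b") + format(len(positions), "04b")
    bits += "".join(format(p, "07b") for p in positions)
    for v in cells:
        if v == m:
            continue                                 # already stored by position
        if v == 0:
            bits += "0"
        else:                                        # 3-bit index in {1..9} minus m
            bits += "1" + format(v - 1 if v < m else v - 2, "03b")
    return bits

def decode(bits):
    m, count = int(bits[0:4], 2), int(bits[4:8], 2)
    pos, cells = 8, [None] * 81
    for _ in range(count):
        cells[int(bits[pos:pos + 7], 2) - 1] = m
        pos += 7
    for i in range(81):
        if cells[i] == m:
            continue
        if bits[pos] == "0":
            cells[i], pos = 0, pos + 1
        else:
            k = int(bits[pos + 1:pos + 4], 2)
            cells[i], pos = (k + 1 if k + 1 < m else k + 2), pos + 4
    return cells

if __name__ == "__main__":
    code = encode(parse(BOARD), m=7)   # m=7 as in the worked example
    print(len(code), code)
```

Any digit with the minimal count gives the same length, so the decoder does not care which of 6, 7 or 9 is chosen as $m$ here.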

53x15 · Answer 2 · 2020-12-18T22:34:44.167

Here's a no-table method that combines a 75-bit encoding of the solution grid with an 81-bit encoding of which cells are clues to give a 156 bit fixed-length encoding for all puzzles ([edit] also see below for further development of a variable length encoder for the clue positions which needs 73.1 bits on average instead of 81, yielding 148.1 overall with an upper bound of 156).

Consider the pattern below:

1 2 3|x x x|x x x
4 5 6|x x x|x x x
7 8 9|x x x|x x x
-----+-----+-----
y y y|. . .|. . .
y y y|. . .|. . .
y y y|. . .|. . .
-----+-----+-----
y y y|. . .|. . .
y y y|. . .|. . .
y y y|. . .|. . .

There are 36288^2 = 1316818944 configurations of the x's and y's that are essentially different in terms of the constraints they impose on solutions completing the rest of the grid. (There are 1+27+27+1=56 ways to choose unordered sets of three digits to fill the rows of the first box of x's, but this double-counts since the left and right boxes can be exchanged, so it's really 28. Then holding the first row fixed in some canonical order, there are (3!)^4 ways to permute the x digits on the second and third rows, giving 36288 essentially different configurations for the x's, with the same logic applying to the y's.)

For each of these there are 1881169920 equivalence-preserving transformations arising from (2 * 3! * 3! ways to permute the rightmost 6 columns) X (2 * 3! * 3! ways to permute the bottom 6 rows) X (9! ways to permute all the digits).

Running a backtracking solver for each of the essentially different configurations, we find that none has more than 11664 completed solutions[1].

We can therefore uniquely encode a Sudoku solution grid by identifying a starting configuration, an equivalence-preserving transformation, and a solution index (given, say, lexicographical ordering). This requires 75 bits (pretty close to the 73 bit minimum possible) since:

$\small{\log_2(1316818944) + \log_2(1881169920) + \log_2(11664) \approx 74.6}$

Combining this with an 81-bit encoding of which cells are given as clues gives a 156 bit fixed-length encoding overall.
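The counting above is easy to re-derive mechanically. A small Python sketch follows; the variable names are chosen here for illustration, and the 11664 bound is taken from the solver run described above rather than recomputed.

```python
import math

# 28 unordered row-triple splits, times (3!)^4 orderings of rows two and three.
x_configs = 28 * math.factorial(3) ** 4
start_configs = x_configs ** 2                      # x's and y's independently

# (2 * 3! * 3!) column permutations, the same for rows, times 9! relabellings.
line_perms = 2 * math.factorial(3) * math.factorial(3)
transforms = line_perms * line_perms * math.factorial(9)

MAX_COMPLETIONS = 11664                             # bound reported in the text

grid_bits = (math.log2(start_configs)
             + math.log2(transforms)
             + math.log2(MAX_COMPLETIONS))
fixed_length = math.ceil(grid_bits) + 81            # plus one clue flag per cell

if __name__ == "__main__":
    print(x_configs, start_configs, transforms)
    print(f"{grid_bits:.2f} bits for the grid, {fixed_length} bits in total")
```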

[edit] Following the encouragement of greybeard below, we can also try to use fewer than 81 bits to encode which cells are clues. One way to do this is to evaluate each cell in turn following some deterministic ordering and emitting the corresponding bit only if its value can not be determined by the clues earlier in the ordering and the constraints of solution uniqueness or puzzle minimality (if this is assumed).

For example, suppose we have a propositional theory, $T$, encoding the exactly-one constraints of Sudoku, where $ x_{ij} $ represents a candidate for cell $ i \in \{1..81\} $ and value $ j \in \{1..9\} $, and $c_i$ represents whether cell $i$ contains a clue. To address a specific puzzle and its solution, let $s_i$ be an alias for whichever $x_{ij}$ is the solution for cell $i$, let $S$ be the set of clauses $\land_i\{ \lnot c_i, s_i \}$, and let $g_i$ be an alias for either $c_i$ or $\lnot c_i$, depending on whether cell $i$ is given.

We can write the clue info as follows:

$\mathrm{for}\, i\, \mathrm{in} \, 1..81 \\ \quad \mathrm{if \, (satcount}( T \land S \land (\land_{j<i} g_j) \land \lnot s_i) = 0) \, continue \\ \quad \mathrm{if \, (satcount}( T \land S \land (\land_{j<i} g_j)\land (\land_{j>i} c_j)) > 1) \, continue \\ \quad \mathrm{emit}(g_i)$

The first condition skips writing non-clue indicators for cells whose value is implied by prior clues. The second skips writing clue indicators for cells which, given prior clues, must be clues to make the solution unique.

Testing with a million Sudoku puzzles generated by a controlled-bias sampler finds that this encoder requires on average 73.8 bits per puzzle, or 76.6 bits per puzzle if we don't assume the puzzles are minimal. Using a knight's move cell ordering instead of the natural ordering brings this down to 73.1 bits per puzzle (assuming minimality), and there is likely room for further improvement by finding heuristic orderings based on the solution grid.

So this gives us a full encoding with expected length of 148.1 bits, with a 156 bit upper bound.

[edit2]: Another scheme, maybe simpler, but achieving slightly inferior compression, is to use 5 bits to encode the clue count (giving a range from 17 to 48, which should include all minimal puzzles), and then to encode the clue positions for n-clue puzzles using the position in an ordering of the $\binom{81}{n}$ ways to choose n clues. Assuming the same sample of puzzles as used above, this requires on average 75.6 bits per puzzle.
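One way to realise "position in an ordering of the ways to choose n clues" is the combinatorial number system. The sketch below is an assumption about how such a ranking could be done, not the code behind the quoted measurements; cell indices are 0-based and all function names are hypothetical.

```python
from math import comb

def rank_combination(positions):
    """Rank of a set of distinct 0-based cell indices among same-size subsets."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def unrank_combination(rank, n):
    """Inverse of rank_combination for subsets of size n."""
    positions = []
    for k in range(n, 0, -1):
        p = k - 1
        while comb(p + 1, k) <= rank:   # largest p with C(p, k) <= rank
            p += 1
        rank -= comb(p, k)
        positions.append(p)
    return sorted(positions)

def encode_clue_positions(positions):
    n = len(positions)                          # assumed to lie in 17..48
    width = (comb(81, n) - 1).bit_length()      # bits to index C(81, n) subsets
    return format(n - 17, "05b") + format(rank_combination(positions), f"0{width}b")

def decode_clue_positions(bits):
    n = int(bits[:5], 2) + 17
    return unrank_combination(int(bits[5:], 2), n)

if __name__ == "__main__":
    clues = [7, 9, 19, 31, 33, 35, 38, 42, 47, 49, 54, 57, 60, 64, 66, 75, 77]
    code = encode_clue_positions(clues)
    print(len(code), decode_clue_positions(code) == clues)
```

The length depends only on the clue count, so the decoder knows how many bits to read once it has the 5-bit prefix.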

[1] A table of solution counts for the configurations described above can be found here: https://github.com/t-dillon/tdoku/releases/tag/tables in tables.tar.xz

$ hexdump -d grid.counts | awk '{for (i=2;i<NF;i++) if ($i>x) x=$i} END {print x}'
11664

Pseudonym · Answer 3 · 2020-09-10T03:55:50.970

it's possible that a clever method of indicating which permutation to use could get below the full 73 bits

No, because encoding the permutation/relabelling/rotation/whatever takes you back to 73 bits.

Off the top of my head, I'm guessing there are 8 possible symmetries. Then assuming that you need two permutations (rows and columns?) and one relabelling:

$$32.35 + 3 \log_2 9! + \log_2 8 \approx 73.76$$

I must have overcounted somewhere in counting the permutations and relabellings, but it's very close. Presumably not all permutations result in valid grids.
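For comparison, here is a sketch of the same accounting using the commonly cited size of the group of validity-preserving transformations, $9! \cdot 2 \cdot 6^8$ (relabellings, times transposition, times permutations of bands, stacks, and rows and columns within them). That group size is an assumption brought in from outside this post; the two grid counts are the ones quoted in the question.

```python
import math

ALL_GRIDS = 6_670_903_752_021_072_936_960
ESSENTIALLY_DIFFERENT = 5_472_730_538

# ASSUMPTION: commonly cited symmetry group order, not derived in this thread.
relabellings = math.factorial(9)
grid_symmetries = 2 * 6 ** 8          # transpose * (3!)^8 band/stack/row/column perms
group_order = relabellings * grid_symmetries

class_bits = math.log2(ESSENTIALLY_DIFFERENT)
transform_bits = math.log2(group_order)

if __name__ == "__main__":
    print(f"class index: {class_bits:.2f} bits")
    print(f"transformation: {transform_bits:.2f} bits")
    print(f"sum: {class_bits + transform_bits:.3f} bits "
          f"vs {math.log2(ALL_GRIDS):.3f} for a direct grid index")
```

The sum can only match or exceed the direct index, since no equivalence class can contain more grids than there are transformations.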

I do have an encoding that's going to get you extremely close to the minimal 73 bits, and it doesn't use any tables.

You're not going to like it.

Essentially, you encode each cell using your favourite fractional-bit encoder (e.g. arithmetic, range, ANS, whatever), and encode each symbol with probability $\frac{1}{n}$ where $n$ is the number of values that the cell could have in any valid Sudoku puzzle.

You determine that set by using a Sudoku solver; ideally one that allows for incremental updates such as DLX. For the cell in question, you try each possible digit as a given value and see if the resulting puzzle has any solution. You don't need the solution, just to know that one exists. If it does, that's a possible value.

If it seems a bit crazy to use a Sudoku solver in the inner loop just to test individual values, do bear in mind that this is how you do "perfect pencilling".
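To make the idea concrete without a real arithmetic coder or DLX, the sketch below only computes the ideal code length $\sum \log_2 n$ for one grid, using a naive backtracking solver as the "does any solution exist?" oracle. To keep brute force fast it assumes a 4×4 Sudoku with 2×2 boxes; that size, the cell order (row by row) and all names are illustrative assumptions.

```python
import math

N, BOX = 4, 2   # ASSUMPTION: 4x4 grid with 2x2 boxes, so brute force stays fast

def allowed(grid, r, c, d):
    """True if digit d does not clash with row r, column c or the box of (r, c)."""
    if any(grid[r][j] == d for j in range(N)):
        return False
    if any(grid[i][c] == d for i in range(N)):
        return False
    br, bc = BOX * (r // BOX), BOX * (c // BOX)
    return all(grid[i][j] != d
               for i in range(br, br + BOX) for j in range(bc, bc + BOX))

def has_solution(grid):
    """Naive backtracking existence check (stands in for DLX)."""
    for r in range(N):
        for c in range(N):
            if grid[r][c] == 0:
                for d in range(1, N + 1):
                    if allowed(grid, r, c, d):
                        grid[r][c] = d
                        found = has_solution(grid)
                        grid[r][c] = 0
                        if found:
                            return True
                return False
    return True

def ideal_code_length(solution):
    """Sum of log2(number of viable digits) over cells, in reading order."""
    grid = [[0] * N for _ in range(N)]
    bits = 0.0
    for r in range(N):
        for c in range(N):
            viable = 0
            for d in range(1, N + 1):
                if allowed(grid, r, c, d):
                    grid[r][c] = d
                    viable += has_solution(grid)   # digit survives if a completion exists
                    grid[r][c] = 0
            bits += math.log2(viable)
            grid[r][c] = solution[r][c]
    return bits

if __name__ == "__main__":
    example = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
    print(f"{ideal_code_length(example):.3f} bits")
```

A fractional-bit coder fed these uniform probabilities would emit roughly this many bits; cells whose value is forced cost nothing.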

Bartłomiej Uliasz · Answer 4 · 2025-02-28T22:53:02.837

I've implemented a pretty simple yet efficient encoding of normal Sudoku puzzles. On average it encodes any solution in 72.9 bits of data (this figure was measured on a sample of 120117 puzzles).

The logic is quite simple: after each value is inserted, calculate the number of possible values for each remaining cell (the 'possible' assessment uses the simple Sudoku rules: a unique digit per row, column and box). Then select the first cell with the fewest possible values and store its value in a BigInteger variable: the index of the actual value among that cell's possible values is folded into the encoded BigInteger, using the cell's count of possible values as the multiplier.

Thanks to the fact that this logic can be reversed during decoding it works flawlessly.
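Here is a minimal Python sketch of that idea, written from the description above rather than taken from the Open Sudoku source. Python integers play the role of BigInteger. One detail is an assumption made here so that decoding can run in the same cell order as encoding: each index is added with a running multiplier (mixed radix, first cell least significant). All function names are hypothetical.

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column or box of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def next_cell(grid):
    """First empty cell (reading order) with the fewest candidates."""
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                cands = candidates(grid, r, c)
                if best is None or len(cands) < len(best[2]):
                    best = (r, c, cands)
    return best

def encode(solution):
    grid = [[0] * 9 for _ in range(9)]
    code, radix = 0, 1
    for _ in range(81):
        r, c, cands = next_cell(grid)
        code += cands.index(solution[r][c]) * radix   # mixed-radix digit
        radix *= len(cands)
        grid[r][c] = solution[r][c]
    return code

def decode(code):
    grid = [[0] * 9 for _ in range(9)]
    for _ in range(81):
        r, c, cands = next_cell(grid)                 # same choice as the encoder
        code, index = divmod(code, len(cands))
        grid[r][c] = cands[index]
    return grid

if __name__ == "__main__":
    # A valid grid built from a shifting pattern, used only as test input.
    solution = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
                for r in range(9)]
    code = encode(solution)
    print(code.bit_length(), decode(code) == solution)
```

Because the decoder fills the grid in the same order, it always sees the same candidate lists as the encoder did, which is what makes the scheme reversible.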

To include any valid game state information it uses an additional 81-bit value (one bit for each cell). So to represent any valid Sudoku state it uses 153.9 bits on average in total. It is also converted to an alphanumeric string to show it in the game.

Improvements: I have also implemented a bit ( literally :D ) more efficient algorithm, which used DLX to check which values are really possible for each cell at any stage of encoding. It was, though, way more expensive on resources, and the gain was usually only 1 bit. It worked, it's just not quick enough for use inside the Android game.

Another improvement (as suggested elsewhere) would be to use the fact that the minimum number of clues required to define a solution is 17. Again, the gain seems too small to bother.

I've released the implementation in the Open Sudoku project, which I have been authoring since version 4. The code is available here (the game is under the GPLv3 license). You can take a look. Please let me know if you find any improvement that is not heavy on resources.
