5

Say there is a data table $D$ that we cannot see, with $M$ columns. We are given exact cross-tabulation frequencies for all ${M \choose 2}$ pairs of columns, that is how often each combination of two values occurs.

From the cross-tabulations, we can derive a set of possible rows $R$ of $D$ and maximum frequencies for each possible row.

How can we reconstruct the original table $D$? If there is not enough information to do so, how can we construct a possible table $D'$ that has the same cross-tab frequencies? In this case, is it possible to count the number of possible tables?

(Edit: As Vor noted, define a table as an unordered collection of rows. A permutation of the rows of a table yields the same table.)


For example, if $D$ has rows:

X  A  j
Y  A  k
X  B  j
X  B  j

We have three sets of cross-tab frequencies:

X  A  1
X  B  2
Y  A  1
Y  B  0

X  j  3
X  k  0
Y  j  0
Y  k  1

A  j  1
A  k  1
B  j  2
B  k  2

I would like an algorithm which will take the cross-tab frequencies as input and output the original table or a possible original table.

frafl
  • 2,339
  • 1
  • 17
  • 32
Sarkom
  • 213
  • 1
  • 5

1 Answers1

2

We have $M$ columns $C_1, \dots, C_M$ in the original table, each with a finite set of distinct values $V_1, \dots, V_M$.

Given the cross-table frequencies $c_{i,j}(a,b)$ (frequency of $a$ in $V_i$ and $b$ in $V_j$ occurring together), proceed as follows:

  1. Determine $V_1, \dots, V_M$ and denote $V_i = \{v^{(i)}_j \mid j \in [1,n_i] \}$ where $n_i = |V_i|$.
  2. Solve the following integer program:

    $\qquad\displaystyle \min \sum_{i_1 = 1}^{n_1} \cdots \sum_{i_M = 1}^{n_M} x_{(i_1,\dots,i_M)} \\ \qquad \begin{align} \text{s.t. } \sum_{i_1 = 1}^{n_1} \cdots \sum_{i_{l-1} = 1}^{n_{l-1}} \sum_{i_{l+1} = 1}^{n_{l+1}} \cdots \sum_{i_{r-1} = 1}^{n_{r-1}} \sum_{i_{r+1} 1}^{n_{r+1}} \cdots \sum_{i_M = 1}^{n_M} x_{(i_1, \dots, i_M)} &= c_{(l,r)}(i_l,i_r) \\ \qquad \text{for all } 1 \leq l < r \leq M, i_l \in [1,n_l], i_r \in [1,n_r], \\ x_{(i_1, \dots, i_M)} &\in \mathbb{N} \\ \qquad \text{for all } i_1 \in [1,n_1], \dots, i_M \in [1,n_M] \end{align}$

The $x_{(i_1, \dots, i_M)}$ now indicate which row should occur how often, and we get a solution with the minimum number of rows possible. If we can influence the search, we can abort as soon as we find any feasible solution.

This is not likely to be efficient, but there are honed solvers for integer programs around so it's simple to implement. It may be possible to use an LP-solver and round to a feasible solution.

Raphael
  • 73,212
  • 30
  • 182
  • 400