
Suppose we have a set of words $x_1, \dots, x_n$ of common length, randomly sampled from a dictionary, and that we have access to that dictionary. Let $d(x,y)$ be the Hamming distance between words $x$ and $y$. The Fréchet mean word minimizes the sum of squared Hamming distances: $$\bar{x} = \arg\min_y \sum_{i=1}^{n} d^2(x_i, y).$$ I am looking for algorithms that find such a minimizer. We could do a brute-force search over the dictionary, but if the dictionary is large or the words are long (or both), this could become a major computational bottleneck. Can we do better than brute force? Suppose that there are rules determining which words are in the dictionary, so we don't necessarily need to search the whole dictionary but can use the rules to generate valid words when needed. What optimization algorithms could find such a minimizer in reasonable time?

cgmil

1 Answer


If the Fréchet mean word can be any string, not only a dictionary word, the problem is an instance of the $p$-Norm Hamming Centroid problem with $p=2$. Given a set $S$ of $m$ strings, each of length $n$, and a real number $k$, Chen et al., 2019 show that for any fixed rational $p>1$, deciding whether there is a string $s^*$ such that $(\sum_{s\in S}\text{d}^p(s^*, s))^{1/p}\leq k$ is NP-hard. There are some positive results in the paper, though, including a sub-exponential time algorithm, an FPT algorithm, and a polynomial time 2-approximation algorithm.

In particular, I think the approximation algorithm is the most relevant and practical for your case. They show that if you pick the string from the input set that minimizes the sum of $p$-th powers of the distances to all of the input strings, then its $p$-norm is at most twice the $p$-norm of the optimal solution $s^*$. In other words, let $s_1=\text{arg}\min_{s'\in S}\sum_{s\in S}\text{d}^p(s', s)$; then

$$ \left(\sum_{s\in S}\text{d}^p(s_1, s)\right)^{1/p}\leq 2\cdot \left(\sum_{s\in S}\text{d}^p(s^*, s)\right)^{1/p} $$

For your case where $p=2$, squaring both sides gives

$$ \sum_{s\in S}\text{d}^2(s_1, s)\leq 4\cdot \sum_{s\in S}\text{d}^2(s^*, s). $$
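The selection rule above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the paper: the function names `hamming`, `cost`, and `best_input_word` are made up for this example, and it assumes all words are plain strings of equal length. It does $O(m^2 n)$ work for $m$ words of length $n$.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def cost(candidate: str, words: Sequence[str], p: int = 2) -> int:
    """Sum of p-th powers of Hamming distances from candidate to every word."""
    return sum(hamming(candidate, w) ** p for w in words)


def best_input_word(words: Sequence[str], p: int = 2) -> str:
    """Return the input word with the smallest cost (hypothetical helper).

    Only the input words themselves are tried as candidates, which is the
    selection rule behind the 2-approximation described above.
    """
    if not words:
        raise ValueError("need at least one word")
    return min(words, key=lambda c: cost(c, words, p))


if __name__ == "__main__":
    sample = ["cat", "cot", "dog", "cut"]
    center = best_input_word(sample)
    print(center, cost(center, sample))
```

Because the input words were sampled from the dictionary, the returned word is always a valid dictionary word. The best dictionary word can cost no less than the best unrestricted string $s^*$, so the factor-4 bound on the sum of squares also holds relative to the dictionary-restricted optimum.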

Throckmorton