What is a general algorithm to reorder a list of letters such that it minimizes the cost for a fixed list of words?
Algorithm input
To be optimized
- a list of unique letters:
[a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z]
Fixed
- a list of words:
[cat, dog, apple]
- this will be replaced with a list of my own words, which use only the letters in letters above (for example, there will not be the German ß in my words)
Algorithm output
- a reordered list of letters such that the sum of all word costs is minimized for the input list of words
- if there are multiple such lists, then any of them is acceptable
Terms
- word cost: the cost of a word is the lowest integer $i$ such that the set of letters in the word is a subset of $\{letters_1, letters_2, \ldots, letters_{i-1}, letters_i\}$
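Read another way, the definition above says that the cost of a word is the 1-based position of whichever of its letters comes last in letters. A minimal sketch of that reading, assuming every word is non-empty and uses only letters present in letters (the name word_cost_by_position is made up for illustration):

```python
def word_cost_by_position(letters, word):
    """Return the 1-based position of the letter of `word` that comes last in `letters`."""
    # Assumption: `letters` has no duplicates and contains every letter of `word`;
    # a missing letter raises KeyError.
    position = {letter: i + 1 for i, letter in enumerate(letters)}
    return max(position[ch] for ch in word)


alphabet = list("abcdefghijklmnopqrstuvwxyz")
costs = [word_cost_by_position(alphabet, w) for w in ["cat", "dog", "apple"]]
print(costs, sum(costs))  # [20, 15, 16] 51
```

Building the position table once and reusing it for every word would avoid rebuilding it per call; it is rebuilt here only to keep the sketch short.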
Examples
Text description
For example, if letters = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z] and my list of words were:
cat
dog
apple
then the cost would be 20 (looking at a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t) + 15 (looking at a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) + 16 (looking at a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p), which totals to 51.
However, we can optimize the letters to:
letters = [p, l, e, c, a, t, d, o, g, b, f, h, i, j, k, m, n, q, r, s, u, v, w, x, y, z].
The cost would then be 6 (looking at p, l, e, c, a, t) + 9 (looking at p, l, e, c, a, t, d, o, g) + 5 (looking at p, l, e, c, a), which totals to 20.
The heuristic of sorting letters by their frequency in words appears to do well, but it is not optimal: @greybeard showed that letters = ["g", "d", "o", "t", "c", "a", "p", "l", "e", ...] has a cost of 18, which beats the 20 that the frequency heuristic achieves on this example. That ordering comes from a different heuristic: "start with an empty list; for each word, for each letter, add it to the new list if it has not been seen before". This second heuristic is not always optimal either. Consider words = [a, b, bb] and letters = [a, b]: it yields [a, b] with cost 1 + 2 + 2 = 5, whereas [b, a] costs 2 + 1 + 1 = 4.
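The "first seen" heuristic and its counterexample can be checked with a short sketch. The helper names first_seen_order and total_cost are hypothetical, and the sketch assumes non-empty words that use only letters from letters:

```python
def first_seen_order(letters, words):
    """Hypothetical helper: order letters by their first appearance in `words`."""
    ordered = []
    seen = set()
    for word in words:
        for ch in word:
            if ch not in seen:
                seen.add(ch)
                ordered.append(ch)
    # Letters that occur in no word go to the back, in their original order.
    ordered.extend(ch for ch in letters if ch not in seen)
    return ordered


def total_cost(letters, words):
    """Sum of word costs; a word's cost is the position of its last-placed letter."""
    position = {ch: i + 1 for i, ch in enumerate(letters)}
    return sum(max(position[ch] for ch in word) for word in words)


words = ["a", "b", "bb"]
heuristic = first_seen_order(["a", "b"], words)
print(heuristic, total_cost(heuristic, words))    # ['a', 'b'] 5
print(["b", "a"], total_cost(["b", "a"], words))  # ['b', 'a'] 4
```

The result of this heuristic depends on the order in which the words are listed, which is one reason it cannot be optimal in general.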
Program
Try it online! (using the "most frequent letter heuristic")
from collections import Counter


# TODO: optimize this function by returning a new list of letters such that cost is minimized
# Currently, it uses the heuristic of sorting letters by the frequency in words
def optimize_letters(letters, words):
    letter_freq = Counter()
    for word in words:
        letter_freq += Counter(word)

    optimized_ordered_letters = [letter for (letter, count) in letter_freq.most_common()]

    # add in letters that did not appear in words at the back
    for letter in letters:
        if letter not in optimized_ordered_letters:
            optimized_ordered_letters.append(letter)

    return optimized_ordered_letters


words = ["cat", "dog", "apple"]
letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
# hand-optimized ordering from the text description above:
# letters = ["p", "l", "e", "c", "a", "t", "d", "o", "g", "b", "f", "h", "i", "j", "k", "m", "n", "q", "r", "s", "u", "v", "w", "x", "y", "z"]
letters = optimize_letters(letters, words)


def word_cost(letters, word):
    # smallest prefix length of letters that contains every letter of word
    word_letters = set(word)
    for i in range(len(letters)):
        if set(letters[0:i + 1]) >= word_letters:
            return i + 1
    raise Exception("letters does not contain all letters in word")


for word in words:
    cost = word_cost(letters, word)
    print(f"cost of '{word}' is {cost}")

total_cost = sum(word_cost(letters, word) for word in words)
print(f"total cost of the list of words {words} is {total_cost} for the list of letters {letters}")
cost of 'cat' is 4
cost of 'dog' is 7
cost of 'apple' is 9
total cost of the list of words ['cat', 'dog', 'apple'] is 20 for the list of letters ['a', 'p', 'c', 't', 'd', 'o', 'g', 'l', 'e', 'b', 'f', 'h', 'i', 'j', 'k', 'm', 'n', 'q', 'r', 's', 'u', 'v', 'w', 'x', 'y', 'z']
Through a brute force approach, the minimum cost for the example is 18.
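For reference, a brute-force check along these lines could look like the sketch below. It is not necessarily the exact search used for the figure above: the function name brute_force_order is made up, and the sketch assumes non-empty words that use only letters from letters. It permutes only the letters that occur in some word, since letters that occur in no word cannot lower any cost and can simply go last. The running time is factorial in the number of distinct letters used, so this is only feasible for small inputs such as the example (9 distinct letters).

```python
from itertools import permutations


def brute_force_order(letters, words):
    """Try every ordering of the letters that occur in `words`; return (order, cost)."""
    used = sorted(set("".join(words)))
    used_set = set(used)
    best_cost, best_prefix = None, None
    for perm in permutations(used):
        position = {ch: i + 1 for i, ch in enumerate(perm)}
        # A word's cost is the position of whichever of its letters is placed last.
        cost = sum(max(position[ch] for ch in word) for word in words)
        if best_cost is None or cost < best_cost:
            best_cost, best_prefix = cost, perm
    # Letters that occur in no word are appended in their original order.
    rest = [ch for ch in letters if ch not in used_set]
    return list(best_prefix) + rest, best_cost


alphabet = list("abcdefghijklmnopqrstuvwxyz")
order, best = brute_force_order(alphabet, ["cat", "dog", "apple"])
print(best)  # 18
```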