Questions tagged [string-metrics]

Questions about different distance metrics on strings.

Popular examples include:

See the Wikipedia article for a more expansive list.

60 questions
43
votes
2 answers

Efficient data structures for building a fast spell checker

I'm trying to write a spell-checker which should work with a pretty large dictionary. I really want an efficient way to index my dictionary data so that I can use the Damerau-Levenshtein distance to determine which words are closest to the misspelled…
Charles Menguy
12
votes
1 answer

Edit distance of list with unique elements

Levenshtein edit distance between lists is a well-studied problem. But I can't find much on possible improvements if it is known that no element occurs more than once in each list. Let's also assume that the elements are…
user362178
11
votes
2 answers

Fast k mismatch string matching algorithm

I am looking for a fast k-mismatch string matching algorithm. Given a pattern string P of length m, and a text string T of length n, I need a fast (linear time) algorithm to find all positions where P matches a substring of T with at most k…
Paresh
11
votes
1 answer

Determining how similar a given string is to a collection of strings

I'm not sure if this question belongs here and I apologize if not. What I am looking to do is to develop a programmatic way in which I can probabilistically determine whether a given string "belongs" in a bag of strings. For example, if I have bag…
Andrew
10
votes
1 answer

Micro-optimisation for edit distance computation: is it valid?

On Wikipedia, an implementation for the bottom-up dynamic programming scheme for the edit distance is given. It does not follow the definition completely; inner cells are computed thus: if s[i] = t[j] then d[i, j] := d[i-1, j-1] // no…
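For reference, here is a minimal Python sketch of the bottom-up scheme the excerpt refers to. The function name `edit_distance` and the `shortcut` flag are hypothetical, introduced only for illustration. With `shortcut=True`, a cell whose characters match copies the diagonal value directly; with `shortcut=False`, every cell takes the three-way minimum from the definition. The two variants can then be compared on sample inputs.

```python
def edit_distance(s, t, shortcut=True):
    """Levenshtein distance between s and t via bottom-up dynamic programming.

    shortcut=True  : on a character match, copy d[i-1][j-1] without a min().
    shortcut=False : always take the minimum over the three neighbours.
    """
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and the first j of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = s[i - 1] == t[j - 1]
            if shortcut and match:
                d[i][j] = d[i - 1][j - 1]  # no operation needed
            else:
                cost = 0 if match else 1
                d[i][j] = min(
                    d[i - 1][j] + 1,         # deletion
                    d[i][j - 1] + 1,         # insertion
                    d[i - 1][j - 1] + cost,  # substitution or match
                )
    return d[m][n]
```

Calling `edit_distance("kitten", "sitting")` with either setting of `shortcut` gives the same value, which is consistent with the shortcut being safe: on a match, the diagonal neighbour is never larger than either other neighbour plus one.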
9
votes
1 answer

Expressing an arbitrary permutation as a sequence of (insert, move, delete) operations

Suppose I have two strings. Call them $A$ and $B$. Neither string has any repeated characters. How can I find the shortest sequence of insert, move, and delete operations that turns $A$ into $B$, where: insert(char, offset) inserts char at the given…
Geoff
9
votes
2 answers

Alternative to Hamming distance for permutations

I have two strings, where one is a permutation of the other. I was wondering if there is an alternative to Hamming distance where instead of finding the minimum number of substitutions required, it would find the minimum number of translocations…
9
votes
1 answer

Can an Earley Parser be made into a fuzzy parser similar to the Levenshtein Automata Algo for DFA?

There's a way to perform fuzzy parsing (accepting strings even with typos, up to a certain edit distance) with a DFA and a run-time constructed Levenshtein automaton of the input word. Can something similar be done with an Earley parser? I'm finding it…
8
votes
1 answer

Find all pairs of strings in a set with Levenshtein distance < d

I have a set of $n = $ 100 million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the…
1''
8
votes
1 answer

Is there a continuous hash?

Questions: Can there be a (cryptographically secure) hash that preserves the information topology of $\{0,1\}^{*}$? Can we add an efficiently computable closeness predicate which given $h_k(x)$ and $h_k(y)$ (or $y$ itself) tells us if $y$ is very…
Kaveh
7
votes
1 answer

Levenshtein distance and dynamic time warp

I am not sure how to draw a parallel between the Wagner–Fischer algorithm and the DTW algorithm. In both cases we want to find the distance for each index combination (i,j). In Wagner–Fischer, we initialise the distance with the number of insertions we'd have to do from…
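To make the comparison concrete, here is a small Python sketch of a textbook dynamic time warping table. The function name `dtw` and the default absolute-difference cost are assumptions for illustration, not taken from the question. The table has the same shape as in Wagner–Fischer, but the border cells are set to infinity rather than to `i` and `j`, because DTW has no insertion or deletion against an empty prefix, and every cell adds the local cost of pairing the two current elements.

```python
import math


def dtw(a, b, cost=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two numeric sequences a and b."""
    m, n = len(a), len(b)
    # D[i][j] = best cost of aligning the first i items of a with the first j of b
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # only the empty/empty alignment is free

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = cost(a[i - 1], b[j - 1]) + min(
                D[i - 1][j],      # a advances, b[j-1] is reused
                D[i][j - 1],      # b advances, a[i-1] is reused
                D[i - 1][j - 1],  # both advance
            )
    return D[m][n]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` evaluates to zero because the repeated `2` can be matched against the same element, whereas an edit distance over the same sequences would charge one insertion.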
7
votes
1 answer

Find string that minimizes the sum of the edit distances to all other strings in set

I have a set of strings $S$ and I am using the edit-distance (Levenshtein) to measure the distance between all pairs. Is there an algorithm for finding the string $x$ which minimizes the sum of the distances to all strings in $S$, that is $\arg_x…
7
votes
2 answers

How many strings are close to a given set of strings?

This question has been prompted by Efficient data structures for building a fast spell checker. Given two strings $u,v$, we say they are $k$-close if their Damerau–Levenshtein distance¹ is small, i.e. $\operatorname{LD}(u,v) \leq k$ for a fixed $k…
Raphael
6
votes
1 answer

Find the closest string to a fixed set of strings

I want to find the closest string to a fixed set of strings. The strings are all equal in length, and the number of strings in the set is relatively small (compared to all the possible strings of the fixed size). For this problem you can assume the…
spyr03
5
votes
3 answers

Find member of CFL that is Levenshtein-closest to non-member string

Is there an (efficient?) algorithm which given a context-free language $L$ (given as a grammar) and a string $x$ with $x \not \in L$ computes a $y$ with $y \in L$ and $\forall y': y' \in L \implies d(x, y) \le d(x, y')$, where $d$ is the Levenshtein…