Please help me understand the algorithm for building the KMP failure function

Question

I am struggling to grasp the algorithm for building the KMP failure function. Most of my confusion concerns the line length=PI[length-1]. The pseudocode for the algorithm is below. Here are my questions:

1.) How do we know that, when s[i] != s[length], the best possible candidate prefix length is PI[length-1]?

2.) If the candidate PI[length-1] fails and the next iteration of the while loop moves to PI[PI[length-1]-1], how do we know that the best possible candidate length is not actually BETWEEN PI[length-1] and PI[PI[length-1]-1]?

3.) When s[i] != s[length], how do we know there is no suffix (that matches a prefix) that ends at s[i] and begins at a point in the string PRIOR to i-length?

I think my confusion could best be cleared by short informal proofs.

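def f(s):
    # PI[i] = length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i]
    PI = [0] * len(s)
    length=0  # length of the prefix/suffix we are currently trying to extend
    i=1
    while i < len(s):
        if s[length] == s[i]:
            length+=1
            PI[i]=length
            i+=1
        else:
            if length!=0:
                # fall back to the next shorter candidate; i is not advanced
                length=PI[length-1]
            else:
                PI[i]=0
                i+=1
    return PI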

Thank you!

score 2 · Answer 1 · answered Jul 16 '21 at 09:12

The failure link at a position $k$ in the string points to the longest proper prefix of the string that is also a suffix of the part of the string before position $k$.
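To make the definition concrete, here is a minimal brute-force sketch in Python. It assumes the indexing used in the question, where `PI[i]` describes the prefix `s[0..i]`; the names `longest_border` and `failure_bruteforce` are made up for illustration. It is far too slow for real use, but it is a handy reference to compare the linear-time algorithm against.

```python
def longest_border(t: str) -> int:
    """Length of the longest proper prefix of t that is also a suffix of t."""
    for k in range(len(t) - 1, 0, -1):  # try the longest candidates first
        if t[:k] == t[-k:]:
            return k
    return 0


def failure_bruteforce(s: str) -> list[int]:
    """PI[i] computed straight from the definition, for the prefix s[0..i]."""
    return [longest_border(s[: i + 1]) for i in range(len(s))]


print(failure_bruteforce("abacabab"))  # [0, 0, 1, 0, 1, 2, 3, 2]
```

Note the last entry: the value drops from 3 to 2 rather than to 0, because `"ab"` is still both a prefix and a suffix of the whole string.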

The first observation we need is that all the prefixes that are also suffixes at that point can be found by repeatedly following the failure links. This implies that no candidate length lies in between two consecutive links.
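This observation can be checked on examples. The sketch below (Python, using the question's `PI` convention; `prefix_function`, `border_chain` and `borders_bruteforce` are hypothetical helper names) walks the failure links from the end of the string and compares the lengths it visits against a brute-force list of every prefix that is also a suffix.

```python
def prefix_function(s: str) -> list[int]:
    """Standard linear-time failure array: pi[i] is for the prefix s[0..i]."""
    pi = [0] * len(s)
    length = 0
    for i in range(1, len(s)):
        while length > 0 and s[i] != s[length]:
            length = pi[length - 1]  # fall back along the failure links
        if s[i] == s[length]:
            length += 1
        pi[i] = length
    return pi


def border_chain(pi: list[int]) -> list[int]:
    """Lengths reached by repeatedly following failure links from the end."""
    lengths = []
    length = pi[-1] if pi else 0
    while length > 0:
        lengths.append(length)
        length = pi[length - 1]
    return lengths


def borders_bruteforce(s: str) -> list[int]:
    """Every proper prefix length that is also a suffix, longest first."""
    return [k for k in range(len(s) - 1, 0, -1) if s[:k] == s[-k:]]


s = "abacabadabacaba"
print(border_chain(prefix_function(s)))  # [7, 3, 1]
print(borders_bruteforce(s))             # [7, 3, 1]
```

The two lists agree, so in this example nothing sits between 7 and 3 or between 3 and 1.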

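So, if the failure link at $k$ points to $t_0$, then the string indicated by $(0)$ in my diagram is the longest prefix/suffix at $k$. The next prefix that is also a suffix at $k$ is $t_1$, the failure link at $t_0$. The reason is that the two strings $(1)$ at the prefix match (by definition of the failure link at $t_0$), and the same string can also be found as $(1')$ at the end of the suffix, because the suffix $(0)$ is a copy of the prefix $(0)$. Thus, any prefix/suffix at $t_0$ must also be a prefix/suffix at $k$. Conversely, any prefix/suffix at $k$ that is shorter than $t_0$ lies inside both copies of $(0)$ and is therefore a prefix/suffix at $t_0$, so following the link skips nothing.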
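If you would rather test this nesting argument than take it on trust, here is a small brute-force check in Python (`borders` and `check_nesting` are hypothetical names). For every prefix of `s` it takes the longest prefix/suffix length $t_0$ and verifies that the shorter prefix/suffix lengths of that prefix are exactly the prefix/suffix lengths of `s[:t0]`, in both directions.

```python
def borders(t: str) -> list[int]:
    """All lengths k with 0 < k < len(t) such that t[:k] == t[-k:]."""
    return [k for k in range(1, len(t)) if t[:k] == t[-k:]]


def check_nesting(s: str) -> bool:
    """True if, for every prefix t of s, the borders of t that are shorter
    than its longest border t0 are exactly the borders of t[:t0]."""
    for k in range(1, len(s) + 1):
        t = s[:k]
        b = borders(t)
        if not b:
            continue  # no prefix/suffix at all, nothing to check
        t0 = max(b)
        shorter = [x for x in b if x < t0]
        if shorter != borders(t[:t0]):
            return False
    return True


print(check_nesting("abacabadabacaba"))  # True
print(check_nesting("aabaabaaab"))       # True
```

This is only a sanity check on examples, not a proof, but it mirrors the argument: the candidates the while loop tries after a mismatch are exactly the shorter prefix/suffixes, in decreasing order.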
