3

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]

Not like that.

But bigger patterns like this:

If I have to _________, then I will ___.

["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]

I WANT TO DISCOVER NEW PATTERNS LIKE THE ABOVE. I don't already know the patterns.

I want to discover patterns in a large array or database of strings, for example the contents of an entire book.

An example use would be finding the most common sentence structures a book uses.

The goal isn't to create the perfect algorithm or anything. I am willing to do it the brute-force way if need be, like you might when finding common substrings in sentences.

Is there a way to find patterns like this?

Mohit Gangrade

2 Answers

1

It's not easy, especially if you want any kind of pattern, with a varying number of words at any distance from each other.

The closest method I know would be to compute a huge co-occurrence matrix of n-grams:

  1. Extract all the possible $n$-grams with size $n\leq N$ (for instance $N=3$).
  2. Filter out the least frequent ones. Depending on the size of the data, the frequency threshold should be high enough to keep the number of n-grams manageable, but not so high that some patterns are missed.
  3. Given the resulting set of n-grams, count the number of co-occurrences (number of sentences containing both) for every pair of n-grams. Store this in the co-occurrence matrix.
  4. Extract the most common co-occurrences from the matrix.
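
The four steps above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names `ngrams` and `cooccurring_ngrams`, the whitespace tokenizer, and the default values of `max_n` and `min_count` are all assumptions made for the example, and a sparse `Counter` of pairs stands in for the full matrix.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, max_n):
    """Return the set of all n-grams (as tuples) with 1 <= n <= max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def cooccurring_ngrams(sentences, max_n=3, min_count=2):
    # Step 1: extract the n-grams of each sentence.
    # Assumption: a naive lowercase + whitespace tokenizer is good enough here.
    per_sentence = [ngrams(s.lower().split(), max_n) for s in sentences]

    # Step 2: keep only n-grams that occur in at least min_count sentences.
    freq = Counter(g for grams in per_sentence for g in grams)
    frequent = {g for g, c in freq.items() if c >= min_count}

    # Step 3: for every pair of frequent n-grams, count the sentences
    # that contain both (a sparse version of the co-occurrence matrix).
    pairs = Counter()
    for grams in per_sentence:
        kept = sorted(grams & frequent)
        pairs.update(combinations(kept, 2))

    # Step 4: most common co-occurrences first.
    return pairs.most_common()
```

Calling `cooccurring_ngrams(list_of_sentences)` returns `((ngram_a, ngram_b), count)` entries, most frequent first. Note that overlapping n-grams (for example `("if",)` and `("if", "i")`) trivially co-occur, so in practice you would probably want to filter those pairs out before reading the top of the list.
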
Erwan
1

The class of algorithms to search for is called "sequence alignment", usually found in bioinformatics. Examples include the Needleman-Wunsch algorithm (https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) and Hirschberg's algorithm (https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm).
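
As a rough sketch of how this could apply to sentences, here is a plain Needleman-Wunsch global alignment run over word tokens instead of nucleotides. The helper names `align_words` and `template`, the scoring values (`match=1`, `mismatch=-1`, `gap=-1`), and the `___` placeholder are assumptions chosen for illustration; nothing here is specific to any library.

```python
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists (Needleman-Wunsch); return aligned pairs.

    A pair (x, None) or (None, y) marks a gap in the other sequence.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def template(a, b, blank="___"):
    """Keep words the two sentences share; collapse everything else to a blank."""
    out = []
    for x, y in align_words(a, b):
        if x == y:
            out.append(x)
        elif not out or out[-1] != blank:
            out.append(blank)
    return " ".join(out)
```

For example, `template("if i have to do it then i will do it right".split(), "if i have to go then i will go now".split())` gives `if i have to ___ then i will ___`. This only compares two sentences at a time, so covering a whole book would mean aligning many pairs (or using a multiple-alignment method) and counting which templates recur.
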

sudoer