How to find common patterns in thousands of strings?

Question

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]

Not like that.

But bigger patterns like this:

If I have to _________, then I will ___.

["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]

I WANT TO DISCOVER NEW PATTERNS LIKE THE ABOVE. I don't already know the patterns.

I want to discover patterns in a large array or database of strings. Say going over the contents of an entire book.

Example usage of this would be finding the most common sentence structures a book uses.

The goal isn't to create the perfect algorithm. I am willing to do it the brute-force way if need be, the way you might when finding common substrings in sentences.

Is there a way to find patterns like this?

score 1 · Answer 1 · answered Jun 13 '22 at 17:28

It's not easy, especially if you want any kind of pattern, with a varying number of words at any distance from each other.

The closest method I know of would be to compute a huge co-occurrence matrix over n-grams:

Extract all the possible $n$-grams with size $n\leq N$ (for instance $N=3$).
Filter out the least frequent ones. Depending on the size of the data, the frequency threshold should be high enough to keep the number of n-grams manageable, but not so high that some patterns are missed.
Given the resulting set of n-grams, count the number of co-occurrences (the number of sentences containing both) for every pair of n-grams. Store this in the co-occurrence matrix.
Extract the most common co-occurrences from the matrix.
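
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the function names (`tokenize`, `ngrams_in`, `cooccurring_ngrams`), the regex tokenizer, and the `min_count` threshold are all assumptions made for the example, and a sparse `Counter` stands in for the dense co-occurrence matrix.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(sentence):
    # Lowercase word tokens; punctuation is dropped (an assumption of this sketch).
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams_in(tokens, max_n):
    # Distinct n-grams (as tuples of words) with 1 <= n <= max_n in one sentence.
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def cooccurring_ngrams(sentences, max_n=3, min_count=2):
    per_sentence = [ngrams_in(tokenize(s), max_n) for s in sentences]

    # Count in how many sentences each n-gram appears, then drop the rare ones.
    freq = Counter(g for grams in per_sentence for g in grams)
    kept = {g for g, c in freq.items() if c >= min_count}

    # For every pair of kept n-grams, count the sentences containing both.
    pairs = Counter()
    for grams in per_sentence:
        for a, b in combinations(sorted(grams & kept), 2):
            pairs[(a, b)] += 1

    # Most frequent co-occurring pairs first.
    return pairs.most_common()
```

Note that an n-gram always co-occurs with its own sub-n-grams (for example `("i", "have", "to")` with `("have", "to")`), so in practice you would probably want to filter out overlapping pairs before reading the top of the list.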

score 1 · Answer 2 · sudoer · answered Nov 04 '22 at 12:56

The class of algorithms to search for is called "sequence alignment", which is mostly used in bioinformatics. Examples: the Needleman–Wunsch algorithm (https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) or Hirschberg's algorithm (https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm).
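
To make the idea concrete, here is a small sketch of Needleman–Wunsch applied to words instead of nucleotides. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the helper names `align_words` and `template` are assumptions chosen for illustration; it aligns only two sentences at a time, so a whole book would need this applied pairwise or replaced by a multiple-alignment method.

```python
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns a list of (x, y) pairs, where None marks a gap on that side.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def template(pairs, blank="___"):
    # Keep words that match on both sides; collapse everything else into blanks.
    out = []
    for x, y in pairs:
        token = x if x is not None and x == y else blank
        if token != blank or not out or out[-1] != blank:
            out.append(token)
    return " ".join(out)
```

For example, aligning the token lists of `if i have to go then i will win` and `if i have to stay then i will lose` and passing the result to `template` gives `if i have to ___ then i will ___`, which is the kind of pattern the question is after.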

