Identify given patterns in unstructured data like text files

Question

I wasn't sure whether to ask this here or on Stack Overflow, but since I am looking for research papers and algorithms as well as code, I decided to ask here.

Given a text, I can manually write a regex that finds everything I want to extract from the file. What I am looking for is an algorithm or a piece of research that lets you highlight (as input) several occurrences of the same kind of repeated data in the text file, train on those examples, and then identify all the other occurrences that appear under the same conditions as the ones you highlighted.

For example, say I have a text with several titles, each preceded by \n\n and followed by \n\n\n. That is easy with a regex, but I want to do it dynamically.
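
For reference, the hand-written version of this example might look like the sketch below. It assumes Python's re module, and the sample text and the name TITLE_RE are made up for illustration. Note that a title at the very start of the file would not be matched, because nothing precedes it.

```python
import re

# Made-up sample: each title is preceded by \n\n and followed by \n\n\n.
text = (
    "Intro paragraph.\n\n"
    "First Title\n\n\n"
    "Body of the first section.\n\n"
    "Second Title\n\n\n"
    "Body of the second section.\n"
)

# One line of text with two newlines before it and three after it.
TITLE_RE = re.compile(r"(?<=\n\n)([^\n]+)(?=\n\n\n)")

titles = TITLE_RE.findall(text)
print(titles)  # ['First Title', 'Second Title']
```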

One idea is to build an algorithm that takes examples and generates the regex automatically. But I am not aware of any research like this, and there may be other techniques that achieve the same thing.

Any ideas?

score 0 · Accepted Answer · answered Sep 02 '15 at 00:22

That is exactly what the Trifacta product does (among other features). It uses the Wrangle language, a DSL (domain-specific language) designed for data manipulation. There is also a much earlier research project called Wrangler from the same people, and the Wrangler papers might give you ideas.
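
To make the "examples in, regex out" idea from the question a bit more concrete, here is a minimal sketch of one naive way to do it: keep only the text that every highlighted example shares immediately before and immediately after it, and turn those shared delimiters into a regex. This is not how Trifacta or Wrangler work; the helper learn_delimiter_regex and the sample text are hypothetical, and the sketch assumes the wanted value never spans more than one line and that at least two examples are given (with a single example the whole surrounding text would be treated as the delimiter).

```python
import re
from os.path import commonprefix


def learn_delimiter_regex(text, spans):
    """Hypothetical helper: infer a regex from highlighted (start, end) spans.

    Keeps the longest common suffix of the text before each span and the
    longest common prefix of the text after each span.
    """
    # Reverse the left contexts so that a common prefix is a common suffix.
    before = [text[:start][::-1] for start, _ in spans]
    after = [text[end:] for _, end in spans]
    left = commonprefix(before)[::-1]
    right = commonprefix(after)
    if not left or not right:
        raise ValueError("the examples share no delimiter on one side")
    # Assumption: the extracted value stays on one line, hence the lazy .+?
    return re.compile(f"(?<={re.escape(left)})(.+?)(?={re.escape(right)})")


# Made-up sample text; only the first two titles are "highlighted".
text = (
    "Intro paragraph\n\n"
    "First Title\n\n\n"
    "The first section.\n\n"
    "Second Title\n\n\n"
    "Another section!\n\n"
    "Third Title\n\n\n"
    "Final section.\n"
)
examples = ["First Title", "Second Title"]
spans = [(text.index(t), text.index(t) + len(t)) for t in examples]

pattern = learn_delimiter_regex(text, spans)
print(pattern.findall(text))  # ['First Title', 'Second Title', 'Third Title']
```

A sketch like this overfits easily: if every highlighted example happens to be followed by the same word, that word becomes part of the learned delimiter. Adding more varied examples is the simplest way to loosen the pattern.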
