Parse PDF into JSON or XML

Question

I want to create a neural net that can extract specific words from a PDF document into JSON or XML. For example, let's assume that I have a PDF containing information about countries, and I want to extract each country's name and population to obtain something like this:

<countries>
  <country>
    <name>
      France
    </name>
    <population>
      70m
    </population>
  </country>
.
.
.
</countries>

Should I build a neural net and train it myself? If so, can you recommend a good tutorial to follow? Or is there an already trained one that I can use?

score 1 · Answer 1 · answered Aug 18 '18 at 19:06

Unless your goal is specifically to build a neural net, this can be done in a much simpler way. For country names, for example, you can just check the text against a list of known country names, and so on. At most, some basic NLP could give you what you want. A neural net solution might be overkill.
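
To illustrate the list-lookup idea, here is a minimal Python sketch. It assumes the text has already been extracted from the PDF into a plain string (the extraction step is not shown). The names `KNOWN_COUNTRIES`, `POPULATION_RE` and `extract_countries` are hypothetical, and the assumed population format (a number followed by "m" or "bn", as in "70m") is only a guess based on the example in the question.

```python
import re

# Hypothetical, deliberately tiny lookup list; a real one would cover every country.
KNOWN_COUNTRIES = ["France", "Germany", "Japan"]

# Assumed population format: a number followed by "m" or "bn", e.g. "70m".
POPULATION_RE = re.compile(r"\b(\d+(?:\.\d+)?\s?(?:m|bn))\b", re.IGNORECASE)


def extract_countries(text):
    """Return one {"name", "population"} dict per sentence that names a known country."""
    records = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for country in KNOWN_COUNTRIES:
            if re.search(rf"\b{re.escape(country)}\b", sentence):
                match = POPULATION_RE.search(sentence)
                records.append({
                    "name": country,
                    # None when the sentence has no recognizable population figure.
                    "population": match.group(1) if match else None,
                })
    return records


sample = "France has a population of 70m. Japan has about 125m people."
print(extract_countries(sample))
```

This only works when the country and its population appear in the same sentence; tables or multi-line layouts in the PDF would need different handling.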

If a neural net is compulsory, then I think you could get a better answer if you specified some details: are you looking for a fixed set of fields, what kind of text content do the PDFs contain, and so on?

Also, in case you were thinking that a neural net would give you JSON as output: that will not be the case. You would have to convert the neural net's output to JSON yourself, but that conversion is trivial, so it hardly needs mentioning.
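
For what it's worth, the conversion step could look roughly like the sketch below, using only the Python standard library. It assumes the extracted fields arrive as a list of dicts with `name` and `population` keys; that shape, and the helper names `to_json` and `to_xml`, are assumptions for illustration, not something any particular model produces.

```python
import json
import xml.etree.ElementTree as ET


def to_json(records):
    """Serialize the records as a JSON object with a top-level "countries" key."""
    return json.dumps({"countries": records}, indent=2)


def to_xml(records):
    """Serialize the records using the <countries>/<country> layout from the question."""
    root = ET.Element("countries")
    for record in records:
        country = ET.SubElement(root, "country")
        ET.SubElement(country, "name").text = record["name"]
        ET.SubElement(country, "population").text = record["population"]
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Assumed shape of the extracted data, whatever method produced it.
records = [{"name": "France", "population": "70m"}]
print(to_json(records))
print(to_xml(records))
```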

I know I have not answered your question directly, but I hope this gives you some direction.
