
I have two collections of PDFs. One (collection1) is 1,000+ PDFs, much larger in file size (100+ GB), and split into illogical sections (think pdf 1 (1), 1 (3), ... when it could and should just be one file). The other (collection2) is 300 files.

Collection2 is supposed to be a compressed and organized version of collection1. I used Adobe Acrobat to process the files: I condensed multiple PDFs into single PDFs, then applied compression (and Bates numbering). After doing a few, I had a junior staff member take over...

And we've recently discovered that there are errors: sections missing compared to the original PDFs, and similar issues. This is a whopper of an error and something I'm hoping we can fix easily.

Not sure if what I'm looking for in this case is really diff, as I'd need to compare multiple files to one single file.

If I could isolate the problem files, I could fix those easily. The best I can figure right now is, perhaps surprisingly, Preview (macOS), which allows you to open multiple sets of files (and provides page counts). From there I can check the first page, the last page, and several in the middle. If these are consistent and the page count is consistent, the files are likely solid, from what I can tell from the errors. This isn't the most thorough solution, however.

Answers for similar questions are here and here; however, they are either several years old, Windows-specific (which is fine if need be, but not preferred in this particular case), or not at the scale I need to operate at. No one on my team has advanced technical skills, relative to the SU community, so a detailed answer or links out to relevant prereq knowledge would be much, much appreciated.

Thank you so much SU

Gryph

2 Answers


First, you absolutely need some way of mapping the 1,000 files to the 300 files, in order.

In the simplest case, you will have say "CIDOC Ontology 2.0 (1).pdf", "CIDOC Ontology 2.0 (2).pdf" and "CIDOC Ontology 2.0 (3).pdf" on one hand, and "CIDOC ontology.pdf" on the other.

Now, the best approaches I can figure are these:

  1. Using pdftk or pdf2json, extract the number of pages of each file in the 1,000 group, and see whether the sum corresponds to the matching file in the 300 group:

    12, 9, 10  vs.   31   = OK
    12, 9, 10  vs    22   = BAD (and you might suspect section 2 is missing)
    

    This method is quite basic and won't recognize three sections being out of order.

  2. Using pdf2ps and ps2ascii, create text versions of all the files. Depending on the PDF process, these might well be next to illegible, but that matters little: with a little luck, the tool used to coalesce the files will not have changed text metrics and grouping. If so, the concatenation of the three files will be very close to the fourth file (and if not, you'll mark it as an anomaly). So these heuristics should work:

    • the sum of the outputs of "wc" for the three files will equal (or be very close to) the output for the fourth file.
    • cat'ting the three text files, or the fourth file, through cat file1 file2 file3 | sed -e "s#\s#\n#g" | sort should yield almost identical word lists (the output from diff -Bbawd should be no more than three or four lines; ideally, none). If you omit the | sort stage, then sections out of order should be recognizable: if the sorted check matches and the unsorted does not, you're facing a section-out-of-order situation.

The sed part will split words, which might help even if the coalescing tool did alter the text somewhat. A change in kerning, with words turning out to have been split differently inside the PDF ("homeostasis" having become "ho meos tas is" from "home osta sis"), will render even this insufficient; but it's not that likely.
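The two checks above can be sketched in a few lines of Python, assuming you've already pulled the page counts (e.g. from pdftk) and the extracted text (e.g. from ps2ascii) into variables. The function names are my own, not part of any library; the word-bag comparison mirrors the sed/sort pipeline:

```python
from collections import Counter

def page_counts_match(part_counts, combined_count):
    """Check whether the per-section page counts sum to the combined file's count."""
    return sum(part_counts) == combined_count

def word_bag(text):
    """Split extracted text into a multiset of words (order-insensitive),
    mirroring the sed|sort pipeline above."""
    return Counter(text.split())

def sections_match(part_texts, combined_text):
    """Order-insensitive comparison: same words, same frequencies.
    If this matches but an ordered comparison does not, you're likely
    facing a section-out-of-order situation."""
    merged = Counter()
    for t in part_texts:
        merged += word_bag(t)
    return merged == word_bag(combined_text)
```

For example, `page_counts_match([12, 9, 10], 31)` is the "OK" case above, while `page_counts_match([12, 9, 10], 22)` is the "BAD" one.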

The difficulty I see is matching the raw files with the final. Having a sample of each, I could probably whip up a script to run the comparison.

LSerni

You could use a sequence alignment process similar to DNA sequence analysis. Specifically, a dynamic programming approach to sequence alignment.

Extract the text of each PDF in each collection, then attempt to align each individual text sequence from Collection 1 with each longer, concatenated sequence from Collection 2. A perfect match of any letter gets a score of one, and mismatches get zero. The overall score is the number of matches between aligned sequences. You can also allow for edits between sequences by introducing gaps.
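That scoring scheme (match = 1, mismatch = 0, gaps allowed at no extra cost) reduces to a short dynamic-programming recurrence. Here's a minimal sketch in Python; the function name and default scores are mine, not from any library, and a production run over 100+ GB of text would want a faster implementation:

```python
def alignment_score(a, b, match=1, mismatch=0, gap=0):
    """Global alignment score via dynamic programming (Needleman-Wunsch style).
    With match=1 and mismatch=gap=0, this counts the maximum number of
    characters that can be aligned identically, as described above."""
    # prev holds the scores for the previous row of the DP table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]
        for j, cb in enumerate(b, 1):
            score = match if ca == cb else mismatch
            curr.append(max(prev[j - 1] + score,  # align ca with cb
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]
```

Two rows of the table are kept instead of the full matrix, so memory stays linear in the shorter sequence; a Collection 1 file whose best score against its supposed Collection 2 home is far below its own length would be flagged as missing or mangled.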

The algorithm isn't hard, but might take a while to run. Given the dataset size you mentioned, I'm guessing it would run in a few hours or overnight.

Here's a link to the algorithm on Wikipedia: https://en.m.wikipedia.org/wiki/Sequence_alignment

KirkD_CO