I have a table in a PDF file with more than 100,000 rows and over 1,900 pages, which I decided to write to a .csv file with the R package tabulizer.
When I try to extract all the data from the PDF file with
pdf <- extract_tables("pdffile.pdf", method = "csv")
I get this error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded
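A commonly suggested mitigation for this error is to give the JVM a larger heap before rJava is loaded (it has to be set before library(tabulizer)); the 8g below is just an example value and I am not sure it alone would cope with a 1900-page file:

options(java.parameters = "-Xmx8g") # must run before rJava/tabulizer are loaded
library(tabulizer)
pdf <- extract_tables("pdffile.pdf", method = "csv")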
So I took another approach: extract the PDF one page at a time and save each page's output as its own .csv file.
1) Get the number of pages of the PDF file.
pdfPages <- length(get_page_dims("pdffile.pdf"))
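If I am not mistaken, tabulizer also has get_n_pages(), which should give the same count directly:

pdfPages <- get_n_pages("pdffile.pdf")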
2) Create a for loop to store a .csv file for each page.
for (i in 1:pdfPages) {
  # extract the table(s) on page i as data frames
  page <- extract_tables("pdffile.pdf", pages = i, method = "data.frame")
  # write that page's output to <i>.csv
  write.csv(page, file = paste(i, ".csv", sep = ""))
}
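As far as I understand, extract_tables() with pages = i returns a list with one element per table detected on that page; since each page here holds a single chunk of the big table, write.csv() can handle the result. A quick way to inspect one page (page 1 is just an example):

page1 <- extract_tables("pdffile.pdf", pages = 1, method = "data.frame")
str(page1, max.level = 1) # a list of data frames, one per detected table
head(page1[[1]]) # the table found on that page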
3) Then create another loop that reads the files back in one by one and rbinds each one to the result so far.
library(dplyr)            # for bind_rows()
dataPdf <- data.frame()   # empty data frame to collect every page's rows
for (i in 1:pdfPages) {
  page <- read.csv(paste(i, ".csv", sep = ""))
  dataPdf <- bind_rows(dataPdf, page)
}
I had to use bind_rows() from the dplyr package since not all of the .csv files ended up with the same number of columns.
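Since bind_rows() also accepts a list of data frames, the read-and-bind step could be written with lapply() instead of growing dataPdf inside the loop; I have not timed whether this is noticeably faster:

library(dplyr)
csvFiles <- paste(1:pdfPages, ".csv", sep = "")
dataPdf <- bind_rows(lapply(csvFiles, read.csv))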
The result was more than satisfactory, but it took about 1.75 hours to complete, so I am wondering whether there is a better way to do it. Any ideas?