Saturday 17 September 2022

Monty P. strikes back!

Another pitfall with the "language" whose strength is mostly PR and fame. Ladies and gentlemen, my trial & error approach to .py libraries to extract some trash from a pdf.

tabula was my first choice, based on the celebrated "google search". But the trash has line breaks inside the cells. Wrong term... these are now ex-cells (sic!), and were cells in their previous life. (Most likely the pdf is generated from a spreadsheet, and of course the institution which publishes the pdf does not allow you to download the spreadsheet.) So tabula gives a trashy csv, and a lot of information is being "lost": I can't recover it with a single line splitter and formatter. Perhaps I'd have to reconstruct the original values manually, by un-wrapping the stuff from the "invalid" lines correctly. Some guesswork and a lot of bureaucratic coding...
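For the record, the "bureaucratic coding" could look roughly like the sketch below. It is only an illustration under a loud assumption: that a wrapped cell shows up in the csv as an extra row whose first (key) column is empty. The function name merge_wrapped_rows, the key_col parameter and the sample data are all made up for this example; the real csv may break in other ways.

```python
import csv
import io


def merge_wrapped_rows(rows, key_col=0):
    """Fold continuation rows into the row above them.

    Assumption (not verified against the real pdf): a row whose key
    column is empty is the wrapped remainder of the previous row.
    """
    merged = []
    for row in rows:
        if not row:
            continue  # skip completely empty csv lines
        key = row[key_col].strip() if key_col < len(row) else ""
        if merged and not key:
            previous = merged[-1]
            for i, cell in enumerate(row):
                cell = cell.strip()
                if not cell:
                    continue
                # pad the previous row if the continuation row is longer
                while len(previous) <= i:
                    previous.append("")
                previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Hypothetical sample: the second data row is the wrapped tail of the first.
SAMPLE = "id,name,note\n1,Acme,first line\n,Holding,second line\n2,Beta,ok\n"
fixed = merge_wrapped_rows(list(csv.reader(io.StringIO(SAMPLE))))
```

The guess work is all in deciding what counts as a continuation row; an empty key column is the simplest rule, and it will mis-merge rows where the key itself is legitimately empty.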

Trying to avoid this, I turned to the camelot library. It's very strange... First, the inside-the-script version runs silently for about 2 seconds, gives no warnings or messages, and produces no output:

python3 main.py --create-csv-only
tables done
finished 

Yes, the code looks like this:

tables = camelot.read_pdf(inFilePdf)
print('tables done')
tables.export(outFileCsv, f='csv') 
print('finished')

Both file names are passed from the calling function correctly. Proof: this function did work with tabula's function calls previously, before I swapped them for camelot-related stuff, and some csv was being produced. This behavior looks rather shitty to me; I'd much prefer a word of warning or an exception. Perhaps there's some empty catch block inside or so.
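One plausible explanation (a guess, not something I verified on this pdf): camelot.read_pdf defaults to pages='1' and flavor='lattice', so it only looks at the first page, and if it detects no tables there, the returned list is empty and export has nothing to write. The sketch below turns that silence into an exception by checking the table count (tables.n) before exporting. The wrapper name export_tables and its read_pdf parameter are hypothetical; the parameter exists only so the function can be exercised with a stub instead of the real library.

```python
def export_tables(pdf_path, csv_path, pages="all", flavor="lattice", read_pdf=None):
    """Export detected tables to csv, failing loudly when none are found.

    read_pdf is a hypothetical injection point; when it is None the real
    camelot.read_pdf is imported and used.
    """
    if read_pdf is None:
        import camelot  # lazy import, so a stub can replace it in tests

        read_pdf = camelot.read_pdf

    tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
    if tables.n == 0:
        # Without this check nothing is exported and nobody is told.
        raise RuntimeError(
            f"no tables detected in {pdf_path!r} "
            f"(pages={pages!r}, flavor={flavor!r})"
        )

    tables.export(csv_path, f="csv")
    return tables.n
```

Called as export_tables(inFilePdf, outFileCsv), this either returns the number of tables found or complains, which is the word of warning I was missing.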

Next, running the cli version from the shell is much more entertaining, at least until you get it right... After

nat@nutria:~/dev/py/ExtractTableData$ camelot lattice -back x.pdf
 
which I copied from the (supposed) documentation or faq list, I got the message:
 
Usage: camelot lattice [OPTIONS] FILEPATH
Error: Please specify output file path using --output

And when I call it with either --output=path or --output path, with or without quotes, it says:

Error: no such option: --output

Now, that's very Monty Pythonic indeed! I let it go for a couple of hours, so as not to get too mad... Later, after reading the output of camelot --help, it turned out that the previous message is simply inconsistent with the usage that actually works. The options come first, then the command (in the previous case: lattice), and then the input, like

camelot -o x.csv -f csv -p all stream x.pdf

The lattice command fails for some reason; I don't care why, nor about the distinction between lattice and stream.

I'm wondering if this output differs from the one generated by tabula... But I have to wait - the x.pdf is over 1700 pages.

Wow. After like 10 minutes of heating the cpu and using 0.8% of 16 GB of memory (I was just about to kill it, with signal 9, of course), it started giving messages about processed tables... Push, baby! Now the job is done: I got over 1700 csv files. The line breaks are handled, and it seems I can deal with these more easily than in the previous case.

Hell-aluyah and hail Satun! (Inspired by M. M. O'Hair, B. Larson, Z. LaVey and N. Schreck) Thank you, camelot! (South Park: thank you, clitoris!) Got to scrape some $s and donate to their project.
