Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/12/19 03:23:29 UTC

[GitHub] [drill] cgivre commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

cgivre commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-997322720


   @paul-rogers 
   Thanks for the thorough review. I addressed all of your comments (I think) and made the following changes:
   * Added additional unit tests
   * Refactored the table list so that tables are not read into memory unless requested
   * Added iterator classes to avoid counters in the batch reader
   * Moved metadata collection to a separate class
   * Refactored to allow a PDF with no tables to return metadata if requested
   * Added a config option for different extraction algorithms
   * General code cleanup
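
To illustrate the iterator and lazy-loading items above, here is a minimal sketch of the general pattern, not the code in this PR. The class name `LazyTableIterator`, the per-page `extractor` callback, and the `pageCount` parameter are all hypothetical; the point is only that a table is extracted when `next()` is called, and that the iterator holds the position so the batch reader needs no counter of its own.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch: produces one item (e.g. a table) per page, and only
 * runs the extraction for a page when that page is actually requested.
 */
final class LazyTableIterator<T> implements Iterator<T> {
  private final int pageCount;             // assumed known up front
  private final IntFunction<T> extractor;  // hypothetical per-page extraction callback
  private int nextPage = 0;                // position lives here, not in the caller

  LazyTableIterator(int pageCount, IntFunction<T> extractor) {
    this.pageCount = pageCount;
    this.extractor = extractor;
  }

  @Override
  public boolean hasNext() {
    return nextPage < pageCount;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException("No more pages");
    }
    // Extraction happens here, on demand, for exactly one page.
    return extractor.apply(nextPage++);
  }
}
```

A caller that only wants metadata can simply never call `next()`, in which case no table is extracted at all.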
   
   I removed all but one of the `System.env` calls, and I'm a little stuck on the last one. I added that line because querying a PDF with Drill in embedded mode opens an additional Java window. This does not happen when running the unit tests, which makes it difficult to debug. I'm going to keep digging into this, but could you take a look at the rest of the revisions in the meantime? The issue seems to be in either Tabula or PdfBox, the underlying libraries that read the PDF file. Thanks!
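
For context, an extra Java window of this kind often appears when a library initializes AWT in a JVM that is not running headless. Whether that is the cause here is an assumption, not something confirmed in this thread. A minimal sketch of the usual workaround, using the standard `java.awt.headless` system property (the class name `HeadlessSetup` is hypothetical):

```java
/**
 * Hypothetical helper: requests headless AWT before any AWT class is loaded.
 * Assumes the extra window comes from AWT initialization, which is unconfirmed.
 */
final class HeadlessSetup {
  private HeadlessSetup() {
  }

  static void forceHeadless() {
    // Standard JVM property; it must be set before AWT is first initialized
    // to have any effect.
    System.setProperty("java.awt.headless", "true");
  }
}
```

The same property can be passed on the JVM command line as `-Djava.awt.headless=true`, which avoids setting it from code.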

