Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/11/04 05:35:45 UTC

[GitHub] [drill] dzamo commented on pull request #2359: DRILL-8028: Add PDF Format Plugin

dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-960472017


   @cgivre @paul-rogers, my 2c.  I guess some partial precedents for a format plugin like this are ones like format-image and format-esri (as noted), though those do only go after the explicitly structured content.  It would not surprise me if there are quite often cases of unfortunate people needing to scrape data out of 10^0 - 10^5 PDFs.  The purist in me agrees with Paul's thought: a Groovy script over whatever Java lib is used here could be employed instead.  That would not automatically be parallelised like a Drill query is, so a user with many PDFs _might_ be pushed all the way through GNU parallel to Spark.  All that is followed by this recurring thought: if Drill is disciplined and focussed only on SQL against standard big data formats then it finds itself trying to compete in an uncomfortably crowded space.  Probably fatally crowded.
   
   I do also note that the "big" and "small" data worlds are not disjoint.  I have in practice joined big data in Parquet with small reference data in Excel (actually EtherCalc).  Even in the big data regime, reference data remains small and is maintained by humans in human formats rather than pumped out by machines in machine formats.
   
   This is getting long again.  Last thoughts.
   
   - I feel that, to merit inclusion, this really should be good at finding and parsing tables (working often, rather than seldom).
   - We should consider a subproject to contain our long tail of "non-standard" data formats.  This would separate away plugins that should not be expected to run with the speed or reliability of the core data formats, and keep the core distributable size down as we add format after format.  We could then start to distribute tarballs for `drill-core` and `drill-extra`.
   - This looks like a plugin that will benefit from Drill's optional schema.  That means we might have some unique ability to compete here.
   
   

