Posted to corpora-dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/06/26 15:53:50 UTC

packaged subsets

All,

   I received a request to package some of the bugtracker data for easier
downloading.  For this request, I've zipped the PDFs and FDFs from the
bugtrackers and made those zips available here:
https://corpora.tika.apache.org/base/packaged/pdfs/

   I don't think we'll be inundated with one-off requests, and I don't
think we should be zipping large chunks of govdocs1 or commoncrawl.

   Are there any objections?  Is there a better way to package data and/or
make it available/browsable/navigable/retrievable?

  Cheers,

                 Tim