Posted to dev@poi.apache.org by Tim Allison <ta...@apache.org> on 2019/04/22 20:29:50 UTC

[COMPRESS] zip-based entry names/metadata data set available

All,
  For some recent work on Apache Tika, I used commons-compress to
extract entry names and metadata via a streaming read from roughly
500k zip-based files we have in Tika's regression corpus.
  I was happy to see we have some POI-generated files in there. :)
  I noticed some room for improvement in Tika's detection of zip-based
files, and I noticed that MSOffice OOXML files nearly always place
[Content_Types].xml as the first physical entry, which we could use in
Tika to help speed up detection.
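
  A minimal sketch of that first-entry check, in Python for brevity and
using only the standard library. This is not Tika's detector and it does
not use commons-compress; the function names are hypothetical. It reads
only the fixed 30-byte local file header at the start of the stream plus
the entry name, so nothing is decompressed. Since OOXML files place
[Content_Types].xml first "nearly always" rather than always, a miss here
should fall back to a full scan instead of ruling OOXML out.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a zip local file header (little-endian):
# signature, version, flags, method, mod time, mod date,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_HEADER_SIGNATURE = 0x04034B50  # bytes "PK\x03\x04"


def first_physical_entry_name(stream):
    """Return the name of the first physical entry, or None if the
    stream does not begin with a complete local file header."""
    fixed = stream.read(LOCAL_HEADER.size)
    if len(fixed) < LOCAL_HEADER.size:
        return None
    fields = LOCAL_HEADER.unpack(fixed)
    if fields[0] != LOCAL_HEADER_SIGNATURE:
        return None
    flags, name_length = fields[2], fields[9]
    raw_name = stream.read(name_length)
    if len(raw_name) < name_length:
        return None
    # Flag bit 11 marks a UTF-8 name; otherwise the zip default is CP437.
    return raw_name.decode("utf-8" if flags & 0x800 else "cp437")


def looks_like_ooxml(stream):
    """Hypothetical fast path: True if the first entry is [Content_Types].xml."""
    return first_physical_entry_name(stream) == "[Content_Types].xml"


def build_demo_archive():
    """Build a tiny in-memory zip that mimics the OOXML entry order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf
```

  Calling looks_like_ooxml(build_demo_archive()) exercises the fast path.
Empty archives and anything else that does not start with a local file
header simply return None from first_physical_entry_name.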
  Because others may have other uses for this data, I'm sharing the
key/value table here: http://162.242.228.174/share/zips.txt.gz
  See https://issues.apache.org/jira/browse/TIKA-2849 for some discussion.
  The file is 5GB uncompressed.

   Let me know if you have any questions and/or if this is of any
interest.  Many thanks to commons-compress!

       Cheers,

                  Tim
