Posted to corpora-dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/10/01 15:13:19 UTC

JXLs from Common Crawl CC-MAIN-2021-31

Possibly in response to a recent PDF Days talk (?)[0], Micky Lindlar
asked on Twitter whether anyone had seen JPEG XL files in the wild[1].

I added JXL detection to Tika and re-ran detection on all the files that
had previously been identified as "application/octet-stream".  I found
~462 likely JXL files.  I have not yet looked for them embedded in
other files.
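
For anyone who wants to run a similar check over their own files, here
is a minimal sketch of magic-byte detection in Python. It is not Tika's
implementation; the function names are hypothetical, and the signatures
are the ones I understand the JPEG XL spec to define (a bare codestream
starting with FF 0A, and a 12-byte signature box for the container
format). The bare-codestream signature is only two bytes, so a match
means "likely JXL" at best and needs to be confirmed with a real
decoder.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed signatures, per the JPEG XL spec (not copied from Tika's source).
JXL_CODESTREAM_MAGIC = b"\xff\x0a"
JXL_CONTAINER_MAGIC = b"\x00\x00\x00\x0cJXL \x0d\x0a\x87\x0a"


def classify_jxl(header: bytes) -> Optional[str]:
    """Return 'container', 'codestream', or None for the leading bytes of a file."""
    # Check the longer, more specific signature first.
    if header.startswith(JXL_CONTAINER_MAGIC):
        return "container"
    # Only two bytes long, so false positives are possible.
    if header.startswith(JXL_CODESTREAM_MAGIC):
        return "codestream"
    return None


def find_likely_jxls(root: str) -> List[Tuple[Path, str]]:
    """Walk a directory tree and list files whose first bytes look like JXL."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            kind = classify_jxl(fh.read(len(JXL_CONTAINER_MAGIC)))
        if kind is not None:
            hits.append((path, kind))
    return hits
```

Calling find_likely_jxls() on a directory of files previously typed as
octet-stream returns (path, kind) pairs for the candidates. It only
looks at the start of each file, so it will not find JXLs embedded in
other files.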

I've tgz'd the files (20M) and made them available here:
https://corpora.tika.apache.org/base/share/CC-MAIN-2021-31-jxls.tgz

For those interested in JXL, Jon Sneyers also pointed to this
resource: https://github.com/libjxl/conformance

Cheers,

         Tim

[0] https://twitter.com/CHLThor/status/1443585512426520584?s=20 and
https://www.pdfa.org/presentation/a-work-in-progress-pdf-r-revisions-and-new-highly-compressed-image-format/

[1] https://twitter.com/MickyLindlar/status/1443585512258695169?s=20