Posted to corpora-dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/10/01 15:13:19 UTC
JXLs from Common Crawl CC-MAIN-2021-31
Possibly in response to a recent PDF Days talk (?)[0], Micky Lindlar
asked on Twitter if anyone had seen JPEG XL files in the wild[1].
I added JXL detection to Tika and re-ran detection on all the files
that had previously been identified as "application/octet-stream". I
found ~462 likely JXL files. I have not yet looked for JXLs embedded
in other files.
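For anyone who wants to run a similar check outside Tika, here is a
rough sketch of signature-based JXL sniffing. It is not the detection
logic that was added to Tika; the function and constant names are made
up for illustration. It assumes the two published JPEG XL signatures:
the 2-byte bare codestream marker (FF 0A) and the 12-byte signature box
of the ISOBMFF-style container.

```python
from typing import Optional

# Bare JPEG XL codestream: starts with the two bytes FF 0A.
JXL_CODESTREAM_MAGIC = b"\xff\x0a"
# JPEG XL container: starts with a 12-byte signature box,
# 00 00 00 0C 'J' 'X' 'L' ' ' 0D 0A 87 0A.
JXL_CONTAINER_MAGIC = b"\x00\x00\x00\x0cJXL \x0d\x0a\x87\x0a"


def sniff_jxl(header: bytes) -> Optional[str]:
    """Return "container", "codestream", or None for the leading bytes.

    Hypothetical helper for illustration only; not a Tika API.
    """
    # Check the longer, more specific signature first.
    if header.startswith(JXL_CONTAINER_MAGIC):
        return "container"
    if header.startswith(JXL_CODESTREAM_MAGIC):
        return "codestream"
    return None


def sniff_jxl_file(path: str) -> Optional[str]:
    """Read only as many bytes as the longest signature needs."""
    with open(path, "rb") as f:
        return sniff_jxl(f.read(len(JXL_CONTAINER_MAGIC)))
```

Note that the bare codestream signature is only two bytes long, so a
match on it alone is a candidate rather than a confirmed JXL file.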
I've tgz'd the files (20M) and made them available here:
https://corpora.tika.apache.org/base/share/CC-MAIN-2021-31-jxls.tgz
For those interested in JXL, Jon Sneyers also pointed to this
resource: https://github.com/libjxl/conformance
Cheers,
Tim
[0] https://twitter.com/CHLThor/status/1443585512426520584?s=20 and
https://www.pdfa.org/presentation/a-work-in-progress-pdf-r-revisions-and-new-highly-compressed-image-format/
[1] https://twitter.com/MickyLindlar/status/1443585512258695169?s=20