Posted to dev@tika.apache.org by "Matthew Caruana Galizia (JIRA)" <ji...@apache.org> on 2017/08/22 15:01:00 UTC

[jira] [Created] (TIKA-2444) JP2 codestream files not parsed

Matthew Caruana Galizia created TIKA-2444:
---------------------------------------------

             Summary: JP2 codestream files not parsed
                 Key: TIKA-2444
                 URL: https://issues.apache.org/jira/browse/TIKA-2444
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.16
            Reporter: Matthew Caruana Galizia


We've come across some embedded files in the wild that are detected by Tika as {{image/x-jp2-codestream}}. The identification is correct according to a description of the format [1].
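For illustration, a raw codestream can be told apart from a JP2 container by its leading bytes: a codestream begins with the SOC marker (0xFF4F) immediately followed by the SIZ marker (0xFF51), while a JP2 file begins with the 12-byte signature box. The sketch below is independent of Tika's own detection rules; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Tika. */
final class Jp2Sniffer {
    // SOC marker (FF 4F) immediately followed by the SIZ marker (FF 51).
    private static final byte[] CODESTREAM_MAGIC =
            {(byte) 0xFF, 0x4F, (byte) 0xFF, 0x51};

    // JP2 signature box: length 12, type "jP  ", content 0D 0A 87 0A.
    private static final byte[] JP2_MAGIC =
            {0, 0, 0, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, (byte) 0x87, 0x0A};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** True if the bytes look like a bare JPEG 2000 codestream. */
    static boolean isRawCodestream(byte[] head) {
        return startsWith(head, CODESTREAM_MAGIC);
    }

    /** True if the bytes look like a JP2 container file. */
    static boolean isJp2Container(byte[] head) {
        return startsWith(head, JP2_MAGIC);
    }
}
```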

However, no Parser implementation declares support for this format.

It would make sense to declare support for this format in the Tesseract OCR parser. However, the parser would need to contain functionality that either:

1) wraps the codestream in a JP2 container; or
2) transcodes the image to PNG.

This is because while Tesseract supports JP2 (via Leptonica), it doesn't support the raw codestream as a file.
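A minimal sketch of option 1 follows, assuming the standard JP2 box layout (signature, ftyp, jp2h with ihdr and colr, then jp2c) and reading the image dimensions from the codestream's SIZ marker segment. The class name Jp2Wrapper is hypothetical and nothing here is existing Tika code. The colour space is a guess based on component count (1 = greyscale, 3 = sRGB), so the header flags it as unknown. Whether Tesseract/Leptonica accepts the result has not been verified.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: wraps a raw JPEG 2000 codestream in a minimal JP2 container. */
final class Jp2Wrapper {
    private static final int HEADER_BYTES = 12 + 20 + 45 + 8; // boxes before the codestream

    static byte[] wrap(byte[] cs) {
        if (cs.length < 45 || (cs[0] & 0xFF) != 0xFF || (cs[1] & 0xFF) != 0x4F
                || (cs[2] & 0xFF) != 0xFF || (cs[3] & 0xFF) != 0x51) {
            throw new IllegalArgumentException("not a JPEG 2000 codestream");
        }
        ByteBuffer siz = ByteBuffer.wrap(cs);         // big-endian by default
        int width = siz.getInt(8) - siz.getInt(16);   // Xsiz - XOsiz
        int height = siz.getInt(12) - siz.getInt(20); // Ysiz - YOsiz
        int components = siz.getShort(40) & 0xFFFF;   // Csiz
        if (cs.length < 42 + 3 * components) {
            throw new IllegalArgumentException("truncated SIZ segment");
        }
        int depth = cs[42] & 0xFF;                    // Ssiz of component 0
        for (int i = 1; i < components; i++) {
            if ((cs[42 + 3 * i] & 0xFF) != depth) {
                // Mixed depths would need BPC=255 plus a bpcc box; out of scope here.
                throw new UnsupportedOperationException("mixed bit depths");
            }
        }
        int colourSpace;                              // ASSUMPTION: guessed from Csiz
        if (components == 1) colourSpace = 17;        // enumerated greyscale
        else if (components == 3) colourSpace = 16;   // enumerated sRGB
        else throw new UnsupportedOperationException("cannot guess colour space");

        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + cs.length);
        // Signature box "jP  "
        out.putInt(12).putInt(0x6A502020).putInt(0x0D0A870A);
        // File type box "ftyp": brand "jp2 ", minor version 0, compatible with "jp2 "
        out.putInt(20).putInt(0x66747970).putInt(0x6A703220).putInt(0).putInt(0x6A703220);
        // JP2 header superbox "jp2h" (8 + ihdr 22 + colr 15)
        out.putInt(45).putInt(0x6A703268);
        // Image header box "ihdr"
        out.putInt(22).putInt(0x69686472);
        out.putInt(height).putInt(width).putShort((short) components);
        out.put((byte) depth); // BPC uses the same encoding as Ssiz
        out.put((byte) 7);     // compression type: always 7
        out.put((byte) 1);     // UnkC = 1: colour space not known for certain
        out.put((byte) 0);     // no intellectual property box
        // Colour specification box "colr": enumerated method
        out.putInt(15).putInt(0x636F6C72);
        out.put((byte) 1).put((byte) 0).put((byte) 0).putInt(colourSpace);
        // Contiguous codestream box "jp2c"; length 0 means "runs to end of file"
        out.putInt(0).putInt(0x6A703263);
        out.put(cs);
        return out.array();
    }
}
```

A parser could call Jp2Wrapper.wrap(bytes) on the embedded stream and hand the result to Tesseract as a .jp2 file. Codestreams with other component counts or mixed bit depths would need additional boxes and are rejected by this sketch.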

[1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)