Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/07/06 13:51:00 UTC

[jira] [Comment Edited] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

    [ https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563243#comment-17563243 ] 

Tim Allison edited comment on TIKA-3812 at 7/6/22 1:50 PM:
-----------------------------------------------------------

These are the differences in parser selection between 2.4.1 and 2.4.0 when Tesseract is not installed and a user has both {{tika-parsers-standard-package}} and {{tika-parser-scientific-package}} on their classpath.
{noformat}
application/x-hdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.hdf.HDFParser
application/x-netcdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.netcdf.NetCDFParser
image/bmp 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/gif 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/jpeg 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.JpegParser
image/png 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
video/mp4 2.4.1: class org.apache.tika.parser.external.CompositeExternalParser 2.4.0: class org.apache.tika.parser.mp4.MP4Parser
{noformat}

These are the differences in parser selection when Tesseract is installed:
{noformat}
application/x-hdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.hdf.HDFParser
application/x-netcdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.netcdf.NetCDFParser
image/bmp 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/gif 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/jp2 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.ocr.TesseractOCRParser
image/jpeg 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.JpegParser
image/png 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
video/mp4 2.4.1: class org.apache.tika.parser.external.CompositeExternalParser 2.4.0: class org.apache.tika.parser.mp4.MP4Parser
{noformat}
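
For anyone who wants to produce a similar listing for their own setup, here is a minimal sketch of the comparison step. It assumes you already have two snapshots that map each mime type to the class name of the selected parser, one per Tika version; how the snapshots are collected is outside this sketch. {{ParserSelectionDiff}} and {{diff}} are hypothetical names, not part of Tika, and the {{text/plain}} entry in the sample data is illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Tika.
final class ParserSelectionDiff {

    /**
     * Compares two snapshots of "mime type -> selected parser class name"
     * and returns one line per mime type whose selection differs.
     * A mime type missing from one snapshot is reported with "null".
     */
    static List<String> diff(String newLabel, Map<String, String> newer,
                             String oldLabel, Map<String, String> older) {
        // Union of all mime types, sorted so the output is stable.
        Set<String> types = new TreeSet<>(newer.keySet());
        types.addAll(older.keySet());

        List<String> lines = new ArrayList<>();
        for (String type : types) {
            String now = newer.get(type);
            String before = older.get(type);
            if (!Objects.equals(now, before)) {
                lines.add(type + " " + newLabel + ": class " + now
                        + " " + oldLabel + ": class " + before);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data: two rows copied from the listing, plus one
        // unchanged row (text/plain) that is illustrative only.
        Map<String, String> v241 = new LinkedHashMap<>();
        v241.put("image/jpeg", "org.apache.tika.parser.gdal.GDALParser");
        v241.put("video/mp4", "org.apache.tika.parser.external.CompositeExternalParser");
        v241.put("text/plain", "org.apache.tika.parser.txt.TXTParser");

        Map<String, String> v240 = new LinkedHashMap<>();
        v240.put("image/jpeg", "org.apache.tika.parser.image.JpegParser");
        v240.put("video/mp4", "org.apache.tika.parser.mp4.MP4Parser");
        v240.put("text/plain", "org.apache.tika.parser.txt.TXTParser");

        diff("2.4.1", v241, "2.4.0", v240).forEach(System.out::println);
    }
}
```

Sorting the mime types keeps the output stable across runs, which makes it easier to compare listings from different environments (for example, with and without Tesseract installed).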


> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --------------------------------------------------------------------------
>
>                 Key: TIKA-3812
>                 URL: https://issues.apache.org/jira/browse/TIKA-3812
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Eugen Caruntu
>            Priority: Minor
>             Fix For: 2.4.2
>
>
> The selected parser seems to be different in 2.4.1. For example, an image (jpg/png) that was previously (in 2.4.0) processed by TesseractOCRParser now gets parsed by GDALParser.
> It seems that when multiple parsers support the same file types, the selected parser depends on the order in which the parsers are loaded.
> For example, GDALParser, ImageParser and TesseractOCRParser all support image/jpeg, image/png, image/gif ...
> A recent change (TIKA-3750) reversed the parser order.
> Reconfiguring the GDALParser to exclude the image mime types might work, but other parsers could overlap in the same way.
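
To illustrate the load-order dependence described in the report, here is a simplified model. It assumes that each parser registers itself for every mime type it supports and that a later registration overwrites an earlier one for the same type. That rule is an assumption made for illustration, not a copy of Tika's selection code, and {{ParserOrderSketch}} and {{select}} are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model for illustration; not Tika's actual implementation.
final class ParserOrderSketch {

    /**
     * Takes parsers in load order (parser name -> supported mime types)
     * and returns the parser selected for each mime type, assuming the
     * last parser registered for a type wins.
     */
    static Map<String, String> select(Map<String, List<String>> parsersInLoadOrder) {
        Map<String, String> selected = new TreeMap<>();
        for (Map.Entry<String, List<String>> parser : parsersInLoadOrder.entrySet()) {
            for (String type : parser.getValue()) {
                // A later parser silently replaces an earlier one.
                selected.put(type, parser.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> imageTypes = List.of("image/png", "image/gif");

        // One load order: TesseractOCRParser is registered last.
        Map<String, List<String>> order = new LinkedHashMap<>();
        order.put("GDALParser", imageTypes);
        order.put("ImageParser", imageTypes);
        order.put("TesseractOCRParser", imageTypes);
        System.out.println("original order: " + select(order));

        // The same parsers in reverse order: GDALParser is registered last.
        Map<String, List<String>> reversed = new LinkedHashMap<>();
        reversed.put("TesseractOCRParser", imageTypes);
        reversed.put("ImageParser", imageTypes);
        reversed.put("GDALParser", imageTypes);
        System.out.println("reversed order: " + select(reversed));
    }
}
```

Under this model, reversing the load order is enough to move the overlapping image types from TesseractOCRParser to GDALParser without any change to the parsers themselves, which is consistent with the behavior reported above.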


