
Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/03/09 20:39:00 UTC

[jira] [Commented] (TIKA-3067) Different numbers of embedded inline images with PDF inline image extraction code

    [ https://issues.apache.org/jira/browse/TIKA-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055350#comment-17055350 ] 

Tim Allison commented on TIKA-3067:
-----------------------------------

The reasons for these diffs could be innocuous: a problem with the eval code, for example, or one of the methods handling duplicates differently.

I drilled down into: http://162.242.228.174/docs/govdocs1/437/437698.pdf

tika-eval noted that there were 6 inline images with 1.23 and 4 with 1.24-pre-rc1.

The images missing from the 1.24-pre-rc1 output appear to be near duplicates and don't actually appear in the PDF.
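
One way to check whether a per-file difference like this comes down to duplicate handling is to compare content digests of the extracted images from the two runs. The following is only a minimal sketch: it assumes the two attachments have been unpacked into local directories (the directory names in the usage note are hypothetical), and it only detects byte-identical duplicates, so near duplicates would still need a visual or perceptual comparison.

```python
import hashlib
from collections import Counter
from pathlib import Path


def load_blobs(directory):
    """Yield the raw bytes of every regular file under a directory."""
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            yield path.read_bytes()


def digest_counts(blobs):
    """Count how many times each SHA-256 digest occurs in an iterable of bytes."""
    return Counter(hashlib.sha256(blob).hexdigest() for blob in blobs)


def compare_runs(blobs_a, blobs_b):
    """Report digests unique to each run and digests duplicated within a run."""
    a = digest_counts(blobs_a)
    b = digest_counts(blobs_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "duplicated_in_a": sorted(d for d, n in a.items() if n > 1),
        "duplicated_in_b": sorted(d for d, n in b.items() if n > 1),
    }
```

Usage would be along the lines of `compare_runs(load_blobs("437698_tika_1_23"), load_blobs("437698_tika_1_24"))`, where both paths are placeholders for wherever the attachments were unpacked.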

> Different numbers of embedded inline images with PDF inline image extraction code
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-3067
>                 URL: https://issues.apache.org/jira/browse/TIKA-3067
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 437698_tika_1_23.tgz, 437698_tika_1_24.tgz
>
>
> I ran inline image extraction on a local sample of 20k files from Common Crawl and govdocs1.
> These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
> ||MIME_STRING||CNT||
> |image/png|175,413|
> |image/tiff|59,507|
> |image/jpeg|6,435|
> |image/x-jbig2|4,998|
> |image/jp2|4,573|
> |image/x-jp2-codestream|1|
> At first glance, this looks like we're gaining ~175k png files with the new method. However, in other files, it looks like we're losing a bunch of files as well.
> These are embedded files missing in 1.24-pre-rc1 when compared with 1.23:
> ||MIME_STRING||CNT||
> |image/png|105,885|
> |image/tiff|55,636|
> |image/jpeg|3,289|
> |image/x-jbig2|291|
> |text/plain; charset=windows-1252|2|
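
To put the two tables side by side, the net change per MIME type can be computed directly from the reported counts. This is a small sketch using only the numbers quoted above; the variable and function names are illustrative, and a net figure hides the fact that the gained and lost files are different files, so it is a summary rather than a diagnosis.

```python
# Counts copied from the two tables in the issue description.
missing_in_1_23 = {          # present in 1.24-pre-rc1, absent in 1.23
    "image/png": 175_413,
    "image/tiff": 59_507,
    "image/jpeg": 6_435,
    "image/x-jbig2": 4_998,
    "image/jp2": 4_573,
    "image/x-jp2-codestream": 1,
}
missing_in_1_24_pre_rc1 = {  # present in 1.23, absent in 1.24-pre-rc1
    "image/png": 105_885,
    "image/tiff": 55_636,
    "image/jpeg": 3_289,
    "image/x-jbig2": 291,
    "text/plain; charset=windows-1252": 2,
}


def net_change(gained, lost):
    """Return gained minus lost per MIME type (positive means more files in 1.24-pre-rc1)."""
    return {
        mime: gained.get(mime, 0) - lost.get(mime, 0)
        for mime in sorted(set(gained) | set(lost))
    }


if __name__ == "__main__":
    for mime, delta in net_change(missing_in_1_23, missing_in_1_24_pre_rc1).items():
        print(f"{mime}\t{delta:+,}")
```

On these numbers, the net gain for image/png is 69,528 rather than ~175k, and image/tiff nets out to 3,871.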


