Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/05/04 02:21:04 UTC

[jira] [Created] (TIKA-2354) Missing many embedded images in .doc files

Tim Allison created TIKA-2354:
---------------------------------

             Summary: Missing many embedded images in .doc files
                 Key: TIKA-2354
                 URL: https://issues.apache.org/jira/browse/TIKA-2354
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Blocker


On a slightly deeper look at the comparison results between 1.14 and trunk, it looks like we're missing quite a few embedded images from .doc files.  I initially thought these could be explained by different handling of macros, but that's not the issue.

I haven't traced the commit that did it (very likely my fault), but the problem shows up when we call {{handlePictureCharacterRun}} with a null character run:
{noformat}
        // Handle any pictures that we haven't output yet
        for (Picture p = pictures.nextUnclaimed(); p != null; ) {
            handlePictureCharacterRun(
                    null, p, pictures, xhtml
            );
            p = pictures.nextUnclaimed();
        }
{noformat}

The null character run then causes the picture to be skipped in this check, because {{isRendered(cr)}} returns false if {{cr}} is {{null}}:

{noformat}
        if (!isRendered(cr) || picture == null) {
            // Oh dear, we've run out...
            // Probably caused by multiple \u0008 images referencing
            //  the same real image
            return;
        }
{noformat}
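
To make the interaction between the two snippets above concrete, here is a minimal, self-contained sketch. It does not use the real Tika or POI classes: {{Run}}, {{isRenderedReported}}, {{isRenderedNullTolerant}} and {{drainUnclaimed}} are hypothetical stand-ins, and the null-tolerant variant is only one possible direction, not a confirmed fix. It assumes that a picture left over at the end of the document has no character run at all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class NullRunSketch {

    /** Hypothetical stand-in for a character run; it only tracks whether the run is hidden. */
    static final class Run {
        final boolean hidden;

        Run(boolean hidden) {
            this.hidden = hidden;
        }
    }

    /** Behavior described in the report: a null run counts as "not rendered". */
    public static boolean isRenderedReported(Run cr) {
        return cr != null && !cr.hidden;
    }

    /** Assumed alternative: a null run gives no reason to skip the picture. */
    public static boolean isRenderedNullTolerant(Run cr) {
        return cr == null || !cr.hidden;
    }

    /**
     * Mimics the "handle any pictures that we haven't output yet" loop.
     * Every leftover picture is handled with a null run, as in the report.
     */
    public static List<String> drainUnclaimed(List<String> unclaimed, boolean nullTolerant) {
        Deque<String> queue = new ArrayDeque<>(unclaimed);
        List<String> emitted = new ArrayList<>();
        for (String p = queue.poll(); p != null; p = queue.poll()) {
            Run cr = null; // leftover pictures have no character run
            boolean rendered = nullTolerant
                    ? isRenderedNullTolerant(cr)
                    : isRenderedReported(cr);
            if (!rendered) {
                continue; // the picture is silently dropped here
            }
            emitted.add(p);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> pics = Arrays.asList("image1.png", "image2.jpg");
        System.out.println(drainUnclaimed(pics, false)); // prints []
        System.out.println(drainUnclaimed(pics, true));  // prints [image1.png, image2.jpg]
    }
}
```

With the reported check, every unclaimed picture is dropped because its run is always null. Whether the right change is to make the rendered check null-tolerant or to bypass it for unclaimed pictures still needs to be decided against the actual code.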



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)