Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/05/04 02:21:04 UTC
[jira] [Created] (TIKA-2354) Missing many embedded images in .doc files
Tim Allison created TIKA-2354:
---------------------------------
Summary: Missing many embedded images in .doc files
Key: TIKA-2354
URL: https://issues.apache.org/jira/browse/TIKA-2354
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
Priority: Blocker
On a slightly deeper look at the comparison results between 1.14 and trunk, it looks like we're missing quite a few embedded images from .doc files. I initially thought these could be explained by different handling of macros, but that's not the issue.
I haven't traced the commit that introduced this (very likely my fault), but the problem arises when we call {{handlePictureCharacterRun}} with a null character run:
{noformat}
// Handle any pictures that we haven't output yet
for (Picture p = pictures.nextUnclaimed(); p != null; ) {
    handlePictureCharacterRun(
            null, p, pictures, xhtml
    );
    p = pictures.nextUnclaimed();
}
{noformat}
The null character run then causes the picture to be skipped in the following check, because {{isRendered(cr)}} returns false when {{cr}} is {{null}}:
{noformat}
if (!isRendered(cr) || picture == null) {
    // Oh dear, we've run out...
    // Probably caused by multiple \u0008 images referencing
    // the same real image
    return;
}
{noformat}
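To make the interaction concrete, here is a minimal, self-contained sketch of how the two snippets combine. It does not use Tika or POI classes: {{Run}}, {{isRenderedStrict}}, {{isRenderedLenient}} and {{emitUnclaimed}} are hypothetical names made up for illustration, and the "lenient" variant is only one possible way to handle a null run, not necessarily the fix that should be committed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NullRunSketch {
    // Hypothetical stand-in for a character run; this is NOT POI's CharacterRun.
    static final class Run {
        final boolean markedDeleted;
        Run(boolean markedDeleted) { this.markedDeleted = markedDeleted; }
    }

    // Behaviour described in the report: a null run counts as "not rendered".
    static boolean isRenderedStrict(Run cr) {
        return cr != null && !cr.markedDeleted;
    }

    // One possible alternative (an assumption, not the committed fix):
    // a null run means there is no run to consult, so do not suppress the picture.
    static boolean isRenderedLenient(Run cr) {
        return cr == null || !cr.markedDeleted;
    }

    // Mimics the cleanup loop: every leftover picture is handled with a null run.
    static List<String> emitUnclaimed(List<String> unclaimed, boolean lenient) {
        List<String> emitted = new ArrayList<>();
        for (String picture : unclaimed) {
            Run cr = null; // the cleanup loop always passes null
            boolean rendered = lenient ? isRenderedLenient(cr) : isRenderedStrict(cr);
            if (!rendered || picture == null) {
                continue; // corresponds to the early return in the guard
            }
            emitted.add(picture);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> leftovers = Arrays.asList("image1.png", "image2.png");
        System.out.println("strict:  " + emitUnclaimed(leftovers, false));
        System.out.println("lenient: " + emitUnclaimed(leftovers, true));
    }
}
```

With the strict check every unclaimed picture is dropped, which matches the missing images seen in the comparison; with the lenient check both pictures are emitted.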