Posted to dev@tika.apache.org by "Robert Kaulbach (Jira)" <ji...@apache.org> on 2020/08/12 00:59:00 UTC

[jira] [Updated] (TIKA-3157) Missing content from .docx file with hyperlinked shape

     [ https://issues.apache.org/jira/browse/TIKA-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kaulbach updated TIKA-3157:
----------------------------------
    Description: 
The attached .docx file was created in MS Office: I simply drew a rectangle and then added a hyperlink to it. Although the hyperlink doesn't show in LibreOffice, it is still there and clickable when the file is opened with MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be found in the output. Enabling all Office/OOXML parse options in the context has not helped.

When debugging, I can see that the "a:hlinkClick" tag containing the link is skipped in the startElement method of org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java, because "inACChoiceDepth" is greater than 0.
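
To illustrate the kind of skipping described above, here is a minimal, self-contained sketch using only the JDK's SAX parser. It is not Tika's actual code: the class ChoiceSkipSketch, the skipInsideChoice flag, and the sample XML (including the relationship id "rId4") are hypothetical, and only the names "a:hlinkClick" and "inACChoiceDepth" are taken from the report. It assumes the handler counts how deeply it is nested inside mc:Choice and ignores every element while that counter is above 0.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChoiceSkipSketch {

    // Hypothetical, heavily simplified fragment; real .docx markup is far more verbose.
    static final String SAMPLE =
        "<root xmlns:mc=\"http://schemas.openxmlformats.org/markup-compatibility/2006\""
        + " xmlns:a=\"http://schemas.openxmlformats.org/drawingml/2006/main\""
        + " xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\">"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires=\"wps\"><a:hlinkClick r:id=\"rId4\"/></mc:Choice>"
        + "</mc:AlternateContent>"
        + "</root>";

    // Illustrative handler only: it mimics the reported behaviour, not Tika's implementation.
    static class Handler extends DefaultHandler {
        private final boolean skipInsideChoice; // hypothetical switch, not a Tika option
        private int inACChoiceDepth = 0;        // name borrowed from the report
        final List<String> linkIds = new ArrayList<>();

        Handler(boolean skipInsideChoice) {
            this.skipInsideChoice = skipInsideChoice;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("mc:Choice".equals(qName)) {
                inACChoiceDepth++;
                return;
            }
            if (skipInsideChoice && inACChoiceDepth > 0) {
                return; // every element nested in mc:Choice is ignored, links included
            }
            if ("a:hlinkClick".equals(qName)) {
                linkIds.add(atts.getValue("r:id"));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("mc:Choice".equals(qName)) {
                inACChoiceDepth--;
            }
        }
    }

    static List<String> collectLinkIds(String xml, boolean skipInsideChoice) throws Exception {
        // The factory is not namespace-aware by default, so qName keeps its prefix.
        Handler handler = new Handler(skipInsideChoice);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.linkIds;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectLinkIds(SAMPLE, true));  // prints []
        System.out.println(collectLinkIds(SAMPLE, false)); // prints [rId4]
    }
}
```

With the skip enabled the relationship id never reaches the output, which matches the symptom in this report. With it disabled the id is collected, which is roughly what a configurable check would allow.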

And then the fallback tag, which also has the link inside a

  was:
The attached .docx file was created in MS Office: I simply drew a rectangle and then added a hyperlink to it. Although the hyperlink doesn't show in LibreOffice, it is still there and clickable when the file is opened with MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be found in the output. Enabling all Office/OOXML parse options in the context has not helped.

When debugging, I can see that the linked shape is skipped in the startElement method of org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java, because "inACChoiceDepth" is greater than 0.

For my use case I'd like to extract as much information as possible from the document. It would be helpful if the parser config could either disable this check on "inACChoiceDepth" or increase the allowed limit before skipping content.


> Missing content from .docx file with hyperlinked shape
> ------------------------------------------------------
>
>                 Key: TIKA-3157
>                 URL: https://issues.apache.org/jira/browse/TIKA-3157
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Robert Kaulbach
>            Priority: Minor


