Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/14 14:57:41 UTC

[jira] [Commented] (TIKA-2177) microsoft.OfficeParser shows add links in additional paragraphs

    [ https://issues.apache.org/jira/browse/TIKA-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924360#comment-15924360 ] 

Tim Allison commented on TIKA-2177:
-----------------------------------

Sorry for taking so long to reply.

I was just looking into this again over on POI.  The issue is that hyperlink addresses (or their reference ids) are stored after the sheet data in xls, xlsx and xlsb, so a streaming parser has already emitted the cells by the time it reaches the links.  To put each link inside its cell, we would have to parse the full sheet data, cache the hyperlink addresses and then reparse the sheet data.
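To make the two-pass idea concrete, here is a rough sketch in Python rather than the actual Tika/POI code. It assumes a simplified, namespace-free stand-in for an xlsx worksheet part, and the `target` attribute on `<hyperlink>` is a hypothetical shortcut: in real xlsx files the element carries a relationship id and the address lives in a separate relationships part. The function names `collect_hyperlinks` and `emit_rows` are likewise made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for a worksheet part (assumption: no namespaces,
# link target inlined). The <hyperlinks> block follows <sheetData>.
SHEET_XML = """<worksheet>
  <sheetData>
    <row><c r="A1"><v>mailadress</v></c><c r="B1"><v>password</v></c></row>
    <row><c r="A2"><v>santa@gmail.com</v></c><c r="B2"><v>hohoho</v></c></row>
  </sheetData>
  <hyperlinks>
    <hyperlink ref="A2" target="mailto:santa@gmail.com"/>
  </hyperlinks>
</worksheet>"""


def collect_hyperlinks(xml_text):
    """Pass 1: stream the whole sheet and cache cell ref -> link target."""
    links = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "hyperlink":
            links[elem.get("ref")] = elem.get("target")
    return links


def emit_rows(xml_text, links):
    """Pass 2: stream the sheet again, attaching cached links to cells."""
    rows = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            cells = []
            for cell in elem.findall("c"):
                text = cell.findtext("v", default="")
                href = links.get(cell.get("r"))
                cells.append(f'<a href="{href}">{text}</a>' if href else text)
            rows.append(cells)
    return rows


links = collect_hyperlinks(SHEET_XML)
rows = emit_rows(SHEET_XML, links)
```

The cost is visible in the sketch: every sheet is read twice, which is why this would need to be opt-in.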

A hyperlink can have 3 values: display, url and tooltip (at least in xlsx).  The display is (typically) stored in the sheet data in the appropriate cell.  The url and tooltip are stored outside of the sheet data.  What you are seeing in your example is what happens when display==url.  It looks like there's a duplicate.
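As an illustration only, the three values and the "looks like a duplicate" case could be modelled as below. The `Hyperlink` class and the `looks_duplicated` helper are hypothetical names, not part of Tika or POI, and stripping a `mailto:` prefix before comparing is an assumption made to match the reporter's example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hyperlink:
    display: str                    # text shown in the cell (in the sheet data)
    url: str                        # link target (outside the sheet data)
    tooltip: Optional[str] = None   # hover text (outside the sheet data)


def looks_duplicated(link: Hyperlink) -> bool:
    """True when emitting the url next to the cell text would repeat it."""
    target = link.url
    # Assumption: treat "mailto:x" and "x" as the same visible text.
    if target.lower().startswith("mailto:"):
        target = target[len("mailto:"):]
    return target.strip() == link.display.strip()


same = Hyperlink("santa@gmail.com", "mailto:santa@gmail.com")
different = Hyperlink("Contact Santa", "mailto:santa@gmail.com", "Write to Santa")
```

Here `looks_duplicated(same)` is true, which corresponds to the repeated address in the reported output, while `looks_duplicated(different)` is false because the display text and the target differ.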

So, if we made it configurable, would you be willing to parse each sheet twice in order to get the hyperlinks right?

> microsoft.OfficeParser shows add links in additional paragraphs
> ---------------------------------------------------------------
>
>                 Key: TIKA-2177
>                 URL: https://issues.apache.org/jira/browse/TIKA-2177
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.13
>         Environment: org.apache.tika.parser.microsoft.OfficeParser and org.apache.tika.parser.microsoft.ooxml.OOXMLParser
>            Reporter: Sara Miller
>            Priority: Minor
>
> I'm converting Excel files, both .xls and .xlsx.
> .xls uses org.apache.tika.parser.microsoft.OfficeParser and 
> .xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> If I have a link in my Excel document, for example santa@gmail.com, the .xls parser adds extra elements to the document structure, so the output no longer reflects how the document looks.
> For example, this table in file.xls: 
> mailadress	password
> santa@gmail.com	hohoho
> will output: 
>  <div class="page">
>             <h1>Sheet1</h1>
>             <table>
>                 <tbody>
>                     <tr>
>                         <td>mailadress</td>
>                         <td>password</td>
>                     </tr>
>                     <tr>
>                         <td>santa@gmail.com</td>
>                         <td>hohoho</td>
>                     </tr>
>                 </tbody>
>             </table>
>             <div class="outside">
>                 <a href="mailto:santa@gmail.com">santa@gmail.com</a>
>             </div>
>         </div>
> The <div class="outside"> should be removed because it does not correspond to the document structure. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)