Posted to dev@tika.apache.org by "Chris Bryant (JIRA)" <ji...@apache.org> on 2014/10/22 21:01:35 UTC

[jira] [Created] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

Chris Bryant created TIKA-1454:
----------------------------------

             Summary: Extracting as HTML loses links in xlsx, ppt, and pptx files
                 Key: TIKA-1454
                 URL: https://issues.apache.org/jira/browse/TIKA-1454
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.6
         Environment: I tested this only on RedHat EL5.
            Reporter: Chris Bryant


I am trying to convert documents to HTML and then search the HTML for anchor tags to find links to external URLs.  This works for several document types, including PDF, OpenDocument formats, Microsoft Word (.doc and .docx), and the older Microsoft Excel .xls format.  It does not work for either Microsoft PowerPoint format (.ppt or .pptx), or for the newer Excel .xlsx format.  For .ppt, .pptx, and .xlsx files, the text is extracted properly and formatted into HTML, but links are not converted to anchor tags.

I am running Tika in --server --html mode.
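
For anyone trying to reproduce this, the anchor-tag check described above can be sketched roughly as follows. This is only an illustration using Python's standard html.parser module, not the reporter's actual code. The names ExternalLinkCollector and find_external_links are hypothetical, and the two HTML fragments are made-up samples rather than real Tika output.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point to http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            # Treat only absolute http/https targets as external links.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def find_external_links(html_text):
    """Return the external link targets found in anchor tags, in document order."""
    collector = ExternalLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical fragments (assumptions, not actual Tika output):
# one where the link became an anchor tag, one where only the text survived.
with_anchor = '<p>See <a href="http://example.com/">the site</a></p>'
text_only = '<p>See the site http://example.com/</p>'

print(find_external_links(with_anchor))
print(find_external_links(text_only))
```

With a check like this, a document whose links come through as plain text yields an empty list even though the text is present, which matches the behavior reported for .ppt, .pptx, and .xlsx.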



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)