Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/14 02:03:00 UTC

[jira] [Resolved] (TIKA-1094) Bugged WordExtractor#handleSpecialCharacterRun method

     [ https://issues.apache.org/jira/browse/TIKA-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1094.
-----------------------------------
    Resolution: Fixed

Marking as fixed, since the linked files are parsed into the following correct-looking content:
{code}
<body><p>To jest <a href="http://onet.pl/">jakiś</a> link.</p>
{code}

> Bugged WordExtractor#handleSpecialCharacterRun method
> -----------------------------------------------------
>
>                 Key: TIKA-1094
>                 URL: https://issues.apache.org/jira/browse/TIKA-1094
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Konrad Tendera
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As the javadoc says, special character runs are defined as follows:
> "Can be \13..text..\15 or \13..control..\14..text..\15"
> In practice the layout differs significantly, with the result that, for example, hyperlinks aren't parsed properly. I checked with LibreOffice and Microsoft Office and found that a paragraph containing a HYPERLINK field actually looks like this:
> \13 (space here)HYPERLINK "address here" \1 \14 text \15
> "\u0001" and "\u0014" are separate character runs.
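
For readers following along, the field layout described in the report can be illustrated with a minimal sketch. It assumes the whole field has already been joined into one string, which is exactly what the report says does not hold in practice, since "\u0001" and "\u0014" arrive as separate character runs. The class FieldRunSketch and its methods are hypothetical names invented for this illustration; they are not part of Tika's WordExtractor or of POI.

```java
/** Hypothetical helper for illustration only; not Tika or POI code. */
final class FieldRunSketch {
    // Marker characters as named in the report (hexadecimal values).
    static final char FIELD_BEGIN = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;
    static final char PLACEHOLDER = 0x01; // the stray \1 seen before \14

    /**
     * Splits a field into {instruction, displayText}.
     * Returns null when the begin or end marker is missing.
     */
    static String[] split(String run) {
        int begin = run.indexOf(FIELD_BEGIN);
        if (begin < 0) {
            return null;
        }
        int end = run.indexOf(FIELD_END, begin + 1);
        if (end < 0) {
            return null;
        }
        int sep = run.indexOf(FIELD_SEPARATOR, begin + 1);
        if (sep < 0 || sep > end) {
            // Form "\13..text..\15": no instruction part.
            return new String[] {"", run.substring(begin + 1, end).trim()};
        }
        // Form "\13..control..\14..text..\15": drop the \1 placeholder.
        String instruction = run.substring(begin + 1, sep)
                .replace(String.valueOf(PLACEHOLDER), "")
                .trim();
        return new String[] {instruction, run.substring(sep + 1, end).trim()};
    }

    /** Returns the quoted address of a HYPERLINK instruction, or null. */
    static String hyperlinkTarget(String instruction) {
        if (!instruction.startsWith("HYPERLINK")) {
            return null;
        }
        int open = instruction.indexOf('"');
        int close = instruction.indexOf('"', open + 1);
        return (open >= 0 && close > open)
                ? instruction.substring(open + 1, close)
                : null;
    }

    public static void main(String[] args) {
        String run = FIELD_BEGIN + " HYPERLINK \"http://onet.pl/\" "
                + PLACEHOLDER + " " + FIELD_SEPARATOR + " text " + FIELD_END;
        String[] parts = split(run);
        System.out.println(hyperlinkTarget(parts[0]) + " -> " + parts[1]);
    }
}
```

A real fix would presumably have to accumulate the instruction and text across several character runs before applying this kind of split; the sketch only shows the ordering of the markers.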


