Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/03/19 16:35:15 UTC
[jira] [Commented] (TIKA-1094) Bugged WordExtractor#handleSpecialCharacterRun method
[ https://issues.apache.org/jira/browse/TIKA-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606409#comment-13606409 ]
Nick Burch commented on TIKA-1094:
----------------------------------
Is there a file you could share that shows the problem?
> Bugged WordExtractor#handleSpecialCharacterRun method
> -----------------------------------------------------
>
> Key: TIKA-1094
> URL: https://issues.apache.org/jira/browse/TIKA-1094
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Konrad Tendera
> Priority: Minor
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> As the javadoc says, special character runs are defined as follows:
> "Can be \13..text..\15 or \13..control..\14..text..\15"
> In practice the structure differs significantly, and as a result hyperlinks, for example, aren't parsed properly. Checking with LibreOffice and Microsoft Office, I found that a paragraph containing a HYPERLINK looks more like this:
> \13 (space here)HYPERLINK "address here" \1 \14 text \15
> "\u0001" and "\u0014" are separate character runs.
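
For illustration, the layout described in the report can be sketched as below. This is only a rough sketch, not the Tika or POI code: the FieldSketch class and its methods are hypothetical, it assumes the \13, \14 and \15 markers are the characters U+0013, U+0014 and U+0015, and it assumes the whole field is available as a single string. As the report notes, the markers may arrive as separate character runs, so a real fix would have to track state across runs. Nested fields are not handled.

```java
/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class FieldSketch {
    // Assumption: the javadoc's \13, \14, \15 are these hex code points.
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';
    // The extra \1 character the report observed before the separator.
    static final char PLACEHOLDER = '\u0001';

    /**
     * Splits the first field found in text into {instruction, displayText}.
     * Returns null if no complete begin..end pair is present.
     */
    static String[] splitField(String text) {
        int begin = text.indexOf(FIELD_BEGIN);
        if (begin < 0) {
            return null;
        }
        int end = text.indexOf(FIELD_END, begin + 1);
        if (end < 0) {
            return null;
        }
        int sep = text.indexOf(FIELD_SEPARATOR, begin + 1);
        String instruction;
        String display;
        if (sep >= 0 && sep < end) {
            // \13..control..\14..text..\15 form
            instruction = text.substring(begin + 1, sep);
            display = text.substring(sep + 1, end);
        } else {
            // \13..text..\15 form: no separator, so no instruction part
            instruction = "";
            display = text.substring(begin + 1, end);
        }
        // Drop the \1 placeholder and surrounding whitespace.
        instruction = instruction.replace(String.valueOf(PLACEHOLDER), "").trim();
        return new String[] { instruction, display.trim() };
    }

    /** Returns the quoted target of a HYPERLINK instruction, or null. */
    static String hyperlinkTarget(String instruction) {
        if (!instruction.startsWith("HYPERLINK")) {
            return null;
        }
        int open = instruction.indexOf('"');
        int close = instruction.indexOf('"', open + 1);
        if (open < 0 || close < 0) {
            return null;
        }
        return instruction.substring(open + 1, close);
    }
}
```

Given a string shaped like the reporter's example, splitField would separate the HYPERLINK instruction from the visible link text, and hyperlinkTarget would then pull out the quoted address.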
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira