Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2014/11/09 14:48:33 UTC

[jira] [Commented] (TIKA-1468) Symbol character handling in WordExtractor

    [ https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203923#comment-14203923 ] 

Nick Burch commented on TIKA-1468:
----------------------------------

Any chance of a small JUnit unit test for this? Probably one involving a short test and a very small test Word document?

As for the right location of the logic, it might be better in POI itself. That way, users of POI will benefit too, and we minimise the amount of POI-specific logic in Tika. POI 3.11 beta 3 is being voted on right now, but we ought to be able to get it into the next release.

> Symbol character handling in WordExtractor
> ------------------------------------------
>
>                 Key: TIKA-1468
>                 URL: https://issues.apache.org/jira/browse/TIKA-1468
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Moritz Dorka
>            Priority: Minor
>         Attachments: WordExtractor.patch
>
>
> Attached is a patch to allow for proper handling of _symbol characters_ in *.doc files (i.e. stuff which can be inserted via Insert->Symbol in Word).
> Side note: I am a little unsure where exactly the boundary between the scope of TIKA and POI lies here. Theoretically one could add that patch to {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument, CharacterRun, Element)}} as well.


