Posted to dev@tika.apache.org by "Moritz Dorka (JIRA)" <ji...@apache.org> on 2014/11/09 14:38:33 UTC

[jira] [Created] (TIKA-1468) Symbol character handling in WordExtractor

Moritz Dorka created TIKA-1468:
----------------------------------

             Summary: Symbol character handling in WordExtractor
                 Key: TIKA-1468
                 URL: https://issues.apache.org/jira/browse/TIKA-1468
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.6
            Reporter: Moritz Dorka
            Priority: Minor


Attached is a patch that adds proper handling of _symbol characters_ in *.doc files (i.e. characters inserted via Insert->Symbol in Word).
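
To illustrate what such handling could involve, here is a minimal sketch. It is not the attached patch. It assumes that a symbol run carries a font name plus a character code, and that the code often sits in the private-use range U+F000..U+F0FF. The class {{SymbolMapper}}, its method {{toUnicode}} and the four-entry table are hypothetical and exist only for this example; they are not part of Tika or POI.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class SymbolMapper {

    // A few well-known positions of the "Symbol" font encoding.
    // A real mapping would need the full table for every supported font.
    private static final Map<Integer, String> SYMBOL_FONT = new HashMap<>();
    static {
        SYMBOL_FONT.put(0x61, "\u03B1"); // Greek small letter alpha
        SYMBOL_FONT.put(0x62, "\u03B2"); // Greek small letter beta
        SYMBOL_FONT.put(0x70, "\u03C0"); // Greek small letter pi
        SYMBOL_FONT.put(0x44, "\u0394"); // Greek capital letter delta
    }

    private SymbolMapper() {
    }

    /**
     * Maps a symbol character to a Unicode string where a mapping is known.
     * Assumption: codes in U+F000..U+F0FF are font positions shifted by 0xF000.
     * Falls back to the unchanged character when no mapping is known.
     */
    static String toUnicode(String fontName, char symbolChar) {
        int code = symbolChar;
        if (code >= 0xF000 && code <= 0xF0FF) {
            code -= 0xF000; // strip the private-use offset
        }
        if ("Symbol".equalsIgnoreCase(fontName)) {
            String mapped = SYMBOL_FONT.get(code);
            if (mapped != null) {
                return mapped;
            }
        }
        return String.valueOf(symbolChar);
    }
}
```

With this sketch, {{SymbolMapper.toUnicode("Symbol", '\uF061')}} yields the Greek letter alpha, while a character from a font without a table is passed through unchanged so that no text is lost.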

Side note: I am a little unsure where exactly the boundary between the scope of Tika and POI lies here. Theoretically, one could add that patch to {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument, CharacterRun, Element)}} as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)