Posted to dev@tika.apache.org by "Cristian Vat (Jira)" <ji...@apache.org> on 2019/12/11 10:33:00 UTC
[jira] [Commented] (TIKA-3008) Word Doc/Docx Formatting Extraction - Superscript/Subscript
[ https://issues.apache.org/jira/browse/TIKA-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993418#comment-16993418 ]
Cristian Vat commented on TIKA-3008:
------------------------------------
Work-in-progress branch: [https://github.com/deathy/tika/tree/TIKA-3008]
I will open a PR once the branch is cleaned up and I have added multiple test documents.
> Word Doc/Docx Formatting Extraction - Superscript/Subscript
> -----------------------------------------------------------
>
> Key: TIKA-3008
> URL: https://issues.apache.org/jira/browse/TIKA-3008
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.23
> Reporter: Cristian Vat
> Priority: Major
>
> Word extraction from .doc/.docx does not handle superscript/subscript at all.
> This changes the extracted text: character runs that differ only in sup/sub formatting are merged together, because no tags are generated between them.
> This is especially problematic for some legal documents, where getting "according to Art 51" instead of "according to Art 5^1^" completely changes the meaning.
>
> The problem seems to affect parsing of both the old Word .doc format and the OOXML .docx format.
> Sub/sup can be set on the character run itself or on the document style assigned to that character run.
>
> I'm already working on fixes and test documents, and will comment with a work-in-progress branch.
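
To illustrate the problem described in the issue, here is a minimal, self-contained sketch. It is not Tika or POI code and not the fix in the branch: `Run`, `VertAlign`, `effective`, `plain` and `tagged` are hypothetical names invented for this example. It assumes that a sub/sup setting made directly on a run takes precedence over one inherited from the run's assigned style, and that emitting `<sup>`/`<sub>` tags is an acceptable way to keep the runs apart.

```java
import java.util.Arrays;
import java.util.List;

public class SupSubSketch {
    // Hypothetical vertical-alignment values; not Tika or POI types.
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    // Hypothetical character run: its text, the alignment set directly on
    // the run (null if unset), and the alignment from its style (null if unset).
    static final class Run {
        final String text;
        final VertAlign direct;
        final VertAlign fromStyle;

        Run(String text, VertAlign direct, VertAlign fromStyle) {
            this.text = text;
            this.direct = direct;
            this.fromStyle = fromStyle;
        }
    }

    // Assumption: a direct run setting overrides the style; otherwise baseline.
    static VertAlign effective(Run r) {
        if (r.direct != null) return r.direct;
        if (r.fromStyle != null) return r.fromStyle;
        return VertAlign.BASELINE;
    }

    // Behaviour described in the issue: runs are concatenated, sup/sub is lost.
    static String plain(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) sb.append(r.text);
        return sb.toString();
    }

    // Sketch of the desired behaviour: wrap sup/sub runs in tags.
    static String tagged(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            switch (effective(r)) {
                case SUPERSCRIPT:
                    sb.append("<sup>").append(r.text).append("</sup>");
                    break;
                case SUBSCRIPT:
                    sb.append("<sub>").append(r.text).append("</sub>");
                    break;
                default:
                    sb.append(r.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("according to Art 5", null, null),
            new Run("1", VertAlign.SUPERSCRIPT, null));
        System.out.println(plain(runs));   // according to Art 51
        System.out.println(tagged(runs));  // according to Art 5<sup>1</sup>
    }
}
```

A run whose sub/sup comes only from its style, for example `new Run("2", null, VertAlign.SUBSCRIPT)`, goes through the same `effective` lookup and is rendered as `<sub>2</sub>`.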
--
This message was sent by Atlassian Jira
(v8.3.4#803005)