Posted to dev@tika.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/14 11:06:00 UTC

[jira] [Commented] (TIKA-3008) Word Doc/Docx Formatting Extraction - Superscript/Subscript

    [ https://issues.apache.org/jira/browse/TIKA-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135116#comment-17135116 ] 

ASF GitHub Bot commented on TIKA-3008:
--------------------------------------

deathy opened a new pull request #321:
URL: https://github.com/apache/tika/pull/321


   Adds handling of superscript/subscript in the Word parsers, as described in TIKA-3008.




> Word Doc/Docx Formatting Extraction - Superscript/Subscript
> -----------------------------------------------------------
>
>                 Key: TIKA-3008
>                 URL: https://issues.apache.org/jira/browse/TIKA-3008
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>            Reporter: Cristian Vat
>            Priority: Major
>
> Word extraction from .doc/.docx doesn't handle superscript/subscript at all.
> This changes the actual text extracted: character runs that differ only in sup/sub formatting are merged together, because no tags are generated between them.
> This is especially problematic for some legal documents, where getting "according to Art 51" instead of "according to Art 5^1^" completely changes the meaning.
>  
> The problem seems to affect parsing of both the old Word .doc format and the OOXML .docx format.
> Sub/sup can be set on the character run itself or on the document style assigned to the character run.
>  
> I'm already working on fixes and test documents, and will comment with a work-in-progress branch.
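
To illustrate the run-merging problem described in the issue, here is a minimal, self-contained Java sketch. It is not the code from the pull request and does not use the Tika or POI APIs: the `Run` class, the `VertAlign` enum, and both render methods are hypothetical stand-ins. It assumes each character run already carries its resolved vertical alignment (whether set directly on the run or inherited from its style), and it does not escape the text for XML.

```java
import java.util.Arrays;
import java.util.List;

public class SupSubSketch {

    /** Hypothetical vertical alignment of a character run. */
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical stand-in for a Word character run: text plus resolved alignment. */
    static final class Run {
        final String text;
        final VertAlign align;

        Run(String text, VertAlign align) {
            this.text = text;
            this.align = align;
        }
    }

    /** Naive extraction: alignment is ignored, so adjacent runs merge into one string. */
    static String renderIgnoringAlign(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    /** Emits sup/sub tags, opening and closing them only when the alignment changes. */
    static String renderWithTags(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        VertAlign current = VertAlign.BASELINE;
        for (Run r : runs) {
            if (r.align != current) {
                sb.append(closeTag(current));
                sb.append(openTag(r.align));
                current = r.align;
            }
            sb.append(r.text);
        }
        sb.append(closeTag(current)); // close any tag still open at the end
        return sb.toString();
    }

    static String openTag(VertAlign a) {
        switch (a) {
            case SUPERSCRIPT: return "<sup>";
            case SUBSCRIPT:   return "<sub>";
            default:          return "";
        }
    }

    static String closeTag(VertAlign a) {
        switch (a) {
            case SUPERSCRIPT: return "</sup>";
            case SUBSCRIPT:   return "</sub>";
            default:          return "";
        }
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("according to Art 5", VertAlign.BASELINE),
                new Run("1", VertAlign.SUPERSCRIPT));
        System.out.println(renderIgnoringAlign(runs)); // according to Art 51
        System.out.println(renderWithTags(runs));      // according to Art 5<sup>1</sup>
    }
}
```

Tracking the current alignment means consecutive runs with the same alignment share one tag pair, while a change in alignment always produces a boundary in the output.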


