Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/10/07 20:53:26 UTC

[jira] [Commented] (TIKA-1765) Some doc and docx store multiple authors as semi-colon delimited list

    [ https://issues.apache.org/jira/browse/TIKA-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947361#comment-14947361 ] 

Tim Allison commented on TIKA-1765:
-----------------------------------

Would anyone mind if I changed {{OfficeOpenXMLExtended.MANAGER}} from {{Property.externalText}} to {{Property.externalTextBag}}?

I'd like to make manager multi-valued because MS Office Word, Excel and PowerPoint let us store multiple managers, semicolon-delimited, just as they do multiple authors.

I looked through the ECMA standard for any guidance on how multiple authors should be handled, but every example I found shows a single author. There's even less documentation for "manager".
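
For discussion, here is a minimal sketch of the kind of splitting I have in mind. It is plain Java with no Tika dependency, and the class and method names ({{SemicolonSplitter}}, {{split}}) are hypothetical, not existing Tika code. It assumes we split on every semicolon with no escaping, trim whitespace and drop empty entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class SemicolonSplitter {

    private SemicolonSplitter() {
    }

    /**
     * Splits a raw author/manager string on semicolons.
     * Assumption: there is no escaping, so every semicolon is a delimiter.
     * Values are trimmed and empty entries are dropped.
     */
    static List<String> split(String raw) {
        List<String> values = new ArrayList<>();
        if (raw == null) {
            return values;
        }
        for (String part : raw.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }
}
```

With this, {{SemicolonSplitter.split("Jane Doe; John Smith")}} yields two values, which a parser could then add to the metadata one at a time. A name that itself contains a semicolon would also be split in two, since nothing is escaped.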


> Some doc and docx store multiple authors as semi-colon delimited list
> ---------------------------------------------------------------------
>
>                 Key: TIKA-1765
>                 URL: https://issues.apache.org/jira/browse/TIKA-1765
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>
> It looks like doc and docx are storing multiple authors in a single author field delimited by semi-colons.  We should parse this value and add multiple authors where appropriate.
> Notes: when I tried to add an author with a semicolon in the name, the result was two authors...doesn't look like there is any escaping going on.
> We should check to see what's going on in the other MS formats and with other metadata items that are allowed to be multivalued in Dublin Core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)