Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2012/01/23 17:09:40 UTC

[jira] [Commented] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

    [ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191228#comment-13191228 ] 

Nick Burch commented on TIKA-845:
---------------------------------

I think the current logic isn't quite correct. Rather than ending up with a proper multivalued metadata object, we end up with a single string of comma-separated values, which seems wrong to me.

What I've done is fix up the logic, which allows for what looks to be a cleaner way to check for duplicates.
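
To illustrate the idea (this is only a rough sketch, not the code committed for this issue): if each value is stored as its own entry instead of being appended to one delimited string, the duplicate check becomes a simple list lookup. The class MultiValueSketch and its method addIfAbsent are hypothetical names made up for this example; it uses a plain Map as a stand-in and does not touch Tika's own Metadata class. The sample values are the ones from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multivalued metadata store (NOT Tika's Metadata class).
public class MultiValueSketch {
    // Each field name maps to a list of separate values, not one delimited string.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Adds the value as its own entry unless an identical value is already present.
    // Returns true if the value was added, false if it was a duplicate.
    public boolean addIfAbsent(String name, String value) {
        List<String> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (values.contains(value)) {
            return false;
        }
        values.add(value);
        return true;
    }

    // Returns a copy of the values stored under the name (empty list if none).
    public List<String> getValues(String name) {
        List<String> values = fields.get(name);
        return values == null ? new ArrayList<>() : new ArrayList<>(values);
    }

    public static void main(String[] args) {
        MultiValueSketch metadata = new MultiValueSketch();
        // Two source fields mapped onto the same target field.
        for (String v : Arrays.asList("rock", "tree", "dog")) {
            metadata.addIfAbsent("keywords", v);
        }
        for (String v : Arrays.asList("rock", "tree", "K9")) {
            metadata.addIfAbsent("keywords", v);
        }
        System.out.println(metadata.getValues("keywords")); // prints [rock, tree, dog, K9]
    }
}
```

With a comma-joined string, the same check would need splitting and trimming on every add, and would misbehave for values that themselves contain a comma.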

I've also fixed up the single unit test that depended on the old comma concatenation, DcXMLParserTest, to now check for the correct multivalued approach.

I've committed this in r1234873.
                
> Check for Existing Value in Multi-Value Fields in XML Metadata Handler
> ----------------------------------------------------------------------
>
>                 Key: TIKA-845
>                 URL: https://issues.apache.org/jira/browse/TIKA-845
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Ray Gauss II
>             Fix For: 1.1
>
>         Attachments: xml-check-multi-value-existing.diff
>
>
> The XML Abstract metadata handler should check for an existing value for multi-valued fields as well as simple text fields.
> Similar metadata may be stored in multiple fields in the source and a developer may choose to map several source fields to the same tika field, in which case no check is made for duplicates of existing delimited values.
> For example, a developer may want to dump any values contained in legacy IPTC keywords and dc:subject into tika keywords.  If IPTC keywords = ['rock','tree','dog'] and dc:subject = ['rock','tree','K9'] then currently tika keywords = ['rock','tree','dog','rock','tree','K9'] instead of the desired ['rock','tree','dog','K9'].
