Posted to dev@tika.apache.org by "Ray Gauss II (Created) (JIRA)" <ji...@apache.org> on 2012/01/16 20:17:39 UTC

[jira] [Created] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

Check for Existing Value in Multi-Value Fields in XML Metadata Handler
----------------------------------------------------------------------

                 Key: TIKA-845
                 URL: https://issues.apache.org/jira/browse/TIKA-845
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Ray Gauss II
             Fix For: 1.1


The XML Abstract metadata handler should check for an existing value for multi-valued fields as well as simple text fields.

Similar metadata may be stored in multiple fields in the source, and a developer may choose to map several source fields to the same Tika field. In that case, no check is made for duplicates among the existing delimited values.

For example, a developer may want to dump any values contained in legacy IPTC keywords and dc:subject into Tika keywords.  If IPTC keywords = ['rock','tree','dog'] and dc:subject = ['rock','tree','K9'], then Tika keywords is currently ['rock','tree','dog','rock','tree','K9'] instead of the desired ['rock','tree','dog','K9'].
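
The desired behaviour in the example above can be sketched in plain Java. This is only an illustration under stated assumptions: the class KeywordMerge and the helper addDistinct are hypothetical names, not part of Tika, and the lists stand in for whatever the handler actually stores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeywordMerge {
    // Hypothetical helper (not a Tika API): copies the existing values and
    // appends each incoming value only if an equal value is not already present.
    static List<String> addDistinct(List<String> existing, List<String> incoming) {
        List<String> result = new ArrayList<>(existing);
        for (String value : incoming) {
            if (!result.contains(value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> iptcKeywords = Arrays.asList("rock", "tree", "dog");
        List<String> dcSubject = Arrays.asList("rock", "tree", "K9");

        // Map both source fields onto the same target field, skipping duplicates.
        List<String> keywords = addDistinct(new ArrayList<>(), iptcKeywords);
        keywords = addDistinct(keywords, dcSubject);

        System.out.println(keywords); // [rock, tree, dog, K9]
    }
}
```

The order of first appearance is preserved, which matches the desired result listed in the example.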

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

Posted by "Ray Gauss II (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-845:
------------------------------

    Attachment: xml-check-multi-value-existing.diff

Patch to check for existing multi-value.
                


        

[jira] [Commented] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

Posted by "Ray Gauss II (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191805#comment-13191805 ] 

Ray Gauss II commented on TIKA-845:
-----------------------------------

I was following precedent there and wasn't actually even calling that code, since ElementMetadataHandler correctly stores the values as a multivalued object, but you're right, and your changes look spot on.
                


        

[jira] [Commented] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191228#comment-13191228 ] 

Nick Burch commented on TIKA-845:
---------------------------------

I think the current logic isn't quite correct. Rather than ending up with a proper multivalued metadata object, we end up with a single string of comma-separated values, which seems wrong to me.

What I've done is fix up the logic, which allows for what looks to be a cleaner way to check for duplicates.

I've also fixed up the single unit test that depended on the old comma concatenation, DcXMLParserTest, to now check for the correct multivalued approach.

I've committed this in r1234873.
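
To illustrate the distinction drawn in this comment, here is a minimal plain-Java sketch contrasting the two storage styles. It is not the committed change from r1234873: the class MultiValueSketch, both helper methods, and the ", " delimiter are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class MultiValueSketch {
    // Delimited style: every value is folded into one string.
    // The ", " delimiter is an assumption for this sketch.
    static String appendDelimited(String existing, String value) {
        if (existing == null || existing.isEmpty()) {
            return value;
        }
        return existing + ", " + value;
    }

    // Multivalued style: each value is its own entry, so a duplicate
    // check is a plain equality test against the stored entries.
    static void addIfAbsent(List<String> values, String value) {
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    public static void main(String[] args) {
        String delimited = "";
        List<String> multi = new ArrayList<>();
        for (String v : new String[] {"rock", "tree", "rock"}) {
            delimited = appendDelimited(delimited, v);
            addIfAbsent(multi, v);
        }
        System.out.println(delimited); // rock, tree, rock
        System.out.println(multi);     // [rock, tree]
    }
}
```

With the delimited string, detecting a duplicate would require splitting the string back apart first; with separate entries the check needs no parsing.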
                


        

[jira] [Resolved] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-845.
-----------------------------

    Resolution: Fixed
    
