Posted to dev@tika.apache.org by "Ray Gauss II (Created) (JIRA)" <ji...@apache.org> on 2012/01/16 20:17:39 UTC
[jira] [Created] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler
Check for Existing Value in Multi-Value Fields in XML Metadata Handler
----------------------------------------------------------------------
Key: TIKA-845
URL: https://issues.apache.org/jira/browse/TIKA-845
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.0
Reporter: Ray Gauss II
Fix For: 1.1
The XML Abstract metadata handler should check for an existing value for multi-valued fields as well as simple text fields.
Similar metadata may be stored in multiple fields in the source, and a developer may choose to map several source fields to the same tika field. In that case no check is currently made for duplicates among the existing delimited values.
For example, a developer may want to dump any values contained in legacy IPTC keywords and dc:subject into tika keywords. If IPTC keywords = ['rock','tree','dog'] and dc:subject = ['rock','tree','K9'] then currently tika keywords = ['rock','tree','dog','rock','tree','K9'] instead of the desired ['rock','tree','dog','K9'].
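To make the desired result concrete, here is a minimal sketch in plain Java of merging several source fields into one target field without duplicates. It is not the attached patch and does not use any Tika classes; the class name KeywordMergeSketch and the helper mergeWithoutDuplicates are hypothetical, and it assumes duplicates are exact string matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class KeywordMergeSketch {
    // Hypothetical helper (not part of Tika): merges several source fields
    // into one list, keeping first-seen order and dropping exact duplicates.
    @SafeVarargs
    static List<String> mergeWithoutDuplicates(List<String>... sources) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> source : sources) {
            seen.addAll(source);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> iptcKeywords = Arrays.asList("rock", "tree", "dog");
        List<String> dcSubject = Arrays.asList("rock", "tree", "K9");
        // prints [rock, tree, dog, K9]
        System.out.println(mergeWithoutDuplicates(iptcKeywords, dcSubject));
    }
}
```

A LinkedHashSet is used so that the merged keywords keep the order in which they were first seen.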
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler
Posted by "Ray Gauss II (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Gauss II updated TIKA-845:
------------------------------
Attachment: xml-check-multi-value-existing.diff
Patch to check for existing multi-value.
[jira] [Commented] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler
Posted by "Ray Gauss II (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191805#comment-13191805 ]
Ray Gauss II commented on TIKA-845:
-----------------------------------
I was following precedent there, and wasn't actually even calling that code, since ElementMetadataHandler correctly stores values as a multivalued object. But you're right, and your changes look spot on.
[jira] [Commented] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191228#comment-13191228 ]
Nick Burch commented on TIKA-845:
---------------------------------
I think the current logic isn't quite correct. Rather than ending up with a proper multivalued metadata object, we end up with a single string of comma-separated values, which seems wrong to me.
What I've done is fix up the logic, which allows for what looks to be a cleaner way to check for duplicates.
I've also fixed up the single unit test that depended on the old comma concatenation, DcXMLParserTest, so that it now checks for the correct multivalued approach.
I've committed this in r1234873.
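For readers following along, the difference between the two approaches can be sketched roughly as below. This is not the code committed in r1234873; MultiValuedSketch is a hypothetical stand-in for a metadata object, written in plain Java, with each value stored as its own entry so that a duplicate check becomes a simple membership test rather than a search inside a comma-separated string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiValuedSketch {
    // Hypothetical stand-in for a metadata store: field name -> ordered values.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Stores the value as its own entry, skipping it if the field already holds it.
    public void addIfAbsent(String name, String value) {
        List<String> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    // Returns the values stored for a field, or an empty list if there are none.
    public List<String> getValues(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValuedSketch metadata = new MultiValuedSketch();
        String[] incoming = {"rock", "tree", "dog", "rock", "tree", "K9"};
        for (String value : incoming) {
            metadata.addIfAbsent("keywords", value);
        }
        // prints [rock, tree, dog, K9]
        System.out.println(metadata.getValues("keywords"));
    }
}
```

With a single concatenated string, a check such as looking for "dog" inside "rock, tree, dogs" could match by accident; comparing whole values avoids that.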
[jira] [Resolved] (TIKA-845) Check for Existing Value in Multi-Value Fields in XML Metadata Handler
Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-845.
-----------------------------
Resolution: Fixed