Posted to dev@tika.apache.org by "Ray Gauss II (JIRA)" <ji...@apache.org> on 2012/07/03 05:25:02 UTC
[jira] [Resolved] (TIKA-930) Consolidation of Some Tika Core Properties
[ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Gauss II resolved TIKA-930.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.2
Fixed in r1356560.
This ended up being a fairly large commit. Feel free to revert or re-open this issue if I've messed something up.
I've included the commit message here as it describes the majority of the changes:
- Added the Dublin Core Terms namespace and prefix
- Changed DublinCore.CREATOR to multi-valued property
- Consolidated TikaCoreProperties.AUTHOR to TikaCoreProperties.CREATOR
- Removed TikaCoreProperties.LAST_AUTHOR and added TikaCoreProperties.MODIFIER
- Added DublinCore.CREATED
- Consolidated TikaCoreProperties.DATE and TikaCoreProperties.CREATION_DATE to TikaCoreProperties.CREATED
- Consolidated TikaCoreProperties.SAVE_DATE to TikaCoreProperties.MODIFIED
- Updated DublinCore.MODIFIED to correct terms namespace
- Added OfficeOpenXMLCore.SUBJECT
- Consolidated TikaCoreProperties.SUBJECT to TikaCoreProperties.KEYWORDS
- Added several temporary transition properties to TikaCoreProperties to ease migrating previous use of 'subject' to more specific properties and maintain backwards compatibility
  * For most mail-related parsers/handlers, transition subject to dc:title
  * For most office-related parsers/handlers, transition subject to OO cp:subject
- Added TikaCoreProperties.CREATOR_TOOL
- Added TikaCoreProperties.METADATA_DATE
- Added TikaCoreProperties.RATING
- Changed XMP to use common namespace delimiter
- Added Open Office word processing namespace and prefix to OfficeOpenXMLExtended
- Added OfficeOpenXMLExtended.COMMENTS
- Added TikaCoreProperties.COMMENTS which is a composite of OfficeOpenXMLExtended.COMMENTS, ClimateForecast.COMMENT and MSOffice.COMMENTS
- Deprecated MSOffice.COMMENTS
- Changed OpenDocumentMetaParser to accommodate TikaCoreProperties since the XML it processes treats dc:date and dc:subject differently than DcXMLParser
- Changed nextMetadata in TextExtractor to a Property rather than String key
- Changed DcXMLParser to use namespace already defined in DublinCore
- Updated parsers to reflect TikaCoreProperties changes
- Updated tika-xmp to reflect TikaCoreProperties changes
- Registered dcterms namespace in XMPMetadataTest
- Updated tests to reflect new changes and added some tests for backwards compatibility
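For callers updating their own code, the renames listed above can be summarized as a small lookup table. This is only a sketch: the class TikaCorePropertyMigration and its method are hypothetical helpers, not part of Tika, and the table covers just the TikaCoreProperties names mentioned in the commit message (LAST_AUTHOR was removed and MODIFIER added, so it is treated here as a rename).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps the TikaCoreProperties names
// that the commit message says were consolidated or removed to their replacements.
class TikaCorePropertyMigration {

    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        REPLACEMENTS.put("AUTHOR", "CREATOR");
        REPLACEMENTS.put("LAST_AUTHOR", "MODIFIER");   // removed; MODIFIER was added
        REPLACEMENTS.put("DATE", "CREATED");
        REPLACEMENTS.put("CREATION_DATE", "CREATED");
        REPLACEMENTS.put("SAVE_DATE", "MODIFIED");
        REPLACEMENTS.put("SUBJECT", "KEYWORDS");
    }

    /** Returns the replacement name, or the input unchanged if it was not consolidated. */
    static String replacementFor(String oldName) {
        return REPLACEMENTS.getOrDefault(oldName, oldName);
    }

    public static void main(String[] args) {
        // Print each old name next to the name that replaces it.
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            System.out.println("TikaCoreProperties." + e.getKey()
                    + " -> TikaCoreProperties." + e.getValue());
        }
    }
}
```

Note that SUBJECT is a special case: the transition properties mentioned above may be a better target than KEYWORDS for mail-related and office-related parsers.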
> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
> Key: TIKA-930
> URL: https://issues.apache.org/jira/browse/TIKA-930
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Affects Versions: 1.2
> Reporter: Ray Gauss II
> Fix For: 1.2
>
>
> There are a few properties in TikaCoreProperties which overlap, and I think we should minimize ambiguity by consolidating each overlapping group into a single composite property. Each composite would take the clearest name, reference the most general specification as its primary property, and list the other properties and deprecated strings as its secondaries.
> Here's the proposed pseudo-code for the changes:
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT,
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
> Since this would require a bit of refactoring for parsers that use the properties being removed, I thought it best to get some feedback before working up a full patch.
> Does this seem like a reasonable approach?
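To make the primary/secondary idea in the quoted proposal concrete, here is a minimal, self-contained sketch of one way a composite property could behave. It assumes that a write through the composite is mirrored to the primary key and to every secondary key; that behavior should be checked against the actual Tika source. CompositeKeySketch is a hypothetical stand-in rather than Tika's Property class, and the key strings are placeholders for the constants named in the proposal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a composite property; this is NOT Tika's Property class.
class CompositeKeySketch {

    private final String primary;
    private final List<String> secondaries;

    CompositeKeySketch(String primary, String... secondaries) {
        this.primary = primary;
        this.secondaries = Collections.unmodifiableList(Arrays.asList(secondaries));
    }

    // Assumed behavior: one write is mirrored to the primary key and every secondary key.
    void set(Map<String, String> metadata, String value) {
        metadata.put(primary, value);
        for (String key : secondaries) {
            metadata.put(key, value);
        }
    }

    // Builds a KEYWORDS-like composite and writes a single value through it.
    static Map<String, String> demo() {
        CompositeKeySketch keywords = new CompositeKeySketch(
                "dc:subject",    // placeholder for DublinCore.SUBJECT (primary)
                "meta:keyword",  // placeholder for Office.KEYWORDS
                "Keywords",      // placeholder for MSOffice.KEYWORDS
                "subject");      // placeholder for Metadata.SUBJECT
        Map<String, String> metadata = new LinkedHashMap<>();
        keywords.set(metadata, "tika");
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Under that assumption, code that still reads one of the older keys would see the same value as code reading the primary key, which is the backwards-compatibility goal the proposal describes.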