Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2015/04/21 01:53:59 UTC

[jira] [Comment Edited] (TIKA-1607) Introduce new HashMap data structure for persistence of Tika Metadata

    [ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503982#comment-14503982 ] 

Nick Burch edited comment on TIKA-1607 at 4/20/15 11:53 PM:
------------------------------------------------------------

Historically, we've always required that everything on Metadata be a String, both key and value. Properties provide support for converting between Strings and more helpful types, but still allow simple, backwards-compatible fetching for people who don't want that.
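
To make the "strings underneath, typed helpers on top" idea concrete, here is a minimal sketch. The class StringBackedMetadata and its methods are hypothetical stand-ins written for this illustration; they are not Tika's Metadata or Property classes, and the real API may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is NOT Tika's Metadata class.
class StringBackedMetadata {
    // Everything is stored as String key -> list of String values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a raw String value; multi-valued keys keep insertion order.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Typed setter: the int is converted to a String before it is stored.
    void setInt(String key, int value) {
        List<String> single = new ArrayList<>();
        single.add(Integer.toString(value));
        values.put(key, single);
    }

    // Plain String access for callers who do not want typed values.
    String get(String key) {
        List<String> list = values.get(key);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    // All values for a key, always as Strings.
    String[] getValues(String key) {
        List<String> list = values.get(key);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }

    // Typed getter: parses the stored String; null if absent or not an int.
    Integer getInt(String key) {
        String raw = get(key);
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        StringBackedMetadata metadata = new StringBackedMetadata();
        metadata.setInt("width", 640);
        metadata.add("phonenumbers", "number1");
        metadata.add("phonenumbers", "number2");
        System.out.println(metadata.get("width"));
        System.out.println(metadata.getInt("width"));
        System.out.println(metadata.getValues("phonenumbers").length);
    }
}
```

The same stored value can be read either way: old callers keep using get(), and newer callers opt in to getInt().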

Based on the phone number example, this looks somewhat like the streams-style indexed metadata that we've been discussing for video and audio, e.g. "video stream 1 has width 640 + height 480, video stream 2 has width 320 + height 240, audio stream 1 is stereo + 44.1kHz + english" etc.

Maybe we should work to finish that indexed support off? We'd then keep strings everywhere in the metadata, we'd keep backwards compatibility, and we'd keep things consistent between different styles of metadata (video, audio, phone etc!)
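
As a rough illustration of how indexed metadata could stay string-only, here is a sketch that flattens the streams example into plain String keys. The "group:index:field" key convention, the field names, and the class IndexedMetadataSketch are assumptions made up for this example; they are not the plan from the thread, nor an existing Tika format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the key convention below is an assumption, not a Tika format.
class IndexedMetadataSketch {
    // Build a flat String key such as "video:1:width".
    static String key(String group, int index, String field) {
        return group + ":" + index + ":" + field;
    }

    // Collect every field that belongs to one indexed group, e.g. video stream 2.
    static Map<String, String> fieldsOf(Map<String, String> metadata, String group, int index) {
        // The trailing ':' keeps "video:1:" from matching "video:10:...".
        String prefix = group + ":" + index + ":";
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                fields.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return fields;
    }

    // The streams example from the comment, stored as String keys and String values.
    static Map<String, String> example() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(key("video", 1, "width"), "640");
        metadata.put(key("video", 1, "height"), "480");
        metadata.put(key("video", 2, "width"), "320");
        metadata.put(key("video", 2, "height"), "240");
        metadata.put(key("audio", 1, "channels"), "stereo");
        metadata.put(key("audio", 1, "sampleRateHz"), "44100");
        metadata.put(key("audio", 1, "language"), "english");
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(fieldsOf(example(), "video", 2));
        System.out.println(fieldsOf(example(), "audio", 1));
    }
}
```

Because keys and values are all Strings, existing callers that iterate over names and values would keep working; only callers that care about the grouping need to know the convention.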

The thread "How should video files with audio be handled by parsers?" from last summer outlines a plan; [~rgauss] was going to try to prototype it before committing. (That thread already has an example of how contact cards with phone number details might work, which ought to cover your additional phone number details too!)


was (Author: gagravarr):
Historically, we've always required that everything on Metadata be a String, both key and value. Properties provide support for converting between Strings and more helpful types, but still allow simple, backwards-compatible fetching for people who don't want that.

Based on the phone number example, this looks somewhat like the streams-style indexed metadata that we've been discussing for video and audio, e.g. "video stream 1 has width 640 + height 480, video stream 2 has width 320 + height 240, audio stream 1 is stereo + 44.1kHz + english" etc.

Maybe we should work to finish that indexed support off? We'd then keep strings everywhere in the metadata, we'd keep backwards compatibility, and we'd keep things consistent between different styles of metadata (video, audio, phone etc!)

The thread "How should video files with audio be handled by parsers?" from last summer outlines a plan; [~rgauss] was going to try to prototype it before committing.

> Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.9
>
>
> I am currently working on more comprehensive phone number extraction and on enhancing Tika's support for phone number metadata modeling.
> Right now we use the String[] multi-valued support available within Tika to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose that we extend multi-valued support beyond the String[] paradigm by implementing a more abstract Collection of Objects, so that we could implement the phone number use case as follows:
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach; additionally, it is a fundamental change to the core Metadata API. I hope, however, that the <String, Object> mapping is flexible enough to let me model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
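
For comparison with the string-only approach, here is a small sketch of what the proposed <String, Object> shape might look like for the phone number example in the issue description. It simplifies the proposed Collection of HashMaps to a single map from number to details, and the class ObjectMetadataSketch and its helper are hypothetical; nothing here is an existing Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed <String, Object> shape; not an existing Tika API.
class ObjectMetadataSketch {
    // Build the phone number example: number -> (detail name -> detail value).
    static Map<String, Object> build() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");

        Map<String, Map<String, String>> phoneNumbers = new LinkedHashMap<>();
        phoneNumbers.put("+162648743476", first);
        phoneNumbers.put("+1292611054", second);

        // The top-level value is typed as Object, so any structure can be stored.
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", phoneNumbers);
        return metadata;
    }

    // Reading a detail back requires a type check and an unchecked cast.
    @SuppressWarnings("unchecked")
    static String countryCodeOf(Map<String, Object> metadata, String number) {
        Object raw = metadata.get("phonenumbers");
        if (!(raw instanceof Map)) {
            return null;
        }
        Map<String, Map<String, String>> numbers = (Map<String, Map<String, String>>) raw;
        Map<String, String> details = numbers.get(number);
        return details == null ? null : details.get("LibPN-CountryCode");
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = build();
        System.out.println(countryCodeOf(metadata, "+162648743476"));
        System.out.println(countryCodeOf(metadata, "+1292611054"));
    }
}
```

The instanceof check and cast in countryCodeOf show the cost that comes with the flexibility: every reader has to know what type sits behind each key, which is part of the backwards compatibility concern raised above.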


