Posted to user@tika.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/02/27 00:25:04 UTC

Improvement in Metadata Class

Hi,
(This is maybe traffic for dev@ but I hope it is OK here on user@)

1.
In Apache Nutch we are using the Metadata class [0] as follows
if (tikaMDName.equalsIgnoreCase(Metadata.TITLE)) continue;
TITLE value is deprecated and I want to upgrade API usage.
What should I be using?

2.
I would like to contribute to the Tika Java documentation for this as I am
not happy with the current Java documentation for this class.

3.
We also currently maintain a legacy Metadata package [1] within Nutch. This
is a multi-valued Metadata container including sets of constant fields for
Nutch webpage and host metadata.
How much of this stuff do we actually need to be maintaining? Should we
not be leveraging more of what is available within Apache Tika for
Metadata fields? Or is this a case of the more the merrier?
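
For readers unfamiliar with the term, a "multi-valued Metadata container
with constant fields" can be sketched roughly as below. This is only an
illustration under stated assumptions: the class name, the two constants and
the method names are hypothetical, and the code is neither the Nutch package
at [1] nor Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: neither org.apache.nutch.metadata.Metadata nor Tika's Metadata.
final class MultiValuedMetadataSketch {

    // Example constant field names (assumed for illustration only).
    static final String CONTENT_TYPE = "Content-Type";
    static final String LANGUAGE = "language";

    // Insertion-ordered map from a field name to every value recorded for it.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the same name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    /** Returns the first value for the name, or null when none is stored. */
    String get(String name) {
        List<String> stored = values.get(name);
        return stored == null ? null : stored.get(0);
    }

    /** Returns all values for the name as a read-only list (empty when absent). */
    List<String> getValues(String name) {
        List<String> stored = values.get(name);
        return stored == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(stored);
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch metadata = new MultiValuedMetadataSketch();
        metadata.add(LANGUAGE, "en");
        metadata.add(LANGUAGE, "gd");
        System.out.println(metadata.getValues(LANGUAGE)); // prints [en, gd]
    }
}
```

The point of the sketch is only that such a container is small; whether Nutch
needs its own copy of one is exactly the question above.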

Thank you very much in advance. I look forward to hearing back from anyone
on this. I am at ApacheCon just now and will cook up a patch based on the
feedback.

Lewis

[0]
http://tika.apache.org/1.3/api/index.html?org/apache/tika/metadata/Metadata.html
[1]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/metadata/
-- 
*Lewis*

Re: Improvement in Metadata Class

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Very helpful as ever Nick.
See you later. Get the lager ready.
Lewis

On Tue, Feb 26, 2013 at 4:37 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Tue, 26 Feb 2013, Lewis John Mcgibbney wrote:
>
>> 1.
>> In Apache Nutch we are using the Metadata class [0] as follows
>> if (tikaMDName.equalsIgnoreCase(Metadata.TITLE)) continue;
>> TITLE value is deprecated and I want to upgrade API usage.
>> What should I be using?
>>
>
> I was going to say "Just check the javadocs", then I spotted
>
>     // These properties are being moved to a new Tika core properties definition, javadocs will be added once it's available
>
> :(
>
> You'll want to look at TikaCoreProperties. In there, you can see that the
> old TITLE is now an alias for the new Property-based DublinCore.TITLE. For
> now, it'll probably be simplest for you to change something like
> Metadata.TITLE to TikaCoreProperties.TITLE.
>
> You'll notice that the old one was just a String, whereas the new one is
> Property based, so you can get more information on what it contains and
> what'll be there.
>
>
>  2.
>> I would like to contribute to the Tika Java documentation for this as I am
>> not happy with the current Java documentation for this class.
>>
>
> TIKA-925, TIKA-928 and TIKA-930 (from a quick google) may help you
> understand why we made the changes. When it makes sense, some doc
> improvements for it would be amazing :)
>
>
>  Thank you very much in advance. I look forward to hearing back from anyone
>> on this, I am at ApacheCon just now and will cook up a patch based on the
>> feedback. Thank you.
>>
>
> There's a Tika meetup tonight in Galleria 2:
>   http://wiki.apache.org/apachecon/ApacheMeetupsNA13
>
> Come along!
>
> Nick
>



-- 
*Lewis*

Re: Improvement in Metadata Class

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 26 Feb 2013, Lewis John Mcgibbney wrote:
> 1.
> In Apache Nutch we are using the Metadata class [0] as follows
> if (tikaMDName.equalsIgnoreCase(Metadata.TITLE)) continue;
> TITLE value is deprecated and I want to upgrade API usage.
> What should I be using?

I was going to say "Just check the javadocs", then I spotted

     // These properties are being moved to a new Tika core properties definition, javadocs will be added once it's available

:(

You'll want to look at TikaCoreProperties. In there, you can see that the
old TITLE is now an alias for the new Property-based DublinCore.TITLE. For
now, it'll probably be simplest for you to change something like
Metadata.TITLE to TikaCoreProperties.TITLE.

You'll notice that the old one was just a String, whereas the new one is
Property based, so you can get more information on what it contains and
what'll be there.
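
As a rough illustration of what the String-to-Property change means for a
check like the Nutch one, here is a minimal, self-contained sketch. Everything
in it is a stand-in: PropertyKey, OLD_TITLE, NEW_TITLE and the name "dc:title"
are assumptions made for the example, not Tika's classes or values. With the
real library the equivalent comparison would presumably be against
TikaCoreProperties.TITLE.getName(); check the javadocs for your Tika version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, NOT Tika classes: they only mimic the shape of the change
// (a plain String constant replaced by a Property-like object that carries a name).
final class TitleKeySketch {

    /** Stand-in for a Property-style key: a name plus a little extra information. */
    static final class PropertyKey {
        private final String name;
        private final boolean multiValued;

        PropertyKey(String name, boolean multiValued) {
            this.name = name;
            this.multiValued = multiValued;
        }

        String getName() { return name; }
        boolean isMultiValued() { return multiValued; }
    }

    // Old style: the key is just a string (assumed value, for illustration).
    static final String OLD_TITLE = "title";

    // New style: the key is an object; the name "dc:title" is an assumption here.
    static final PropertyKey NEW_TITLE = new PropertyKey("dc:title", false);

    /** Mirrors the check from the thread, but compares against the key's name. */
    static boolean isTitle(String metadataName) {
        return metadataName.equalsIgnoreCase(NEW_TITLE.getName());
    }

    /** Returns the names that survive the "skip the title" filter. */
    static List<String> withoutTitle(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isTitle(name)) continue; // same control flow as the original snippet
            kept.add(name);
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [Content-Type]
        System.out.println(withoutTitle(Arrays.asList("DC:Title", "Content-Type")));
    }
}
```

One thing the sketch makes visible: if the name behind the new key differs
from the old string, the comparison will match different entries than before,
so it is worth checking which names actually show up in your metadata.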

> 2.
> I would like to contribute to the Tika Java documentation for this as I am
> not happy with the current Java documentation for this class.

TIKA-925, TIKA-928 and TIKA-930 (from a quick google) may help you 
understand why we made the changes. When it makes sense, some doc 
improvements for it would be amazing :)

> Thank you very much in advance. I look forward to hearing back from anyone
> on this, I am at ApacheCon just now and will cook up a patch based on the
> feedback. Thank you.

There's a Tika meetup tonight in Galleria 2:
   http://wiki.apache.org/apachecon/ApacheMeetupsNA13

Come along!

Nick

Re: Improvement in Metadata Class

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Oh and thanks for taking the patch into Tika. I hope it will be a *bit*
clearer for folks in a similar position as us (in Nutch) to see exactly
what should be pulled from Tika.
Lewis

On Wed, Mar 6, 2013 at 10:49 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Chris,
> Thanks for the input.
> RE #3: Yes, Sebastien and I are now discussing this and will address it
> within NUTCH-1537.
> Thanks
> Lewis

Re: Improvement in Metadata Class

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Chris,
Thanks for the input.
RE #3: Yes, Sebastien and I are now discussing this and will address it
within NUTCH-1537.
Thanks
Lewis

On Sun, Mar 3, 2013 at 9:41 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

>  Hey Lewis,
>
>  RE: #3 — it would be great to get Nutch using Tika's metadata container
> — I don't think we have anything special in Nutch that prevents it.
> RE: #2 — I committed your Tika doc patch during ApacheCon NA 2013 so
> thanks!
>
>  Thanks!
>
>  Cheers,
> Chris

Re: Improvement in Metadata Class

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Lewis,

RE: #3 — it would be great to get Nutch using Tika's metadata container — I don't think we have anything special in Nutch that prevents it.
RE: #2 — I committed your Tika doc patch during ApacheCon NA 2013 so thanks!

Thanks!

Cheers,
Chris

