Posted to user@tika.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/11/30 17:58:12 UTC

Upgrading Solr to Tika 0.8

I'm trying to upgrade Solr's version of Tika to 0.8 (https://issues.apache.org/jira/browse/SOLR-2241), but am getting some new metadata, or, I should say modified metadata names.

Namely, before I had a key named Keywords and now it is named AAPL:Keywords.  I'm not sure if this is due to the fact that the file was generated on a Mac or if it is because I am running on a Mac, but either way it is a bit troublesome.  Any insight on how to fix/resolve this?

-Grant

Re: Upgrading Solr to Tika 0.8

Posted by ceesjm1 <sc...@microstormsoftware.com>.
Jukka Zitting <jz...@...> writes:

> 
> Hi,
> 
> From: Grant Ingersoll [mailto:gsingers <at> apache.org]
> > Hmm, it does look like I'm still getting the Keywords, but this
> > AAPL:Keywords is an additional one.  Looks like it is coming from
> > PDFBox.  I will update my tests.
> 
> 0.8 exposes quite a bit more document metadata, and in some cases these additional fields duplicate previously exposed information. For backwards compatibility we didn't remove the old metadata fields even in cases where the new field is more accurately named or formatted.
> 
> In Tika 1.0 we probably should review all such cases and drop the old metadata fields to avoid confusion later on, so you may want to prepare for some extra upgrade work with 1.0.
> 
> BR,
> 
> Jukka Zitting
> 


Hi there,

Does this mean that a Solr upgrade to Tika 0.8 is fine then with the exception
that Tika will expose additional metadata?

Just about to attempt the upgrade...

Cheers,

Scot


RE: Upgrading Solr to Tika 0.8

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,

From: Grant Ingersoll [mailto:gsingers@apache.org]
> Hmm, it does look like I'm still getting the Keywords, but this
> AAPL:Keywords is an additional one.  Looks like it is coming from
> PDFBox.  I will update my tests.

0.8 exposes quite a bit more document metadata, and in some cases these additional fields duplicate previously exposed information. For backwards compatibility we didn't remove the old metadata fields even in cases where the new field is more accurately named or formatted.

In Tika 1.0 we probably should review all such cases and drop the old metadata fields to avoid confusion later on, so you may want to prepare for some extra upgrade work with 1.0.

BR,

Jukka Zitting

Re: Upgrading Solr to Tika 0.8

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 30, 2010, at 11:58 AM, Grant Ingersoll wrote:

> I'm trying to upgrade Solr's version of Tika to 0.8 (https://issues.apache.org/jira/browse/SOLR-2241), but am getting some new metadata, or, I should say modified metadata names.
> 
> Namely, before I had a key named Keywords and now it is named AAPL:Keywords.  I'm not sure if this is due to the fact that the file was generated on a Mac or if it is because I am running on a Mac, but either way it is a bit troublesome.  Any insight on how to fix/resolve this?

Hmm, it does look like I'm still getting the Keywords, but this AAPL:Keywords is an additional one.  Looks like it is coming from PDFBox.  I will update my tests.

-Grant
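
For anyone updating tests the same way, here is a minimal sketch of one way to spot prefixed keys such as AAPL:Keywords that sit alongside an unprefixed key of the same local name. It uses a plain java.util.Map as a stand-in for extracted metadata so it runs without Tika on the classpath. The class name MetadataKeyCheck, the method findPrefixedDuplicates, and the sample values are hypothetical and not part of Tika or Solr. With Tika itself, the map could presumably be filled from Metadata.names() and Metadata.get(name).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MetadataKeyCheck {

    /**
     * Finds keys of the form "prefix:Local" whose local part (the text after
     * the last ':') is also present as a key of its own.
     * Returns a sorted map of prefixed key -> matching unprefixed key.
     */
    static Map<String, String> findPrefixedDuplicates(Map<String, String> metadata) {
        Map<String, String> duplicates = new TreeMap<>();
        for (String name : metadata.keySet()) {
            int colon = name.lastIndexOf(':');
            // Skip keys with no prefix or with nothing after the colon.
            if (colon < 0 || colon == name.length() - 1) {
                continue;
            }
            String local = name.substring(colon + 1);
            if (metadata.containsKey(local)) {
                duplicates.put(name, local);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical sample values; real values depend on the document parsed.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Keywords", "solr, tika");
        metadata.put("AAPL:Keywords", "solr, tika");
        metadata.put("Content-Type", "application/pdf");

        // Lists each additional prefixed key next to the older key it mirrors.
        System.out.println(findPrefixedDuplicates(metadata));
    }
}
```

A test that asserts on the exact set of metadata keys could use a check like this to tolerate the additional fields while still verifying that the original Keywords entry is present.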