Posted to dev@oodt.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/04/03 19:01:59 UTC

Re: Tika Based Metadata Extraction

Hi Tom,

On Friday, April 3, 2015, Tom Barber <to...@meteorite.bi> wrote:

> Hello Chaps and Chapesses,
>
> Somehow I've come this far and not done it, but I was playing around with
> the crawler for my ApacheCon demo and came across the
> TikaCmdLineMetExtractor that I believe Rishi wrote a while ago.
> So I've put some stuff in a folder and can crawl and ingest it using the
> GenericFile element map. In the past, to map metadata, I've written a
> class to pump the data around and add it to that file,


To what file ?


> but I was wondering, since I know what fields are coming out of Tika,
> whether I could just put them into the XML mapping file somehow so I can
> bypass having to write Java code?


Well, Tika will make a best effort to pull out as much metadata as possible.
Chris explains a good bit about this here:

 https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help

I think that if custom extractions are required... you could most likely
extend the extractor interface and implement it, but... this is Java code,
which I assume you are trying to avoid?
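
If it comes to that, the Java needn't be much. Here's a rough sketch of a
pass-through helper (the class is hypothetical - it just copies whatever
Tika finds into a CAS Metadata object, rather than implementing the actual
extractor interface):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.oodt.cas.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PassThroughTikaExtractor {

    /** Run Tika on the file and copy every field it finds into CAS metadata. */
    public static Metadata extract(File product) throws Exception {
        org.apache.tika.metadata.Metadata tikaMet =
            new org.apache.tika.metadata.Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = new FileInputStream(product)) {
            // -1 disables the content write limit; we only want the metadata
            parser.parse(in, new BodyContentHandler(-1), tikaMet,
                new ParseContext());
        }
        Metadata casMet = new Metadata();
        for (String name : tikaMet.names()) {
            for (String value : tikaMet.getValues(name)) {
                casMet.addMetadata(name, value); // multi-valued keys survive
            }
        }
        return casMet;
    }
}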


> This may be very obvious, in which case I apologise, but I can't find owt on
> the wiki so I figured I'd ask the gurus.
>
>



-- 
*Lewis*

Re: Tika Based Metadata Extraction

Posted by Tom Barber <to...@meteorite.bi>.
Indeed, fair enough chief, I'll investigate your pointers.

Thanks

Tom

On Sat, Apr 04, 2015 at 09:30:07PM +0000, Mattmann, Chris A (3980) wrote:
>You’re on the right track Tom - I’m just trying to save you
>having to use the XMLValidationLayer - in reality you want something
>like that, but one that will accept * patterns.

Re: Tika Based Metadata Extraction

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
You’re on the right track Tom - I’m just trying to save you
having to use the XMLValidationLayer - in reality you want something
like that, but one that will accept * patterns.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Tika Based Metadata Extraction

Posted by Tom Barber <to...@analytical-labs.com>.
It seems to me (without looking at the source for Chris's examples) that either it's more complex than I imagined or I'm just bad at explaining stuff.

My understanding is that, using the crawler, the TikaCmdLineMetExtractor creates a met file on the fly?

Within a met file is the metadata associated with a product you are about to ingest.

Those met files map to a product mapping file in the filemgr policy area. Tika already extracts lots of metadata, so does that get put in the .met file, where I can map it directly to a product-map-element file like this:

<type id="urn:oodt:ImageFile">
    <element id="urn:oodt:ProductReceivedTime"/>
    <element id="urn:oodt:ProductName"/>
    <element id="urn:oodt:ProductId"/>
    <element id="urn:oodt:ProductType"/>
    <element id="urn:oodt:ProductStructure"/>
    <element id="urn:oodt:Filename"/>
    <element id="urn:oodt:FileLocation"/>
    <element id="urn:oodt:MimeType"/>
    <element id="urn:test:DataVersion"/>
    <element id="urn:tika:SomejpegData"/>
</type>
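
For what it's worth, I believe the .met file the extractor writes is just
standard CAS metadata XML, so extended Tika fields would show up as extra
keyvals, something like this (the key and value here are made up):

<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
    <keyval>
        <key>MimeType</key>
        <val>image/jpeg</val>
    </keyval>
    <keyval>
        <key>SomejpegData</key>
        <val>value extracted by Tika</val>
    </keyval>
</cas:metadata>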

I would have thought that would have made ingestion of extended metadata without having to write code far easier, but I couldn't find an example.

Clearly by now I could have debugged the source code :) so I guess I'll do that this evening and see who is correct (or how bad I am at explaining stuff).


Tom



Re: Tika Based Metadata Extraction

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
The suggestion I have would be to whip up a quick implementation
of a LenientValidationLayer that takes in a Catalog implementation.
If it’s the DataSource/MappedDataSource/ScienceData catalog, you:

1. iterate over all product types and then get one hit from each,
getting their metadata, and using that to “infer” what the elements
are. I would do this statically, once per product type, and update
it based on a cache timeout (every 5 mins or so).

If it’s the LuceneCatalog / SolrCatalog, yay, it’s Lucene, and you should be
able to ask it for the TermVocabulary and/or all the fields present
in the index. Single call. Easy.
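
With SolrJ, for instance, that single call is a Luke request - a minimal
sketch (client construction varies by SolrJ version, so treat the names
as approximate, and the core URL is illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class ListIndexFields {
    public static void main(String[] args) throws Exception {
        // point this at the filemgr's Solr core
        try (SolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/oodt")) {
            LukeResponse luke = new LukeRequest().process(solr);
            // every field actually present in the index, in one call
            luke.getFieldInfo().keySet().forEach(System.out::println);
        }
    }
}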

Another way to do it would be to build a Lucene/Solr, and a
DataSource/Mapped/ScienceData, Lenient Val Layer that simply takes a ref
to the Catalog and/or Database, ignores having to go through the Catalog
interface, and then just gets the info you need (and lets all fields
through and returns them the same).
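
Concretely, combining those two ideas, the inference bit might look roughly
like this (filemgr class and method names here are from memory and
approximate, and the per-product-type cache is omitted):

import java.util.ArrayList;
import java.util.List;

import org.apache.oodt.cas.filemgr.catalog.Catalog;
import org.apache.oodt.cas.filemgr.structs.Element;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.ProductType;
import org.apache.oodt.cas.metadata.Metadata;

public class LenientValidationLayer /* implements ValidationLayer */ {

    private final Catalog catalog;

    public LenientValidationLayer(Catalog catalog) {
        this.catalog = catalog;
    }

    /** Infer the element list from the first hit of the given product type. */
    public List<Element> getElements(ProductType type) throws Exception {
        List<Element> elements = new ArrayList<Element>();
        List<Product> hits = catalog.getTopNProducts(1, type);
        if (hits != null && !hits.isEmpty()) {
            Metadata met = catalog.getMetadata(hits.get(0));
            for (String key : met.getAllKeys()) {
                Element e = new Element();
                e.setElementId("urn:oodt:" + key);
                e.setElementName(key);
                elements.add(e); // let everything through, no policy file needed
            }
        }
        return elements;
    }
}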

HTH,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Tika Based Metadata Extraction

Posted by Tom Barber <to...@meteorite.bi>.
Sorry - the product element mapping file in my filemgr policy; by default you
have the genericfile policy. So if I run the tika-app over a JPEG file, for
example, I can see all the EXIF data etc. in fields. Can I just map that to a
product type without writing code?
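
(For reference, the tika-app run I mean is just something like
"java -jar tika-app-1.x.jar --metadata some.jpg", which prints every
extracted field, if memory serves.)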

Tom