
Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/02 20:01:21 UTC

Re: fields foreach document

Hi Eyeris,

The index-more plugin does this once you have a configuration you are
satisfied with. Markus committed some nice work making mimeType
extraction configurable: it keeps backward compatibility with the
current behavior, where type is indexed as a multi-valued field (split
on type/subtype), but also allows the entire MIME type to be indexed
as a single value.
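
To make the two indexing modes concrete, here is a minimal Java sketch. It is not the plugin's code: the class name MimeTypeFieldSketch, the method typeValues, and the exact values and their order are assumptions chosen for illustration. The index-more source remains the reference for what is really emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the index-more plugin.
public class MimeTypeFieldSketch {

    // Returns the values that would be stored in the "type" field.
    // splitParts = true  -> multi-valued: full MIME type, primary type, subtype
    // splitParts = false -> single value: the full MIME type only
    static List<String> typeValues(String mimeType, boolean splitParts) {
        List<String> values = new ArrayList<>();
        values.add(mimeType);
        if (splitParts) {
            int slash = mimeType.indexOf('/');
            // Split only when both sides of the slash are non-empty.
            if (slash > 0 && slash < mimeType.length() - 1) {
                values.add(mimeType.substring(0, slash));
                values.add(mimeType.substring(slash + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(typeValues("application/pdf", true));  // [application/pdf, application, pdf]
        System.out.println(typeValues("application/pdf", false)); // [application/pdf]
    }
}
```

With splitting enabled, the type field must be multi-valued in the Solr schema. With splitting disabled, a single-valued field is enough.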

Have a poke around with the code for the plugin and you'll get to grips with it.

Lewis

On Thu, Apr 26, 2012 at 8:27 PM, Ing. Eyeris Rodriguez Rueda
<er...@uci.cu> wrote:
> Hello, I'm using Nutch with Solr and I need to know, for each type of document crawled by Nutch (pdf, docx, ppt), which fields are recognized in each document. I know that the Tika parser is in charge of parsing the documents found during the crawl, but I need to know the fields for all document types supported by Nutch.
> Any help would be appreciated.
>
> 10th ANNIVERSARY OF THE CREATION OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci

-- 
Lewis

RE: fields foreach document

Posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu>.

Thanks, Markus and Lewis, for your answers.
Problem solved using the index-metadata plugin and the Tika parser plugin, as described on the wiki: http://wiki.apache.org/nutch/TikaPlugin

_____________________________________________________________________
Ing. Eyeris Rodriguez Rueda
Telephone: 837-3370
Universidad de las Ciencias Informáticas
_____________________________________________________________________

Re: fields foreach document

Posted by Markus Jelsma <ma...@openindex.io>.

 The shipped schema.xml file should also list all fields Nutch can
 emit, including the plugin each one comes from.
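
One way to get a quick overview of those fields is to read the field names out of schema.xml programmatically. The sketch below uses only the JDK's XML parser. The class name SchemaFieldLister and the inline sample schema are made up for illustration, and the sample is not copied from the shipped file. It assumes fields are declared as field elements with a name attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of Nutch or Solr.
public class SchemaFieldLister {

    // Collects the "name" attribute of every <field> element in the schema text.
    static List<String> fieldNames(String schemaXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                names.add(((Element) fields.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; replace with the contents of your own schema.xml.
        String sample =
              "<schema name=\"sample\">"
            + "<fields>"
            + "<field name=\"title\" type=\"text\"/>"
            + "<field name=\"type\" type=\"string\" multiValued=\"true\"/>"
            + "</fields>"
            + "</schema>";
        System.out.println(fieldNames(sample)); // [title, type]
    }
}
```

The comments in the schema file that name the source plugin are not captured by this sketch, so reading the file directly is still the way to see which plugin emits which field.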


-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350