Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/05/02 20:01:21 UTC
Re: fields foreach document
Hi Eyeris,
The index-more plugin does this once you have it configured to your
satisfaction. Markus committed some nice stuff making mimeType extraction
configurable. It stays backward compatible with the current behavior, where
type is indexed as a multi-valued field (splitting on the type/subtype),
but it also allows the entire MIME type to be indexed as a single value.
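
To make the two behaviors concrete, here is a minimal Python sketch (not the plugin's Java code). The function name and the split_parts flag are hypothetical and do not correspond to the plugin's real configuration property; it also assumes that in split mode the full MIME type is kept alongside its two parts.

```python
def mime_type_values(mime_type, split_parts=True):
    """Return the values that would go into the 'type' field.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    full = mime_type.strip().lower()
    if not split_parts:
        # Single-valued mode: index the entire MIME type as one value.
        return [full]
    # Multi-valued mode (assumption: full type plus type and subtype).
    values = [full]
    values.extend(part for part in full.split("/", 1) if part)
    return values


print(mime_type_values("application/pdf"))
print(mime_type_values("application/pdf", split_parts=False))
```

With splitting on, a query for either "application" or "pdf" would match the document; with it off, only the exact full type would.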
Have a poke around with the code for the plugin and you'll get to grips with it.
Lewis
On Thu, Apr 26, 2012 at 8:27 PM, Ing. Eyeris Rodriguez Rueda
<er...@uci.cu> wrote:
> Hello, I'm using Nutch with Solr and I need to know, for each type of document crawled by Nutch (PDF, DOCX, PPT), which fields are recognized in each document. I know that the Tika parser is in charge of parsing the documents found during the crawl, but I need to know the fields for every document type Nutch supports.
> Any help would be appreciated.
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
--
Lewis
RE: fields foreach document
Posted by "Ing. Eyeris Rodriguez Rueda" <er...@uci.cu>.
Thanks Markus and Lewis for your answers.
Problem solved with the index-metadata and Tika parser plugins, as described on the wiki: http://wiki.apache.org/nutch/TikaPlugin
_____________________________________________________________________
Ing. Eyeris Rodriguez Rueda
Phone: 837-3370
Universidad de las Ciencias Informáticas
_____________________________________________________________________
Re: fields foreach document
Posted by Markus Jelsma <ma...@openindex.io>.
The shipped schema.xml file should also list all the fields Nutch can
emit, including the plugin each field comes from.
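
As a rough illustration of reading that information out of a schema file, the Python sketch below groups field names by the XML comment that precedes them. The sample schema is a hypothetical excerpt written for this example, not copied from Nutch, and the assumption that each plugin's fields sit under a comment naming the plugin should be checked against your own schema.xml. It needs Python 3.8 or later for insert_comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of a Solr schema.xml (not the shipped file).
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <!-- fields for index-basic plugin -->
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
  </fields>
</schema>"""


def fields_by_comment(schema_xml):
    """Map each comment inside <fields> to the field names that follow it."""
    # Keep comments in the tree; the default parser discards them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(schema_xml, parser=parser)
    groups = {}
    current = "(no comment)"
    for node in root.find("fields"):
        if node.tag is ET.Comment:
            current = node.text.strip()
        elif node.tag == "field":
            groups.setdefault(current, []).append(node.get("name"))
    return groups


for comment, names in fields_by_comment(SAMPLE_SCHEMA).items():
    print(comment, "->", ", ".join(names))
```

To try it on a real file, read the file's text and pass it to fields_by_comment in place of SAMPLE_SCHEMA.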
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350