
Posted to dev@nutch.apache.org by "Javier P. L." <li...@gmail.com> on 2006/11/07 11:23:24 UTC

Modifying Nutch Indexer

Hi, 


I need to modify the Nutch Indexer class because I would find it very
useful to add some fields to the generated Lucene index. While
experimenting, I found that it is possible to add fields to the Document
with doc.addField() in the reduce function. The problem is that to
compute those fields I need the HTML content of the web page, and it
does not seem to be present in the Document yet: I get a null pointer
exception with getField("content"). Maybe that is not the correct way to
access it, or not the correct place. So how and where can I access the
HTML content of the document, so that I can add a new field to the
Lucene Document and hence to the generated index?

Any advice will be very helpful, 


Thanks in advance. 

Javier.

Re: Modifying Nutch Indexer

Posted by "Javier P. L." <li...@gmail.com>.

On Tue, 07-11-2006 at 15:01 +0200, Enis Soztutar wrote:
> Hi,
> 
> You do not need to change the indexer code to add new fields to the
> index. You need to implement an indexing filter and add it to your
> configuration during indexing. You can look at the code of
> index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter).
> The IndexingFilter interface has a filter() method which takes the
> document, parse, url, CrawlDatum and inlinks as arguments, so you
> readily have the content of the document to be indexed.
> 
> You can look at the tutorial on implementing a plugin from the wiki.
> 
> Best wishes.
> 
> 

Thanks for the help. I did what you said, but now I have a question:
where can I extract the HTML code of the document from, i.e. the
equivalent of bean.getContent(details)? I need it for the new fields
that I want to add in the indexing plugin. I tried Parse and
CrawlDatum, but the most I got was the text parsed from the HTML code.
Does anyone know how to get it?


Thanks in advance,

Javier

Re: Modifying Nutch Indexer

Posted by Enis Soztutar <en...@gmail.com>.

Hi,

You do not need to change the indexer code to add new fields to the
index. You need to implement an indexing filter and add it to your
configuration during indexing. You can look at the code of
index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter).
The IndexingFilter interface has a filter() method which takes the
document, parse, url, CrawlDatum and inlinks as arguments, so you
readily have the content of the document to be indexed.
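
To make the idea concrete, here is a minimal, self-contained sketch of the
indexing-filter pattern described above. It does not use the Nutch classes:
SketchIndexingFilter, WordCountFilter, runFilters and the "wordCount" field
are hypothetical stand-ins invented for illustration, the document is
modelled as a plain Map, and the CrawlDatum and inlinks arguments are left
out. It only shows the shape of the approach: each filter receives the
document plus the parsed text and adds its own field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the IndexingFilter interface (NOT the Nutch API).
// The real filter() also receives a CrawlDatum and the inlinks.
interface SketchIndexingFilter {
    Map<String, String> filter(Map<String, String> doc, String parseText, String url);
}

// Example filter: adds one extra field computed from the parsed text.
class WordCountFilter implements SketchIndexingFilter {
    @Override
    public Map<String, String> filter(Map<String, String> doc, String parseText, String url) {
        String trimmed = parseText.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        doc.put("wordCount", Integer.toString(words)); // hypothetical field name
        return doc;
    }
}

class IndexingFilterSketch {
    // Applies every configured filter in order to a fresh document,
    // mimicking how an indexer would hand the document to each plugin.
    static Map<String, String> runFilters(List<SketchIndexingFilter> filters,
                                          String parseText, String url) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        for (SketchIndexingFilter f : filters) {
            doc = f.filter(doc, parseText, url);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<SketchIndexingFilter> filters = new ArrayList<>();
        filters.add(new WordCountFilter());
        Map<String, String> doc =
            runFilters(filters, "Nutch is a web crawler", "http://example.org/");
        System.out.println(doc);
    }
}
```

In a real plugin the same logic would live in a class implementing the
Nutch IndexingFilter interface, with the field added to the Lucene
Document. As far as I know, the plugin is then enabled by listing it in
the plugin.includes property of the configuration, but check this against
the plugin tutorial for your Nutch version.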

You can look at the tutorial on implementing a plugin from the wiki.

Best wishes.