
Posted to user@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/10/01 09:14:31 UTC

a few questions

Tempted to do each question as a separate email, but
here you go.

1.  Does nutch use pure lucene for its indexing?  Does
the nutch index = lucene + potentially ndfs?  If I am
going to run a search web service, I am just wondering
what advantages nutch would serve over lucene.

2.  Turns out I am going to write a web service for
search.  I have played with the nutch search example,
but if I want to do rather arbitrary key/value pairs
and have a web service return xml, I am guessing I am
going to have to write my own.  Is that right?  Is
there an easy way to get results in xml format? 
Guessing I need to build it all myself.

3.  In another project, I want to use ndfs to store
two+ distinct copies of a file, but I really don't
want anything else to do with nutch on the project. 
Is that possible?  Is there a clean break?  I want to
make a list of servers, then have an api call that
takes a file and stores 2+ copies across my servers,
and an api call that reads a file, with appropriate
failover.

4.  Guessing I write a plugin, but I want to interject
some code during the nutch crawl process that does
some analysis and actually does the index insertion
itself.  There any good docs on how to do such a
thing?

Thanks,
Earl


		

Re: a few questions

Posted by Piotr Kosiorowski <pk...@gmail.com>.

Earl Cahill wrote:
> Tempted to do each question as a separate email, but
> here you go.
> 
> 1.  Does nutch use pure lucene for its indexing?  Does
> the nutch index = lucene + potentially ndfs?  If I am
> going to run a search web service, I am just wondering
> what advantages nutch would serve over lucene.

Yes, it uses Lucene for indexing. But Nutch is more than Lucene + NDFS:
it also contains a fetcher, WebDB maintenance, map/reduce and plenty of plugins.

> 
> 2.  Turns out I am going to write a web service for
> search.  I have played with the nutch search example,
> but if I want to do rather arbitrary key/value pairs
> and have a web service return xml, I am guessing I am
> going to have to write my own.  Is that right?  Is
> there an easy way to get results in xml format? 
> Guessing I need to build it all myself.
There is a servlet, OpenSearchServlet, that returns XML.
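
A minimal sketch of how a client might consume that XML from Java, using
only the JDK. The endpoint path (/opensearch), the host and port, and the
"query" and "hitsPerPage" parameter names are assumptions here, so check
them against the web.xml of your deployment. The sample response is
hand-written RSS-style XML for illustration, not real servlet output.

```java
import java.io.ByteArrayInputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchResults {

    // Builds a request URL. The base URL and the parameter names
    // ("query", "hitsPerPage") are assumptions, not confirmed by this thread.
    public static String buildUrl(String base, String query, int hitsPerPage) throws Exception {
        return base + "?query=" + URLEncoder.encode(query, "UTF-8")
                + "&hitsPerPage=" + hitsPerPage;
    }

    // Extracts a {title, link} pair from every <item> of an RSS-style response.
    public static List<String[]> parseItems(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<String[]> results = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            results.add(new String[] { text(item, "title"), text(item, "link") });
        }
        return results;
    }

    // Returns the trimmed text of the first matching child element, or "".
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical host and path, for illustration only.
        System.out.println(buildUrl("http://localhost:8080/opensearch", "apache nutch", 10));

        // Hand-written sample standing in for a fetched response.
        String sample = "<rss version=\"2.0\"><channel><title>results</title>"
                + "<item><title>Nutch</title><link>http://example.org/nutch</link></item>"
                + "<item><title>Lucene</title><link>http://example.org/lucene</link></item>"
                + "</channel></rss>";
        for (String[] hit : parseItems(sample)) {
            System.out.println(hit[0] + " -> " + hit[1]);
        }
    }
}
```

Arbitrary key/value pairs beyond title and link would need extra
elements in the response, which probably means extending the servlet or
writing your own.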

> 3.  In another project, I want to use ndfs to store
> two+ distinct copies of a file, but I really don't
> want anything else to do with nutch on the project. 
> Is that possible?  Is there a clean break?  I want to
> make a list of servers, then have an api call that
> takes a file and stores 2+ copies across my servers,
> and an api call that reads a file, with appropriate
> failover.
> 
I think it is possible. Take a look at TestClient (not the best
name) to see how the API is used.
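
To make the shape of the two calls concrete, here is a rough sketch of a
"store N copies" / "read with failover" API. It does NOT use the NDFS
API: ReplicatedStore and its methods are hypothetical names, and local
directories stand in for the servers. It only illustrates the behaviour
described in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not how NDFS is implemented.
public class ReplicatedStore {
    private final List<Path> replicas; // each directory stands in for one server
    private final int copies;          // how many replicas receive each file

    public ReplicatedStore(List<Path> replicas, int copies) {
        if (copies < 2 || copies > replicas.size()) {
            throw new IllegalArgumentException("need 2 <= copies <= number of replicas");
        }
        this.replicas = replicas;
        this.copies = copies;
    }

    // Writes the data to `copies` consecutive replicas, starting at a
    // position derived from the file name so that load is spread out.
    public void store(String name, byte[] data) throws IOException {
        int start = Math.floorMod(name.hashCode(), replicas.size());
        for (int i = 0; i < copies; i++) {
            Path dir = replicas.get((start + i) % replicas.size());
            Files.createDirectories(dir);
            Files.write(dir.resolve(name), data);
        }
    }

    // Failover read: tries every replica in turn and returns the first
    // copy that can be read; fails only if no replica has the file.
    public byte[] read(String name) throws IOException {
        IOException last = null;
        for (Path dir : replicas) {
            try {
                return Files.readAllBytes(dir.resolve(name));
            } catch (IOException e) {
                last = e; // this replica is missing or unreadable, try the next
            }
        }
        throw new IOException("no replica holds " + name, last);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("replicas");
        ReplicatedStore store = new ReplicatedStore(
                Arrays.asList(root.resolve("s1"), root.resolve("s2"), root.resolve("s3")), 2);
        store.store("hello.txt", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.read("hello.txt"), StandardCharsets.UTF_8));
    }
}
```

A real setup also has to handle re-replication after a server is lost
and partial writes, which this sketch ignores.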

> 4.  Guessing I write a plugin, but I want to interject
> some code during the nutch crawl process that does
> some analysis and actually does the index insertion
> itself.  There any good docs on how to do such a
> thing?

Plugin writing is easy - you can take a look at one of the existing
ones as an example. But it all depends on what you want to change and
where: plugins are executed at well-specified points of processing, so
it may turn out that a plugin cannot do what you want to achieve.
Regards,
Piotr