Posted to user@nutch.apache.org by Daniel Lopez <D....@uib.es> on 2006/12/03 16:18:24 UTC

Using Nutch

Hi there,

I have just started playing with Nutch and have not yet decided whether it
would be appropriate, hence my questions. I already have experience with
Lucene in my own projects, so I think I could tweak it a bit. I browsed the
documentation I could find, the Wiki and the mail archives, and then thought
I would check with the people already using it to see if my impression is
correct. So, here we go:

.- I'm planning on using it on a single node to crawl/search our different
web servers, to provide a search facility inside our own pages, not for the
whole web. I read that the 0.7.x branch might be more appropriate, as 0.8.x
seemed to be more focused on multi-node sites and that might cause
performance problems. Is that still true? Should I stick to the 0.7.x
branch?

.- I would like to be able to crawl/index/search the documents using
specific analyzers, because the documents are in LATIN-1. I already applied
an appropriate analyzer in my programs, but I'm not sure whether Nutch allows
changing it easily, through some property, or whether I have to get into the
code and do it myself. I have no problem with that, but the less I deviate
from a standard Nutch installation, the better, I guess. The same goes for
the Indexer and the search options: I would like to use something other than
a Boolean query. Can those things be tweaked through properties?

.- Lastly, the search interface is not exactly what I want, and I'm also not
too keen on plain JSPs with the scripting inside. I thought I might as well
replicate the functionality using a framework we use, based on XML, so that
we have the UI and the rest separated... Are there any plans to develop the
search UI further, or should I simply look at the JSPs and replicate, more
or less, their behaviour? In that case, any special tips for that?

.- Does anyone using Nutch in a similar scenario have any special tips/advice?

Thanks for any insight you can provide. I do have plenty of experience with
Java on the server side and Open Source, but I'd rather not duplicate work
if I can help it, and I'd like to stick as close to the "standard" Nutch as
possible.

Cheers!
D.

Re: Using Nutch

Posted by Nitin Borwankar <ni...@borwankar.com>.

Hi Daniel,

A newbie to Nutch here myself.
Answers to some of your questions.

a) 0.7.2 for a single site or a small number of sites - yes, that's the
better way to go for now.
b) Re: Analyzer - not sure, I have not used anything custom.
c) Re: the webapp - it is completely independent of Nutch, which is the
crawler + indexer. The app is just meant for testing your setup easily,
IMHO.
    To use your own UI, take a look at the contents of search.jsp - you
want to use NutchBean and, if you are familiar with Lucene, the Hits class.
    From there you can take off on your own. I am pretty sure most
people using Nutch beyond the simple examples are using their own UI.
    You can also look at the OpenSearch servlet, which serves up results
in OpenSearch XML format, so you can completely decouple the index from
the UI.
    You can initiate search queries from PHP, Python, Ruby or whatever UI
you like, as long as you know how to wrap the XML in the style sheet of
your choice.
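
To make the OpenSearch route a bit more concrete, here is a minimal sketch
of the consuming side in Java, using only the JDK's DOM parser. It assumes
the servlet returns an RSS-style document with one <item> per hit; the
SAMPLE string, the class name OpenSearchResults and the Item holder are
made up for illustration and are not taken from Nutch or from a real
server response, so check the actual XML your servlet produces before
relying on the element names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchResults {

    /** One search hit, reduced to the fields a simple results page needs. */
    public static final class Item {
        public final String title;
        public final String link;
        public final String description;

        Item(String title, String link, String description) {
            this.title = title;
            this.link = link;
            this.description = description;
        }
    }

    // Hypothetical response body (an assumption, not captured from Nutch).
    public static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<rss version=\"2.0\"><channel>"
      + "<title>Search results</title>"
      + "<item><title>First page</title>"
      + "<link>http://www.example.org/a.html</link>"
      + "<description>Summary of the first page</description></item>"
      + "<item><title>Second page</title>"
      + "<link>http://www.example.org/b.html</link>"
      + "<description>Summary of the second page</description></item>"
      + "</channel></rss>";

    /** Parses the XML and returns one Item per <item> element. */
    public static List<Item> parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList nodes = doc.getElementsByTagName("item");
        List<Item> items = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            items.add(new Item(text(item, "title"),
                               text(item, "link"),
                               text(item, "description")));
        }
        return items;
    }

    // Returns the text of the first matching descendant, or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList children = parent.getElementsByTagName(tag);
        return children.getLength() == 0 ? "" : children.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        for (Item item : parse(SAMPLE)) {
            System.out.println(item.title + " -> " + item.link);
        }
    }
}
```

In a real setup the XML string would come from an HTTP request to wherever
the servlet is deployed, and the resulting list could be handed to whatever
templating or XML framework renders the page.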

Hope that helps -- sorry don't have more info on the Analyzer question.


Nitin Borwankar
http://tagschema.com

