Posted to user@nutch.apache.org by spamsucks <sp...@rhoderunner.com> on 2007/02/01 22:14:35 UTC

Implement crawler with custom lucene VS use nutch?

I posted this on the lucene list a week ago and haven't heard anything, so 
please don't give me the cross-post slap;)

I am successfully using lucene in our application to index 12 different
types of objects located in a database, and their relationships to each
other to provide some nice search functionality for our website.  We are
building lots of lucene queries programmatically to filter based upon
categories, regions, zip codes, scoring, long/lats...
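The programmatic filtering described above can be done through the BooleanQuery API or by assembling a string in Lucene's query syntax. A minimal sketch of the string-building approach, with made-up field names (category, region, zip) purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: assemble a Lucene query-syntax string from optional filters.
// Field names (category, region, zip) are invented for illustration.
public class QueryBuilderSketch {

    static String buildQuery(String category, String region, String zip) {
        List<String> clauses = new ArrayList<String>();
        if (category != null) clauses.add("category:" + category);
        if (region != null)   clauses.add("region:" + region);
        if (zip != null)      clauses.add("zip:" + zip);
        // Every supplied filter is required, so join with AND.
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("boats", "northeast", null));
    }
}
```

The same clauses could be built as TermQuery objects inside a BooleanQuery instead; the string form is just easier to log and debug.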

My problem is that we have a lot of content (3000+ pages) that is not in the
database, and it needs to show up in the search results too.  It's a whole
lot of JSPs.

As I see it, I can either
a) Migrate this application to Nutch, or
b) Write a web crawler to crawl our site and inject the crawl results into
our existing Lucene index.

I am leaning towards option (b), since I think it would only take me a
couple of days to implement a simple crawler, and I wouldn't have to change
much else.

Can anyone think of any points/counterpoints for using Nutch vs. writing a
crawler to extend our existing Lucene framework?

Thanks. 
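For option (b), the core of a "couple of days" crawler is a fetch queue plus link extraction and tag stripping; the rest (politeness, dedup, scheduling) is bookkeeping. A rough sketch of the parsing half, using only regexes; a real crawler would want a proper HTML parser, and the example page is invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the parsing half of a minimal crawler: pull href targets out of
// a fetched page, and strip markup so the remaining text can go to Lucene.
// Regexes are a rough approximation; a real crawler should use an HTML parser.
public class CrawlerSketch {

    private static final Pattern HREF =
        Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect the href targets on a page; these would be queued for fetching.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Strip tags and collapse whitespace; the result is the text to index.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/about.jsp\">About</a> us</body></html>";
        System.out.println(extractLinks(page)); // [/about.jsp]
        System.out.println(stripTags(page));    // About us
    }
}
```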



Re: Implement crawler with custom lucene VS use nutch?

Posted by "Markus N." <sp...@yahoo.de>.
Maybe "regain" could be a solution for you?

http://regain.sourceforge.net/?lang=en

Regards 
Markus




RE: Implement crawler with custom lucene VS use nutch?

Posted by Iain <ia...@idcl.co.uk>.
Gospodnetic and Hatcher's book ("Lucene in Action") includes some code which will process HTML.

If you already know what the html files are (you have access to the web site
as a file system), then you should be able to knock something up quickly.

If you have to extract the links and crawl that way, it's probably easier to
use Nutch and, if necessary, reprocess the indexes into your own format.
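If the JSPs really are reachable on disk, the first approach amounts to walking the directory, stripping markup, and handing each file's text to the existing indexer. A sketch of the walk-and-strip part; the docroot name is an assumption, and the actual Lucene indexing call is left out:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: walk a webapp directory, strip markup from each .jsp, and print
// the text that would be handed to the existing Lucene indexer.
// The default directory name "webapp" is an assumption; pass the real docroot.
public class JspWalkSketch {

    // Strip tags and collapse whitespace; the result is the text to index.
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "webapp");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".jsp"))
                 .forEach(p -> {
                     try {
                         String text = stripTags(new String(Files.readAllBytes(p)));
                         // A real version would add a Lucene Document holding
                         // the path and text here, instead of printing.
                         System.out.println(p + ": " + text);
                     } catch (IOException e) {
                         System.err.println("skipping " + p + ": " + e);
                     }
                 });
        }
    }
}
```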

IAIN

---------------
Iain Downs (Microsoft MVP)
Commercial Software Therapist
E:  iain@idcl.co.uk     T:+44 (0) 1423 872988
W: www.idcl.co.uk
http://mvp.support.microsoft.com