You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Tuan Jean Tee <tu...@minterellison.com> on 2004/04/22 05:26:44 UTC

what web crawler work best with Lucene?

Have anyone implemented any open source web crawler with Lucene? I have
a dynamic website and are looking at putting in a search tools. Your
advice is very much appreciated.

Thank you.


IMPORTANT -

This email and any attachments are confidential and may be privileged in which case neither is intended to be waived. If you have received this message in error, please notify us and remove it from your system. It is your responsibility to check any attachments for viruses and defects before opening or sending them on. Where applicable, liability is limited by the Solicitors Scheme approved under the Professional Standards Act 1994 (NSW). Minter Ellison collects personal information to provide and market our services. For more information about use, disclosure and access, see our privacy policy at www.minterellison.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: what web crawler work best with Lucene?

Posted by Michael Wechner <mi...@wyona.com>.
Tuan Jean Tee wrote:

>Have anyone implemented any open source web crawler with Lucene? I have
>a dynamic website and are looking at putting in a search tools. Your
>advice is very much appreciated.
>  
>
there is a crawler included within Apache Lenya 
http://cocoon.apache.org/lenya/

src/java/org/apache/lenya/search/crawler/*

or you might try LARM

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html


HTH

Michi


>Thank you.
>
>
>IMPORTANT -
>
>This email and any attachments are confidential and may be privileged in which case neither is intended to be waived. If you have received this message in error, please notify us and remove it from your system. It is your responsibility to check any attachments for viruses and defects before opening or sending them on. Where applicable, liability is limited by the Solicitors Scheme approved under the Professional Standards Act 1994 (NSW). Minter Ellison collects personal information to provide and market our services. For more information about use, disclosure and access, see our privacy policy at www.minterellison.com.
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>  
>


-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com              http://cocoon.apache.org/lenya/
michael.wechner@wyona.com                        michi@apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: what web crawler work best with Lucene?

Posted by Sebastian Ho <se...@bii.a-star.edu.sg>.
U can also use the html parser API available from Java 1.4.2. I tried it
last week with a simple program which retrieve html files and displaying
all HREF links in it. I done it within a day.

sebastian


On Thu, 2004-04-22 at 12:18, Stephane James Vaucher wrote:
> How big is the site?
> 
> I mostly use an inhouse solution, but I've used HttpUnit for web scrapping
> small sites (because of its high-level api).
> 
> Here is a hello world example:
> http://wiki.apache.org/jakarta-lucene/HttpUnitExample
> 
> For a small/simple site, small modifications to this class could suffice.
> IT WILL NOT function on large sites because of memory problems.
> 
> For larger sites, there are questions like:
> 
> - memory:
> For example, spidering all links on every page can lead to visiting too
> many links. Keeping all visited links in memory can be problematic
> 
> - noise
> If you get every page on your web site, you might be adding noise to the
> search engine. Spider navigation rules can help out, like saying that you
> should only follow links/index documents of a specific form like
> www.mysite.com/news/article.jsp?articleid=xxx
> 
> - speed:
> Too much speed can be bad if you doing 100 hits/sec on a site could hurt
> it (especially if it's not you who are the webmaster)
> Too little speed can be bad if you want to make sure you quickly get new
> pages.
> 
> - categorisation:
> You might want to separate information in your index. For example, you
> might want a user to do a search in the documentation section or in the
> press release section. This categorisation can be done by specifying
> sections to the site, or a subsequent analysis of available docs.
> 
> -up-to-date information
> You'll want to think of your update schedule, so that if you add a new
> page, it gets indexed quickly. This problem also occurs when you modify an
> existing page, you might want the modification to be detected rapidly.
> 
> HTH,
> sv
> 
> On Thu, 22 Apr 2004, Tuan Jean Tee wrote:
> 
> > Have anyone implemented any open source web crawler with Lucene? I have
> > a dynamic website and are looking at putting in a search tools. Your
> > advice is very much appreciated.
> >
> > Thank you.
> >
> >
> > IMPORTANT -
> >
> > This email and any attachments are confidential and may be privileged in
> > which case neither is intended to be waived. If you have received this
> > message in error, please notify us and remove it from your system. It is
> > your responsibility to check any attachments for viruses and defects
> > before opening or sending them on. Where applicable, liability is
> > limited by the Solicitors Scheme approved under the Professional
> > Standards Act 1994 (NSW). Minter Ellison collects personal information
> > to provide and market our services. For more information about use,
> > disclosure and access, see our privacy policy at www.minterellison.com.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: what web crawler work best with Lucene?

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
How big is the site?

I mostly use an inhouse solution, but I've used HttpUnit for web scrapping
small sites (because of its high-level api).

Here is a hello world example:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample

For a small/simple site, small modifications to this class could suffice.
IT WILL NOT function on large sites because of memory problems.

For larger sites, there are questions like:

- memory:
For example, spidering all links on every page can lead to visiting too
many links. Keeping all visited links in memory can be problematic

- noise
If you get every page on your web site, you might be adding noise to the
search engine. Spider navigation rules can help out, like saying that you
should only follow links/index documents of a specific form like
www.mysite.com/news/article.jsp?articleid=xxx

- speed:
Too much speed can be bad if you doing 100 hits/sec on a site could hurt
it (especially if it's not you who are the webmaster)
Too little speed can be bad if you want to make sure you quickly get new
pages.

- categorisation:
You might want to separate information in your index. For example, you
might want a user to do a search in the documentation section or in the
press release section. This categorisation can be done by specifying
sections to the site, or a subsequent analysis of available docs.

-up-to-date information
You'll want to think of your update schedule, so that if you add a new
page, it gets indexed quickly. This problem also occurs when you modify an
existing page, you might want the modification to be detected rapidly.

HTH,
sv

On Thu, 22 Apr 2004, Tuan Jean Tee wrote:

> Have anyone implemented any open source web crawler with Lucene? I have
> a dynamic website and are looking at putting in a search tools. Your
> advice is very much appreciated.
>
> Thank you.
>
>
> IMPORTANT -
>
> This email and any attachments are confidential and may be privileged in
> which case neither is intended to be waived. If you have received this
> message in error, please notify us and remove it from your system. It is
> your responsibility to check any attachments for viruses and defects
> before opening or sending them on. Where applicable, liability is
> limited by the Solicitors Scheme approved under the Professional
> Standards Act 1994 (NSW). Minter Ellison collects personal information
> to provide and market our services. For more information about use,
> disclosure and access, see our privacy policy at www.minterellison.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org