Posted to java-user@lucene.apache.org by Friaa Nafaa <fr...@excite.com> on 2002/11/04 11:49:42 UTC
Indexing distant web sites
Hello, is there any way to index a web site with Lucene, assuming we know only the URL of the site? For local use, we pass Lucene the full directory tree of our site (containing all the documents) and start the indexing operation. But what should I do when I want to index a remote site on the web? For example, I have installed Lucene on my computer and I would like to index the site http://www.excite.com. Thanks
_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!
Re: Indexing distant web sites
Posted by Karl Marx <ka...@gan.no>.
As stated in the official FAQ, Lucene doesn't implement a web crawler.
You can, however, write your own crawler or customize a crawler
framework like WebSPHINX (http://www-2.cs.cmu.edu/~rcm/websphinx/) to
retrieve HTML documents from a site and then feed them to Lucene.
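
To make the "write your own crawler" option concrete, here is a minimal sketch in plain Java. It is not taken from Lucene or WebSPHINX: the class name SiteCrawler, its methods, and the fetcher/sink callbacks are all hypothetical names chosen for illustration. It assumes a breadth-first crawl restricted to the start URL's host, and it uses a naive regular expression to find links, where a real crawler would use an HTML parser and respect robots.txt.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Lucene or WebSPHINX. */
final class SiteCrawler {
    // Naive href matcher (assumption: good enough for a sketch).
    // The fragment part after '#' is deliberately left out of the match.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Resolves every href found in the page against the page's own URL. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl limited to the host of startUrl.
     * fetcher returns the HTML for a URL, or null if it cannot be fetched.
     * sink receives (url, html); this is where a page would be handed to the indexer.
     * Returns the number of pages passed to sink.
     */
    static int crawl(String startUrl, int maxPages,
                     Function<String, String> fetcher,
                     BiConsumer<String, String> sink) {
        String host = URI.create(startUrl).getHost();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        seen.add(startUrl);
        int fetched = 0;
        while (!queue.isEmpty() && fetched < maxPages) {
            String url = queue.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // unreachable page: skip it, keep crawling
            }
            fetched++;
            sink.accept(url, html);
            for (String link : extractLinks(url, html)) {
                String linkHost = URI.create(link).getHost();
                // Follow only links on the same host that were not queued before.
                if (host != null && host.equalsIgnoreCase(linkHost) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return fetched;
    }
}
```

In such a setup the fetcher could be built on java.net.URL.openStream(), and the sink is the place to turn each page into a Lucene Document and pass it to IndexWriter.addDocument. The exact field-building calls differ between Lucene versions, so that step is left as a callback here.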
Regards, Karl Øie
On Monday, Nov 4, 2002, at 11:49 Europe/Oslo, Friaa Nafaa wrote:
> Hello, is there any way to index a web site with Lucene, assuming we
> know only the URL of the site? For local use, we pass Lucene the full
> directory tree of our site (containing all the documents) and start
> the indexing operation. But what should I do when I want to index a
> remote site on the web? For example, I have installed Lucene on my
> computer and I would like to index the site http://www.excite.com.
> Thanks