Posted to solr-user@lucene.apache.org by Gena Batsyan <gb...@gmail.com> on 2009/06/03 12:09:36 UTC

indexing/crawling HTML + solr

Hi!

In short: where should I start with this subject?

Any pointers to some [semi-]functional solutions that crawl the web like a 
normal crawler, take care of HTML parsing, etc., and feed the crawled 
content to Solr as documents via <add>?

regards!



Re: indexing/crawling HTML + solr

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Gena,

Besides Droids (simpler, smaller components you can put together), there is also Nutch, a bigger beast for large-scale crawling that indexes crawled pages into Solr: http://lucene.apache.org/nutch
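
To give a rough idea of the "parse HTML and feed it as <add>" step on its own, here is a minimal sketch in Python using only the standard library. It is not how Droids or Nutch do it. The field names (id, title, text) are assumptions and have to match whatever your schema defines, and html_to_solr_add is just a hypothetical helper name. Fetching the pages and following links is left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    SKIP = {"script", "style"}  # content of these tags is not indexed

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def html_to_solr_add(url, html):
    """Builds a Solr XML <add> message for one page (hypothetical helper).

    The field names id/title/text are assumptions, not a fixed Solr API;
    they must exist in your schema.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the indexed text is compact.
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", url), ("title", title), ("text", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
    print(html_to_solr_add("http://example.com/", page))
```

The resulting XML would then be POSTed to Solr's update handler and followed by a commit. A real crawler also has to deal with robots.txt, politeness delays, character encodings and duplicate URLs, which is what the projects above take care of.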

Otis




Re: indexing/crawling HTML + solr

Posted by Olivier Dobberkau <ol...@dkd.de>.
Hi

Have a look at the Droids project in the Incubator.

Olivier

Sent from my iPhone

