Posted to solr-user@lucene.apache.org by Gena Batsyan <gb...@gmail.com> on 2009/06/03 12:09:36 UTC
indexing/crawling HTML + solr
Hi!
In short: where should I start with this subject?
Are there any pointers to [semi-]functional solutions that crawl the web like a
normal crawler, take care of HTML parsing, etc., and feed the crawled
content to Solr as documents via <add>?
regards!
Re: indexing/crawling HTML + solr
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Gena,
Besides Droids (simpler, smaller components you can put together), there is also Nutch, a bigger beast for large-scale crawling that indexes crawled pages into Solr: http://lucene.apache.org/nutch
Otis
Re: indexing/crawling HTML + solr
Posted by Olivier Dobberkau <ol...@dkd.de>.
Hi
Have a look at the Droids project in the Incubator.
Olivier
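
For anyone who wants to see what "feeding crawled pages to Solr per <add>" looks like before picking a crawler, here is a minimal sketch in Python using only the standard library. It is not how Droids or Nutch do it internally. The field names (id, title, text) and the update URL http://localhost:8983/solr/update are assumptions and have to match your own schema and installation; PageText, build_add_xml and post_to_solr are hypothetical helper names made up for this example.

```python
from html.parser import HTMLParser
import urllib.request
import xml.etree.ElementTree as ET


class PageText(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())


def build_add_xml(url, html):
    """Turn one fetched page into a Solr <add><doc>...</doc></add> message."""
    parser = PageText()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumed field names; they must exist in your schema.xml.
    fields = (
        ("id", url),
        ("title", parser.title.strip()),
        ("text", " ".join(parser.chunks)),
    )
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


def post_to_solr(add_xml, update_url="http://localhost:8983/solr/update"):
    """POST the XML to Solr's update handler (URL is an assumption)."""
    request = urllib.request.Request(
        update_url,
        data=add_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be roughly post_to_solr(build_add_xml(url, html)) for each fetched page, followed by posting "<commit/>" to the same URL so the documents become searchable. Fetching, link extraction, robots.txt handling and politeness delays are the part a real crawler such as Droids or Nutch takes care of.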