Posted to solr-user@lucene.apache.org by Frank van Lingen <fr...@vanlingen.name> on 2010/01/25 21:55:22 UTC

solr application for website crawling and indexing html, pdf, word, ... files

I recently started working with Solr and find it easy to set up and tinker with.

I now want to scale up my setup and was wondering if there is an
application/component that can do the following (I was not able to
find documentation on this on the Solr site):

-Can I send Solr an XML document containing a URL (pointing to an
HTML, PDF, Word, PPT, etc. file) and have Solr analyze and index it?
(Can it analyze PDF and other document types?) Solr would use some
generic basic fields, like header and content, when analyzing the files.

-Can I send Solr a site URL and have it index the whole site?

If the answer to the above is yes, are there some examples? If the
answer is no, is there a simple (basic) extractor for HTML, PDF,
Word, etc. files that would translate them into a basic XML document
(e.g. with field names like url, header, and content) that Solr can
ingest, or preferably an application that does this for a whole site?

The idea is to configure Solr for generic indexing and search of a website.

Frank.

Re: solr application for website crawling and indexing html, pdf, word, ... files

Posted by mike anderson <sa...@gmail.com>.
I think you might be looking for Apache Tika.
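
As a quick way to see what Tika extracts from a file before wiring
anything into Solr, its runnable tika-app jar works from the command
line. A minimal sketch, assuming a local copy of the jar (the version
number is a placeholder for whatever you download) and some
document.pdf:

  java -jar tika-app-0.6.jar --metadata document.pdf
  java -jar tika-app-0.6.jar --text document.pdf

The first call prints the metadata Tika detects (content type, title,
and so on); the second prints the extracted plain text.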



Re: solr application for website crawling and indexing html, pdf, word, ... files

Posted by Markus Jelsma <ma...@buyways.nl>.
Hello Frank,

Answers are inline:

Frank van Lingen said:
> I recently started working with Solr and find it easy to set up and
> tinker with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to find
> documentation on this on the Solr site):
>
> -Can I send Solr an XML document containing a URL (pointing to an
> HTML, PDF, Word, PPT, etc. file) and have Solr analyze and index it?
> (Can it analyze PDF and other document types?) Solr would use some
> generic basic fields, like header and content, when analyzing the files.

Yes, you can! Solr integrates with Tika [1], yet another Apache
Lucene project, which lets it extract and index many different file
formats. Please see the Solr Cell wiki for more information [2].
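
As a minimal sketch, assuming a stock Solr 1.4 install where the
extraction handler is mapped at /update/extract (as in the example
solrconfig.xml); doc1 and document.pdf are placeholders:

  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=content&commit=true' \
    -F 'myfile=@document.pdf'

Tika detects the file type and extracts the text; literal.* sets
fields directly, uprefix catches unmapped metadata fields, and fmap.*
renames extracted fields to match your schema. If remote streaming is
enabled in solrconfig.xml, you can pass a stream.url parameter
instead of uploading the file, which covers the "send Solr a URL"
part of your question.
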
>
> -Can I send Solr a site URL and have it index the whole site?

No, you can't, but there is yet another fine Apache Lucene project
for that, called Nutch [3]. It offers a very convenient API and is
very flexible. Since version 1.0, Nutch can push its crawled content
directly into a running Solr index, and together with Tika you can
index almost anything you want with very little effort.
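
A rough sketch of that workflow, assuming Nutch 1.0 with a seed URL
list in a urls/ directory and Solr running on its default port:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

The first command fetches and parses the site down to depth 3; the
second pushes the parsed documents into Solr (your Solr schema needs
the fields Nutch emits; see [4] for the details).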

There is a guide to running Nutch and Solr together on the wiki [4];
also, our friends at Lucid Imagination have written a very decent
article on this subject [5]. You should find what you're looking for.

Cheers



[1]: http://lucene.apache.org/tika/index.html
[2]: http://wiki.apache.org/solr/ExtractingRequestHandler
[3]: http://lucene.apache.org/nutch/
[4]: http://wiki.apache.org/nutch/RunningNutchAndSolr
[5]: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/