Posted to solr-user@lucene.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2007/07/03 20:56:23 UTC

Indexing HTML and other doc types

Solr looks very good for indexing and searching structured data.
But I noticed there is no tool in the Solr distribution with which documents
of other doc types can be indexed.  Are there other side projects that 
develop Solr clients for indexing documents of other doc types?

Or is generic full-text search really the wrong area in which to apply Solr,
and should I be using something like Nutch instead?
-kuro 
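
As an illustration of what a client of this kind involves, here is a minimal
Python sketch. It is not part of the Solr distribution and handles HTML only:
it strips the markup and wraps the remaining text in a Solr XML <add> message.
The function names and the field names "id" and "text" are assumptions made
for this example; the field names have to match whatever your schema.xml
defines.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_add_message(doc_id, html, id_field="id", text_field="text"):
    """Build a Solr XML <add> message for one HTML document.

    id_field and text_field are hypothetical defaults; use the field names
    from your own schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name=id_field).text = doc_id
    ET.SubElement(doc, "field", name=text_field).text = html_to_text(html)
    # ElementTree escapes &, < and > in the field values.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b></p></body></html>"
    print(build_add_message("doc1", page))
```

The resulting message would then be POSTed to the Solr update URL of your
installation and followed by a <commit/>. Binary formats such as PDF or Word
need a real text extractor in place of html_to_text.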

Re: Indexing HTML and other doc types

Posted by Peter Manis <ma...@digital39.com>.
A coworker of mine posted the code that we used for adding PDF, DOC, XLS,
etc. documents to Solr.  You can find the files at the following location:

https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Just apply the patch, put the lib files in the lib directory, run `ant
compile`, and you should be good to go.  If the build fails, update to
revision 552853; that is the latest revision I have compiled with the
patch, so I know it works.  Usually, if the build fails, the cause is
something unrelated to Eric's code and will be fixed within the next few
revisions.


Peter Manis

On 7/3/07, Teruhiko Kurosaka <Ku...@basistech.com> wrote:
>
> Solr looks very good for indexing and searching structured data.
> But I noticed there is no tool in the Solr distribution with which
> documents
> of other doc types can be indexed.  Are there other side projects that
> develop Solr clients for indexing documents of other doc types?
>
> Or is the generic full-text search really a wrong area to apply Solr, and
> should I be using something like Nutch?
> -kuro
>