Posted to solr-user@lucene.apache.org by Bruno <br...@objectconsulting.com.au> on 2006/08/30 08:42:39 UTC

document support for file system crawling

Hi there,

Browsing through the message threads, I tried to find one addressing file
system crawls. I want to implement an enterprise search over a networked
filesystem, crawling all sorts of documents, such as html, doc, ppt and pdf.
Nutch provides plugins that enable it to read proprietary formats.
Is there support for the same functionality in Solr?

Bruno


Re: document support for file system crawling

Posted by Chris Hostetter <ho...@fucit.org>.
: the text out of these types of documents.  You could borrow the
: document parsing pieces from Lucene's contrib and Nutch and glue them
: together into your client that speaks to Solr, or perhaps Solr isn't
: the right approach for your needs?   It certainly is possible to add
: these capabilities into Solr, but it would be awkward to have to
: stream binary data into XML documents such that Solr could parse them
: on the server side.

Agreed.  Solr's focus is on indexing "Structured Data".  The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data.  If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index
then Nutch is probably your best bet.
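
To make the "path,title,content" idea concrete, here is a minimal Python
sketch of that kind of crawl. It is not how Nutch does it: the names
TextExtractor, extract_html and crawl are made up for illustration, it
uses only the Python standard library, and it handles only HTML and
plain-text files. Formats such as doc, ppt and pdf would need a real
parser, which is exactly what the Nutch plugins provide.

```python
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def extract_html(markup):
    """Return (title, content) for an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    content = " ".join(" ".join(parser.text_parts).split())
    return title, content


def crawl(root):
    """Yield one path/title/content record per HTML or text file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext in (".html", ".htm"):
                with open(path, encoding="utf-8", errors="replace") as f:
                    title, content = extract_html(f.read())
            elif ext == ".txt":
                with open(path, encoding="utf-8", errors="replace") as f:
                    title, content = name, " ".join(f.read().split())
            else:
                continue  # doc, ppt, pdf etc. are skipped in this sketch
            # Fall back to the file name when the page has no <title>.
            yield {"path": path, "title": title or name, "content": content}
```

Each record from crawl() is a flat dictionary of strings, so it can be
handed to whatever indexer you end up using.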


-Hoss


Re: document support for file system crawling

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 30, 2006, at 2:42 AM, Bruno wrote:
> browsing through the message thread I tried to find a trail addressing
> file system crawls. I want to implement an enterprise search over a
> networked filesystem, crawling all sorts of documents, such as html,
> doc, ppt and pdf.
> Nutch provides plugins enabling it to read proprietary formats.
> Is there support for the same functionality in solr?

No.  Solr is strictly a search server that takes plain text for the
fields of documents added to it.  The client is responsible for parsing
the text out of these types of documents.  You could borrow the
document parsing pieces from Lucene's contrib and Nutch and glue them
together into your client that speaks to Solr, or perhaps Solr isn't
the right approach for your needs?  It certainly is possible to add
these capabilities into Solr, but it would be awkward to have to
stream binary data into XML documents such that Solr could parse them
on the server side.
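
As a rough illustration of the client side, here is a minimal Python
sketch that turns already-extracted plain text into an XML <add> message
of the kind Solr accepts. The function name build_add_message and the
field names in the usage note are hypothetical; real field names have to
match your schema. Extracting the text from doc, ppt or pdf files is
assumed to have happened elsewhere in the client.

```python
import re
import xml.etree.ElementTree as ET

# Control characters and noncharacters that XML 1.0 does not allow,
# even when escaped. Extracted text sometimes contains them.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def build_add_message(docs):
    """Serialize an iterable of {field_name: text} dicts as one <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            # Replace illegal characters; ElementTree escapes <, > and &.
            field.text = _INVALID_XML_CHARS.sub(" ", value)
    return ET.tostring(add, encoding="unicode")
```

For example, build_add_message([{"path": "/a.txt", "content": "x < y"}])
produces a single <add> element with one <doc> and two <field> children.
The client would then POST that string to the update URL of its Solr
installation and follow it with a <commit/> message.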

	Erik