You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Arne Muller <ar...@gmail.com> on 2007/07/26 14:03:44 UTC

Solr newbe

Hello,

I've just started with Lucene to index a file server and aiming to index
lotus notes and some tables from relational databases.

After some research, I came (so far) to the conclusion that I'm re-inventing
the wheel, and that it may be better to use solr or nutch as lucene
front-ends.

I was wondering if the following workflow is kind of state of the art, or
whether you've something to comment or add:

Since there web-crawling is not necessary (my documents are on a file
server), I was thinking of using solr and to implement a basic program that
traverses the file system as a scheduled task (or cron job). The program
will not "crawl", i.e. it will not follow any references within the
documents to continue its search elsewhere. It just processes each found
file, and if the file's modification date is newer than 24h it adds it to
solr.

Most of the files are doc, xls, ppt and pdf. My "traverser" will therefore
use apache POI and PDFBox to extract the contents from the documents,
creating an appropriate XML stream with the solr "add" tag and sending it to
solr.

I was wondering if there are already tools for this kind of task available
(this time I'd like to avoid reinventing the wheel ;-)?

   thanks a lot for your help,
   +kind regards,

  Arne

RE: Solr newbe

Posted by Darren Hartford <dh...@ghsinc.com>.
One side-note is various content management tools already handle a lot
of data extraction (POI/PDFBox/etc).

In the case of Jakarta Slide and Apache Jackrabbit, both use Lucene
under the covers to index this data.

Not sure if you want to take the approach of putting your documents as
'managed' under a content management tool, but that's another avenue to
investigate.

-D 

> -----Original Message-----
> From: Arne Muller [mailto:arne.muller@gmail.com] 
> Sent: Thursday, July 26, 2007 8:04 AM
> To: Lucene
> Subject: Solr newbe
> 
> Hello,
> 
> I've just started with Lucene to index a file server and 
> aiming to index lotus notes and some tables from relational databases.
> 
> After some research, I came (so far) to the conclusion that 
> I'm re-inventing the wheel, and that it may be better to use 
> solr or nutch as lucene front-ends.
> 
> I was wondering if the following workflow is kind of state of 
> the art, or whether you've something to comment or add:
> 
> Since there web-crawling is not necessary (my documents are 
> on a file server), I was thinking of using solr and to 
> implement a basic program that traverses the file system as a 
> scheduled task (or cron job). The program will not "crawl", 
> i.e. it will not follow any references within the documents 
> to continue its search elsewhere. It just processes each 
> found file, and if the file's modification date is newer than 
> 24h it adds it to solr.
> 
> Most of the files are doc, xls, ppt and pdf. My "traverser" 
> will therefore use apache POI and PDFBox to extract the 
> contents from the documents, creating an appropriate XML 
> stream with the solr "add" tag and sending it to solr.
> 
> I was wondering if there are already tools for this kind of 
> task available (this time I'd like to avoid reinventing the wheel ;-)?
> 
>    thanks a lot for your help,
>    +kind regards,
> 
>   Arne
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org