Posted to java-user@lucene.apache.org by Scott Purcell <sp...@vertisinc.com> on 2005/03/01 21:47:25 UTC

Investigating Lucene For Project

I am looking for a solution to a problem I am having. We have a web-based asset management solution where we manage customers' assets.
 
We have had requests from some clients who would like the ability to "index" PDF files now, and possibly other text files in the future. The PDF files live on a server in a structured environment. I would like to index the content inside each PDF and run searches on that information from a web form. The result MUST BE a text snippet (that is, some text before and after the searched word).
Does this make sense? And can Lucene do this?
 
If the product can do this, what is the best way to get rolling on a project of this nature? Purchase an example book, or are there simple examples one can pick up on? Does Lucene have a steep learning curve, or is it reasonably quick to learn?
 
If all the above will work, what kind of license does this require? I have not been able to find a link to that yet on the jakarta site.
 
I sincerely appreciate any input into this.
 
Sincerely
Scott 
 

Re: Investigating Lucene For Project

Posted by Ben Litchfield <be...@csh.rit.edu>.
See inline comments below.

> We have had requests from some clients who would like the ability to
> "index" PDF files now, and possibly other text files in the future. The
> PDF files live on a server in a structured environment. I would like to
> index the content inside each PDF and run searches on that information
> from a web form. The result MUST BE a text snippet (that is, some text
> before and after the searched word). Does this make sense? And can
> Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that.  PDFBox also
provides a summary of the document, which is just the first x
characters.  If you want a smarter summary, you will need to create that
yourself.
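
As a rough illustration of the "create that yourself" option, a snippet
with some text before and after the searched word can be cut from the
extracted text using plain Java.  This is only a sketch under stated
assumptions: it is not part of Lucene or PDFBox, it assumes the PDF text
has already been extracted into a String, and the names SnippetSketch,
snippet, and context are hypothetical, chosen for illustration.

```java
public class SnippetSketch {

    // Finds the first case-insensitive occurrence of term in text, or -1.
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the first match of term plus up to `context` characters on
     * each side, with "..." marking truncated ends. Returns null when the
     * term does not occur in the text.
     */
    public static String snippet(String text, String term, int context) {
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            return null;
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);

        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");          // text was cut off on the left
        }
        sb.append(text, start, end);   // the match and its surroundings
        if (end < text.length()) {
            sb.append("...");          // text was cut off on the right
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text that would come from the PDF conversion step.
        String text = "Lucene indexes text documents, so a PDF must be converted first.";
        System.out.println(snippet(text, "pdf", 12));
    }
}
```

This cuts on character counts, so a snippet may begin or end in the
middle of a word; trimming to word boundaries would be a natural next
step.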

> If the product can do this, what is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a steep learning curve, or
> is it reasonably quick to learn?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for Lucene, but it
sounds like your requirements are pretty basic, so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben
