
Posted to java-user@lucene.apache.org by jw...@dicarta.com on 2006/02/09 03:52:03 UTC

Build vs. Buy?

I'm trying to upgrade our search functionality (currently, RTF/text
only, and exact phrase match only) at my company, and have run into some
concerns.  Our 4 main formats are:

RTF - javax.swing looks fine, we use those classes already.

MS Word - I know that POI exists, but development on the Word portion
seems to have stopped, and there are a lot of nasty-looking bugs in
their bug database.  Since we deal with contracts, many of our Word
files are large and complicated.  What has everyone's experience with
POI's Word parsing been?

PDF - Looks like PDFBox has memory issues.  Frankly, this only matters
during indexing, so it is minor, but still a concern.

WordPerfect - There don't seem to be any converters for this format?

I would hate to have to recommend that my boss shell out $10k to $25k
(or more!) in licensing fees for a commercial search engine just because
the commercial ones can parse our files and I can't.  Still, that is
cheaper than dedicating two engineers for 6 months to writing parsers
for Word, PDF and WordPerfect if we go with Lucene (frankly, there's
less risk too, considering how complicated parsing would be).  I know
that Lucene doesn't deal with file formats, but the basic fact is that
to use Lucene you have to hand it text strings, and there's no way to
get those without dealing with file formats.
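
For the RTF case, the javax.swing route mentioned above can be sketched
roughly as follows.  This is a minimal sketch, not a tested production
extractor: the class name RtfTextExtractor and its extract method are
made up for illustration, and it assumes the whole document fits in
memory.  It only shows how RTFEditorKit can turn RTF into the plain
string that Lucene needs.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper (name chosen for this example only).
final class RtfTextExtractor {

    private RtfTextExtractor() {
    }

    /** Parses RTF from the reader and returns its plain-text content. */
    static String extract(Reader rtf) throws IOException {
        RTFEditorKit kit = new RTFEditorKit();
        // StyledEditorKit's default document is a styled document,
        // which is what the RTF reader populates.
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(rtf, doc, 0);
            // Formatting lives in attributes; getText returns only the characters.
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            // Not expected when inserting at offset 0 of an empty document.
            throw new IOException("Could not read RTF content", e);
        }
    }
}
```

The returned string could then be handed to a Lucene field.  Word, PDF
and WordPerfect would each need their own extractor behind the same kind
of method, which is where the real effort lies.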

What experience do people on the list have with implementing parsers
for anything beyond text, HTML and XML?

Thanks for any insights,

Jeff Wang

diCarta, Inc.

Re: Build vs. Buy?

Posted by jian chen <ch...@gmail.com>.

To read Word documents as text, you can try AntiWord.

I have written a simplified version of Lucene that does max-words matching.

For example, if you search for aa, bb, cc, then a document that contains
all three words (aa, bb, cc) will always rank higher than documents that
contain only two of them (aa, bb; aa, cc; or bb, cc).
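
The ranking rule described above can be illustrated with a small sketch.
This is not the code mentioned in this message (which is not shown in
the thread) and it does not use Lucene; the class MaxWordsMatch and its
methods are hypothetical names, and the tokenization (lowercase, split
on non-word characters) is an assumption.  It simply orders documents by
how many distinct query words they contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of "max words match" ranking.
final class MaxWordsMatch {

    private MaxWordsMatch() {
    }

    /** Counts how many distinct query terms occur in the text, ignoring case. */
    static int matchedTerms(String text, Set<String> queryTerms) {
        // Assumed tokenization: lowercase, split on runs of non-word characters.
        Set<String> tokens = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        int count = 0;
        for (String term : queryTerms) {
            if (tokens.contains(term.toLowerCase(Locale.ROOT))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns a copy of docs ordered so that documents matching more
     * query terms come first. The sort is stable, so ties keep their
     * original relative order. Documents with no match are kept, last.
     */
    static List<String> rank(List<String> docs, Set<String> queryTerms) {
        List<String> ranked = new ArrayList<>(docs);
        ranked.sort(Comparator
                .comparingInt((String d) -> matchedTerms(d, queryTerms))
                .reversed());
        return ranked;
    }
}
```

With the query words aa, bb, cc, a document containing all three is
placed ahead of one containing only aa and bb, which in turn is placed
ahead of one containing only cc.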

I am going to put up the code as open source. If you are interested, you can
email me directly.

Jian


Re: Build vs. Buy?

Posted by "P. Alex. Salamanca R." <al...@gmail.com>.

On the other hand, if you want the cheapest option, why not give the
Google Search Appliance a chance?

RE: Build vs. Buy?

Posted by Gwyn Carwardine <gw...@carwardine.net>.

Have you considered running the .NET version (dotLucene)? The converters for
Office and PDF are freely available, and there is a cheap commercial IFilter
available for WordPerfect files (and many others).

-Gwyn
