Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/02/17 14:53:11 UTC

ppt text extraction - Re: SearchBlox J2EE Search Component Version 1.2 released

Eric Jain wrote:

>>- Support for PowerPoint documents
>
>May I ask how you extract text from PowerPoint documents? Any open
>source tool, or your own code?

FYI, I recently discovered "ppthtml" in this package:
http://chicago.sourceforge.net/xlhtml/

"antiword" also seems to work well for Word docs.

For PDF text extraction I use a utility from xpdf
(http://www.foolabs.com/xpdf/).

When you get down to it, I have found that the "portable c" tools above
work better than the pure Java ones available. To be fair, however, I have
found that POI does work fine for XLS docs.

 - Dave
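
For readers who want to try the command-line route from Java, here is a
minimal sketch of shelling out to the tools named above and capturing their
standard output. It is not code from this thread. The class and method names
are hypothetical, and it assumes that "pdftotext" (taken here to be the xpdf
utility meant above), "antiword" and "ppthtml" are installed on the PATH,
that each writes its result to stdout when invoked as shown, and that the
output can be read as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of any tool mentioned in the thread.
final class ExternalTextExtractor {

    // Picks the external command for a file based on its extension.
    // Assumption: these invocations print the converted document on stdout.
    static List<String> commandFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return List.of("pdftotext", fileName, "-"); // "-" requests stdout
        }
        if (lower.endsWith(".doc")) {
            return List.of("antiword", fileName);
        }
        if (lower.endsWith(".ppt")) {
            return List.of("ppthtml", fileName); // emits HTML, not plain text
        }
        throw new IllegalArgumentException("No extractor configured for " + fileName);
    }

    // Runs the chosen tool and returns whatever it printed on stdout.
    static String extract(String fileName) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(fileName))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore tool warnings
                .start();
        // Drain stdout before waiting so the child cannot block on a full pipe.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with code " + exitCode + " for " + fileName);
        }
        // Assumption: the tool's output is UTF-8; adjust if your setup differs.
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

The string returned by extract() could then be added to a Lucene document as
a text field. Since ppthtml produces HTML, its output would presumably need
the markup stripped before indexing.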




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: ppt text extraction - Re: SearchBlox J2EE Search Component Version 1.2 released

Posted by Ryan Ackley <sa...@cfl.rr.com>.
> When you get down to it, I have found that "portable c" tools (above)
> work better than the pure java ones avail.  To be fair however I have
> found that POI does work fine for XLS docs.

Gee thanks, you're so generous with your praise.

I would recommend the OpenOffice SDK if you don't mind "portable c". It
supports all the possible MS Office formats going back to the dark ages. It
has a built-in Java programming interface, you don't have to compile it
yourself using cygwin, and it has a huge team of developers working 40+
hours a week to squash any bugs.

-Ryan

