Posted to java-user@lucene.apache.org by petite_abeille <pe...@mac.com> on 2003/10/30 19:20:43 UTC

Exotic format indexing?

Hello,

Indexing a multitude of esoteric formats (MS Office, PDF, etc.) is a
popular question on this list...

The traditional approach seems to be to find some kind of
format-specific reader to properly extract the textual part of such
documents for indexing. The drawback of such an approach is that it's
complicated and cumbersome: many different formats, not that many Java
libraries to understand them all.

An alternative to such a mess could perhaps be to convert that
multitude of formats into something more or less standard and then
extract the text from that. But again, this doesn't seem to be such a
straightforward proposition. For example, one could imagine "printing"
every document to PDF and then converting the resulting PDF to text.
Not a piece of cake in Java.

Finally, a while back, somebody on this list mentioned quite a
different approach: simply read the raw binary document and go fishing
for what looks like text. I would like to try that :)

Does anyone remember this proposal? Has anyone tried such an approach?

Thanks for any pointers.

Cheers,

PA.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Exotic format indexing?

Posted by Ryan Ackley <sa...@cfl.rr.com>.

> Finally, a while back, somebody on this list mentioned quite a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)

I have tried that approach and it works OK, but you end up with a bunch
of junk in with the useful stuff, which can clutter up your index and
make searching slower. It also won't work for the many file formats that
don't store all of their text as sequential text. PDF is one; I know
that PowerPoint is another.
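
For illustration, the "fishing" idea could be sketched roughly as below.
This is only a minimal sketch, not code from anyone on this list: the
class name RawTextFisher, the method fish() and the minimum run length
of 4 are all made up for the example. It keeps runs of printable ASCII
bytes and drops everything else, so it assumes single-byte text;
anything stored as 16-bit characters, or compressed, would be missed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: pulls runs of printable
// ASCII characters out of an arbitrary byte array.
final class RawTextFisher {

    // Returns every run of at least minLength printable ASCII
    // characters (0x20 to 0x7E), in the order they occur.
    static List<String> fish(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                current.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Keep a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        // Join the runs into one string that could be fed to an indexer.
        System.out.println(String.join(" ", fish(data, 4)));
    }
}
```

Raising the minimum run length should cut down on the junk at the cost
of losing short words that sit alone between binary bytes.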


Re: Exotic format indexing?

Posted by petite_abeille <pe...@mac.com>.

On Oct 30, 2003, at 20:48, Ben Litchfield wrote:

> Unfortunately, it is not quite so easy.  I am not sure about Word
> documents

The raw text is visible.

> but PDFs usually have their contents compressed

Yep. PDF is really an image format ;)

> so a raw
> "fishing" around for text would be pointless.

That's alright. I can handle PDF separately if the need arises.

>  Your best bet is to use a package like the one from textmining.org
> that handles various formats for you.

Perhaps. But I'm only looking for a "good enough" solution, not a 
perfect one :)

Cheers,

PA.


Re: 182 file formats for lucene!!! was: Re: Exotic format indexing?

Posted by petite_abeille <pe...@mac.com>.

Hi Stefan,

On Oct 30, 2003, at 21:02, Stefan Groschupf wrote:

> just to let you know, I have implemented a plugin for the Nutch
> project that can parse 182 file formats, including m$ office.
> I simply use OpenOffice and the available Java API.

Yes, I saw that. Great work :)

Unfortunately, using OpenOffice is not an option in my case :(

Cheers,

PA.


182 file formats for lucene!!! was: Re: Exotic format indexing?

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi there,

just to let you know, I have implemented a plugin for the Nutch
project that can parse 182 file formats, including m$ office.
I simply use OpenOffice and the available Java API.

It is really straightforward to use.

You can find some information and a link to the open source code here:
http://sourceforge.net/tracker/index.php?func=detail&aid=828517&group_id=59548&atid=491356

Feel free to recycle the code and give me any feedback.
Hope it will help to free some information from some strange commercial 
formats, since information should be free. ;)

Cheers
Stefan

Re: Exotic format indexing?

Posted by Ben Litchfield <be...@csh.rit.edu>.

Unfortunately, it is not quite so easy.  I am not sure about Word
documents but PDFs usually have their contents compressed so a raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.
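
To illustrate why compression defeats a raw scan, here is a minimal
sketch using java.util.zip, which implements the same deflate
compression that PDF's FlateDecode filter relies on. The class name
CompressedTextDemo and its methods are made up for the example, and a
real PDF has far more structure than a single compressed buffer. The
point is only that the word is present in the plain bytes, is not
findable byte-for-byte once deflated, and comes back after inflating.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of any PDF library.
final class CompressedTextDemo {

    // Compresses the input with zlib/deflate at the default level.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); rejects corrupt or truncated input.
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // True if the needle appears byte-for-byte in the data.
    static boolean containsRaw(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) {
        byte[] plain = "Lucene indexes text. Lucene searches text. Lucene is written in Java."
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(plain);
        System.out.println("plain:    " + containsRaw(plain, "Lucene"));
        System.out.println("deflated: " + containsRaw(packed, "Lucene"));
        System.out.println("inflated: " + containsRaw(inflate(packed), "Lucene"));
    }
}
```

A format-aware package does the equivalent of the inflate() step (and
much more) before handing text to the indexer.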

Ben