Posted to user@poi.apache.org by Joerg Hohwiller <jo...@j-hohwiller.de> on 2007/01/08 21:14:10 UTC

help with POI & Co.

Hi there,

I am a newbie to this list.

For my open-source project I wrote a search solution using lucene
that can extract text content from binary files.
For MS-Office files I use POI.

The POI basics seem to work fine and are stable, but I have problems with
the parts built on top of them that extract the text.

For msword I tried HWPF but the result was really bad. I discovered
tm-extractors from textmining.org, which is not perfect but quite useful.
It seems to be related to POI somehow, but I cannot find much information
because the site www.textmining.org was hacked a long time ago, so
the project seems to be quite dead.
From the sources I found in the maven repository, it was written by
Ryan Ackley. I have modified the sources so that the constructor can also
take a POIFilesystem and not only a File.
There are still some bugs. I would fix them, but would I be allowed to
create a new release of this code and publish it with my project?
Or is there a way to submit a patch to textmining.org?

For powerpoint I tried HSLF, which could not parse most of the documents.
So I wrote my own solution, which seems to accept all
documents but produces strange duplications of text passages. Maybe someone
out there has the knowledge to help me with that.

For excel I tried HSSF, which throws an exception for every document I read.
Maybe I am using the API in the wrong way. Since writing the powerpoint parser myself
was a real pain (these formats are so ugly), I do not want to go through hell
again for excel.

Please help me, if you have any hints...

You can find my work at:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/

Best regards
  Jörg

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: help with POI & Co.

Posted by Joerg Hohwiller <jo...@j-hohwiller.de>.
Hi Nick,

> You shouldn't really have any problems with HSSF. There are lots of
> examples for hssf, did you follow them?
I had a look at the examples and rewrote my code completely, switching to event
listener mode. This works a lot better and consumes less memory.
I still get some ArrayIndexOutOfBoundsExceptions when UnknownRecords are created.
I promise to try the latest version from trunk, and if I still get those bugs
I will open an issue and maybe supply a patch and an example document. The
problem is that the errors occur mainly in documents that contain information not
intended for the public. I will see what I can do for you...

BTW: do you agree with what David said about the roadmap for a new POI release?
> 
> Nick
Best Regards
  Jörg



Re: help with POI & Co.

Posted by Joerg Hohwiller <jo...@j-hohwiller.de>.
David Fisher schrieb:
> On Jan 9, 2007, at 2:28 PM, Joerg Hohwiller wrote:
>> Besides I used the official POI release which is very old. I did NOT
>> try the
>> HEAD from svn.
> 
> Jörg:
Hi David,
> 
> You want poi3_alpha3, or something like that. (I have Yegor do all my
> POI work for me :-)
> 
> Use the latest. I'm sure Nick is talking about it and definitely *not*
> the ancient, decrepit, last official release.
Okay, I guessed it.
> 
> A new official release is "in the works" when the POI guys work out some
> details with the Jakarta overseers. I *think* the community will vote to
> release current stuff soon (in the next week / month?)
That would have been my next question. But I have heard things like this from
other projects (e.g. maven plugins), and I waited and waited and finally a year
passed. Besides, I was a little confused that the latest release is about 2.5
years old. That suggests something odd has happened to the community in that time.

For my personal needs a nightly build suits me well, but
for real usage I tend to use an official version.
Good luck to all POI activists with the next release...
> 
> Regards,
> Dave Fisher
Regards
  Jörg



Re: help with POI & Co.

Posted by David Fisher <df...@jmlafferty.com>.
On Jan 9, 2007, at 2:28 PM, Joerg Hohwiller wrote:
> Besides I used the official POI release which is very old. I did  
> NOT try the
> HEAD from svn.

Jörg:

You want poi3_alpha3, or something like that. (I have Yegor do all my  
POI work for me :-)

Use the latest. I'm sure Nick is talking about it and definitely  
*not* the ancient, decrepit, last official release.

A new official release is "in the works" when the POI guys work out  
some details with the Jakarta overseers. I *think* the community will  
vote to release current stuff soon (in the next week / month?)

Regards,
Dave Fisher





Re: help with POI & Co.

Posted by Joerg Hohwiller <jo...@j-hohwiller.de>.
Nick Burch schrieb:
> On Tue, 9 Jan 2007, Joerg Hohwiller wrote:
>> Besides I used the official POI release which is very old. I did NOT
>> try the
>> HEAD from svn.
> 
> You should probably try with the svn head, you will generally have more
> luck with HWPF and HSLF from there.
Okay, thanks for the tip.
> 
>> I did NOT even open most of the documents. The constructor caused an
>> exception. Something like illegal fileformat or magic-number or
>> something.
> 
> I use hslf for a web spider that tries lots of random documents, and
> it's ok on almost all of them, so it's odd that you're having such problems
> 
>>> (Normally you want to catch CorruptPowerPointFileException and
>>> EncryptedPowerPointFileException, and skip over them, and catch
>>> ArrayIndexOutOfBoundsException, and report bugs for those)
>>
>> If an ArrayIndexOutOfBoundsException is thrown by a method where the
>> user did not supply an index as parameter the implementation looks
>> like a hack to me. Same applies to NullPointerExceptions.
> 
> These two are caused by powerpoint files containing things that we
> didn't know they might, and which our test documents don't. If you
> report bugs for them, and include the problem document, we can try and
> figure out which of our assumptions on the file format are wrong, and
> work to fix them.
I already debugged into it. It occurred when an UnknownRecord was created.
It is generally not a good idea to assume anything about data you do not know.
In such situations you should always check indices and lengths before
accessing or copying arrays.
Besides, I have seen printStackTrace() calls, which is generally bad practice in a
library. Please use nested exceptions for situations like this.
I hope this has already been fixed in the 2.5 years since the release...
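
To make the two suggestions concrete, here is a minimal sketch in plain Java.
It is not taken from POI: SafeRecordCopy, MalformedRecordException, copyBody
and readBody are hypothetical names made up for this mail. copyBody validates
offset and length before copying, and readBody nests the low-level IOException
instead of printing it.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper for illustration; none of these names exist in POI. */
final class SafeRecordCopy {

    /** Unchecked exception that keeps the low-level failure as its cause. */
    static final class MalformedRecordException extends RuntimeException {
        MalformedRecordException(String message) {
            super(message);
        }

        MalformedRecordException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Copies declaredLength bytes starting at offset, after checking that the range exists. */
    static byte[] copyBody(byte[] data, int offset, int declaredLength) {
        // Compare against data.length - offset so that offset + declaredLength cannot overflow.
        if (offset < 0 || declaredLength < 0
                || offset > data.length || declaredLength > data.length - offset) {
            throw new MalformedRecordException("Record claims " + declaredLength
                    + " bytes at offset " + offset + ", but the buffer holds " + data.length);
        }
        return Arrays.copyOfRange(data, offset, offset + declaredLength);
    }

    /** Reads a record body from a stream and nests any I/O failure instead of printing it. */
    static byte[] readBody(InputStream in, int declaredLength) {
        if (declaredLength < 0) {
            throw new MalformedRecordException("Negative record length: " + declaredLength);
        }
        byte[] body = new byte[declaredLength];
        try {
            new DataInputStream(in).readFully(body);
        } catch (IOException e) {
            // No printStackTrace(): the caller gets the full cause chain and decides how to log it.
            throw new MalformedRecordException(
                    "Could not read a record body of " + declaredLength + " bytes", e);
        }
        return body;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(copyBody(data, 1, 3)));
        try {
            readBody(new ByteArrayInputStream(data), 10); // stream is shorter than claimed
        } catch (MalformedRecordException e) {
            System.out.println(e.getMessage() + " / cause: " + e.getCause());
        }
    }
}
```

I made the exception unchecked only to keep the sketch short; a checked
exception would work just as well. The point is that the caller sees a
descriptive message plus the original cause rather than a bare
ArrayIndexOutOfBoundsException.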
> 
>> My problem is that I extract many parts of text twice from the file.
>> It seems to me that they are really in there twice even though not
>> visible to the powerpoint application user.
> 
> Yup, that's to be expected on quicksaved files.
> QuickButCruddyTextExtractor will do something similar.
okay.
> 
> Your only option if you want to avoid that is to implement all the
> PersistPtr stuff, then parse SlideListWithTexts, and DoTheRightThing(tm)
> with it all. At which point, you've re-implemented most of hslf....
That sounds like a useful hint. I will have a look at it and also compare this
option with using the latest trunk. Thanks!
> 
> Nick
Regards
  Jörg



Re: help with POI & Co.

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 9 Jan 2007, Joerg Hohwiller wrote:
> Besides I used the official POI release which is very old. I did NOT try the
> HEAD from svn.

You should probably try with the svn head, you will generally have more 
luck with HWPF and HSLF from there.

> I did NOT even open most of the documents. The constructor caused an 
> exception. Something like illegal fileformat or magic-number or 
> something.

I use hslf for a web spider that tries lots of random documents, and it's 
ok on almost all of them, so it's odd that you're having such problems

>> (Normally you want to catch CorruptPowerPointFileException and
>> EncryptedPowerPointFileException, and skip over them, and catch
>> ArrayIndexOutOfBoundsException, and report bugs for those)
>
> If an ArrayIndexOutOfBoundsException is thrown by a method where the user 
> did not supply an index as parameter the implementation looks like a 
> hack to me. Same applies to NullPointerExceptions.

These two are caused by powerpoint files containing things that we didn't 
know they might, and which our test documents don't. If you report bugs 
for them, and include the problem document, we can try and figure out 
which of our assumptions on the file format are wrong, and work to fix 
them.

> My problem is that I extract many parts of text twice from the file. It 
> seems to me that they are really in there twice even though not visible 
> to the powerpoint application user.

Yup, that's to be expected on quicksaved files. 
QuickButCruddyTextExtractor will do something similar.

Your only option if you want to avoid that is to implement all the 
PersistPtr stuff, then parse SlideListWithTexts, and DoTheRightThing(tm) 
with it all. At which point, you've re-implemented most of hslf....
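
A rough illustration of the PersistPtr idea, as a simplified sketch rather
than HSLF's actual code: each save appends a directory that maps persist ids
to stream offsets, and a quicksave only lists the records it rewrote. Applying
the directories oldest first leaves exactly one offset per id, so the stale
copy of a rewritten record is never visited. PersistDirectorySketch and
latestOffsets are hypothetical names, and the byte-level parsing of the
records is left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of merging per-save persist directories. */
final class PersistDirectorySketch {

    /**
     * Merges the directories of all saves, given oldest first, into one map
     * from persist id to the offset of the most recent copy of that record.
     */
    static Map<Integer, Long> latestOffsets(List<Map<Integer, Long>> editsOldestFirst) {
        Map<Integer, Long> latest = new LinkedHashMap<>();
        for (Map<Integer, Long> edit : editsOldestFirst) {
            // A newer save overrides the offset recorded earlier for the same id.
            latest.putAll(edit);
        }
        return latest;
    }

    public static void main(String[] args) {
        Map<Integer, Long> fullSave = Map.of(1, 100L, 2, 200L);
        Map<Integer, Long> quickSave = Map.of(2, 900L); // only record 2 was rewritten
        System.out.println(latestOffsets(List.of(fullSave, quickSave)));
    }
}
```

An extractor that reads text only from the offsets in the merged map skips
the old copy of record 2 at offset 200, which is where the duplicate passages
come from.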

Nick



Re: help with POI & Co.

Posted by Joerg Hohwiller <jo...@j-hohwiller.de>.
Nick Burch schrieb:
> On Mon, 8 Jan 2007, Joerg Hohwiller wrote:
>> For msword I tried HWPF but the result was really bad.
> 
> Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't
> filter out all the "text" entries that aren't really text, but any
> patches to fix that would be appreciated :)
That is what I tried.
Well, it threw exceptions for most of the documents.
My problem is that I have a huge repository with very old to very new
documents. Technically this means that all the sins of Office
history can be found in the documents I need to read...
I read that textmining also supports older versions of Word that are not
supported by HWPF.
Besides, I used the official POI release, which is very old. I did NOT try the
HEAD from svn.
> 
> For spidering, it's normally fine to use, since it doesn't normally
> matter if you get a few "bonus" words through for some of the special
> fields.
> 
>> I have modified the sources so that the constructor can also take a
>> POIFilesystem and not only a File. There are still some bugs. I would
>> fix them but would I be allowed to create a new release of this stuff
>> and publish it with my project? Or is there a way to submit a
>> patch to textmining.org?
> 
> textmining.org belongs to Ryan Ackley, who used to contribute to POI,
> until he went to work for a company that licenses the file format
> documentation from Microsoft. You'll need to contact him yourself with
> any patches.
I will see what I can do...
> 
>> For powerpoint I tried HSLF, which could not parse most of the documents.
> 
> That's odd. I have almost no trouble using
> org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of
> powerpoint documents. What problems did you hit?
I could NOT even open most of the documents. The constructor threw an exception,
something like illegal file format or magic number.
> 
> (Normally you want to catch CorruptPowerPointFileException and
> EncryptedPowerPointFileException, and skip over them, and catch
> ArrayIndexOutOfBoundsException, and report bugs for those)
If an ArrayIndexOutOfBoundsException is thrown by a method where the user
did not supply an index as a parameter, the implementation looks like a hack to me.
The same applies to NullPointerExceptions.
I got all of these...
The POIFilesystem and the code that extracts the metadata seem very stable
to me. But I did not have good experiences with the rest of POI.
Anyhow, I have now written a PPT extractor from scratch that is based only on
POIFilesystem and NOT on HSLF. The advantage is a
low memory footprint: my class can be configured not to exceed a specific
buffer size for allocation, so users do NOT get an OutOfMemoryError if an
evil file is too big. Even those evil files are parsed, but
only as much data is extracted as the configured buffer size allows.
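
The buffer limit boils down to something like the following. This is a
stripped-down sketch of the idea, not the code from my repository;
BoundedReader and readAtMost are names I made up for this mail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: never holds more than maxBytes of a stream in memory. */
final class BoundedReader {

    /** Reads at most maxBytes from the stream; anything beyond the limit is left unread. */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        byte[] buffer = new byte[maxBytes];
        int filled = 0;
        while (filled < maxBytes) {
            int n = in.read(buffer, filled, maxBytes - filled);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            filled += n;
        }
        // Trim to the number of bytes that were really read.
        return Arrays.copyOf(buffer, filled);
    }

    public static void main(String[] args) throws IOException {
        byte[] big = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        byte[] head = readAtMost(new ByteArrayInputStream(big), 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
    }
}
```

The allocation is fixed by the configured limit and not by whatever length
the file claims, so an oversized document is truncated instead of causing an
OutOfMemoryError.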

My problem is that I extract many parts of text twice from the file.
It seems to me that they are really in there twice even though not visible
to the powerpoint application user.

If someone can help me with that I would be very grateful for any hint:

http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-ppt/src/main/java/net/sf/mmm/search/parser/impl/ContentParserPpt.java

> 
>> For excel I tried HSSF, which throws an exception for every document I
>> read.
> 
> You shouldn't really have any problems with HSSF. There are lots of
> examples for hssf, did you follow them?
I suppose NOT. I will look at them.

This is my code:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-xls/src/main/java/net/sf/mmm/search/parser/impl/ContentParserXls.java

After I have checked for my own mistakes, I will send you the stack traces of any remaining problems.
> 
> Nick
Thanks
  Jörg



Re: help with POI & Co.

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 8 Jan 2007, Joerg Hohwiller wrote:
> For msword I tried HWPF but the result was really bad.

Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't 
filter out all the "text" entries that aren't really text, but any patches 
to fix that would be appreciated :)

For spidering, it's normally fine to use, since it doesn't normally matter 
if you get a few "bonus" words through for some of the special fields.

> I have modified the sources so that the constructor can also take a 
> POIFilesystem and not only a File. There are still some bugs. I would 
> fix them but would I be allowed to create a new release of this stuff 
> and publish it with my project? Or is there a way to submit a patch 
> to textmining.org?

textmining.org belongs to Ryan Ackley, who used to contribute to POI, 
until he went to work for a company that licenses the file format 
documentation from Microsoft. You'll need to contact him yourself with any 
patches.

> For powerpoint I tried HSLF, which could not parse most of the documents.

That's odd. I have almost no trouble using 
org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of 
powerpoint documents. What problems did you hit?

(Normally you want to catch CorruptPowerPointFileException and 
EncryptedPowerPointFileException, and skip over them, and catch 
ArrayIndexOutOfBoundsException, and report bugs for those)
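
In a spider, that triage looks roughly like the sketch below. To keep it
compilable without POI on the classpath, CorruptFileException and
EncryptedFileException are local stand-ins for CorruptPowerPointFileException
and EncryptedPowerPointFileException, and ExtractionTriage, Outcome and
extract are hypothetical names. The Supplier stands for whatever call runs the
real extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: skip known-bad files, collect likely library bugs for reporting. */
final class ExtractionTriage {

    /** Stand-in for POI's CorruptPowerPointFileException (assumption, not the real class). */
    static final class CorruptFileException extends RuntimeException {
        CorruptFileException(String message) {
            super(message);
        }
    }

    /** Stand-in for POI's EncryptedPowerPointFileException (assumption, not the real class). */
    static final class EncryptedFileException extends RuntimeException {
        EncryptedFileException(String message) {
            super(message);
        }
    }

    enum Outcome { OK, SKIPPED, BUG }

    /** One line per failure that is worth a bug report, together with the file name. */
    final List<String> bugReports = new ArrayList<>();

    Outcome extract(String fileName, Supplier<String> extractor, StringBuilder sink) {
        try {
            sink.append(extractor.get());
            return Outcome.OK;
        } catch (CorruptFileException | EncryptedFileException e) {
            // Expected for broken or password-protected files: skip and move on.
            return Outcome.SKIPPED;
        } catch (ArrayIndexOutOfBoundsException e) {
            // Unexpected: the file contains something the parser did not anticipate.
            bugReports.add(fileName + ": " + e);
            return Outcome.BUG;
        }
    }

    public static void main(String[] args) {
        ExtractionTriage triage = new ExtractionTriage();
        StringBuilder text = new StringBuilder();
        System.out.println(triage.extract("good.ppt", () -> "slide text", text));
        System.out.println(triage.extract("locked.ppt", () -> {
            throw new EncryptedFileException("password protected");
        }, text));
    }
}
```

Anything that ends up in bugReports is a candidate for a bug report, ideally
with the problem document attached.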

> For excel I tried HSSF, which throws an exception for every document I read.

You shouldn't really have any problems with HSSF. There are lots of 
examples for hssf, did you follow them?

Nick
