Posted to user@poi.apache.org by TKDD <my...@gmail.com> on 2006/11/17 02:04:20 UTC

StringIndexOutOfBoundException while parsing msword

Hi, list,

I am using nutch-0.8.1, which uses POI to parse MS Word files.
It works well with English documents, even when the doc file is
pretty large.

But it throws a StringIndexOutOfBoundsException when the doc (only one page)
is written in Chinese characters.

I tried to isolate the problem and found that if I use
HWPFDocument.getRange().text() to read a local Chinese file, it's OK.
But in Nutch's way,
DocumentInputStream->CHPBinTable->ComplexFileTable->TextPieceTable...,
it finally hits a StringIndexOutOfBoundsException because the parameter in
TextPiece.substring() is negative.

I am going to do some further study on this, but wonder if anyone else has
had similar experiences?

thanks


TKDD

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/

Re: StringIndexOutOfBoundException while parsing msword

Posted by Rainer Schwarze <rs...@admadic.de>.

At 03:48 20.11.2006, TKDD wrote:
>Hi, Rainer,
[...]
>By the way, what are fast-saved Word files? Thanks.

Sorry for my late reply - I was strongly focused on the private part
of my life.

In Word one can switch between "fast saving" and the normal way (I
believe fast save is activated by default, but that may have changed
with the newer versions). If fast saving is off, Word creates a nicely
structured file when the document is saved. If fast saving is on,
Word appends new data and changed data to the file (simply speaking).
So in fast-saved files, information is distributed all over the file,
and the text and the formatting information are stored in "arbitrary"
order. The setting is in the "Save"(?) tab in the options window of
Word (I don't know what it looks like after Word 2000).
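
To make the "arbitrary order" point concrete, here is a rough Java sketch of the idea. It is a simplified model only, not the real Word file format and not HWPF code: the FastSaveSketch and Piece names, the offsets, and the sample text are all made up for illustration. It assumes each stored fragment records both where it sits in the file and where it belongs in the document text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FastSaveSketch {
    // Hypothetical model of one stored text fragment (not an HWPF class).
    static final class Piece {
        final int logicalStart; // where the fragment belongs in the document text
        final int fileOffset;   // where the fragment physically sits in the file
        final String text;

        Piece(int logicalStart, int fileOffset, String text) {
            this.logicalStart = logicalStart;
            this.fileOffset = fileOffset;
            this.text = text;
        }
    }

    // Concatenates the fragments after sorting a copy by the given order.
    static String join(List<Piece> pieces, Comparator<Piece> order) {
        List<Piece> sorted = new ArrayList<>(pieces);
        sorted.sort(order);
        StringBuilder sb = new StringBuilder();
        for (Piece p : sorted) {
            sb.append(p.text);
        }
        return sb.toString();
    }

    // What a reader gets if it walks the file front to back.
    static String inFileOrder(List<Piece> pieces) {
        return join(pieces, Comparator.comparingInt((Piece p) -> p.fileOffset));
    }

    // What a reader gets if it follows the logical positions instead.
    static String inDocumentOrder(List<Piece> pieces) {
        return join(pieces, Comparator.comparingInt((Piece p) -> p.logicalStart));
    }

    public static void main(String[] args) {
        // "Hello world" was saved first; "fast-saved " was inserted later
        // and, in this model, appended at the end of the file.
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 100, "Hello "),
                new Piece(17, 106, "world"),
                new Piece(6, 200, "fast-saved "));
        System.out.println(inFileOrder(pieces));
        System.out.println(inDocumentOrder(pieces));
    }
}
```

In this model, reading in file order puts the appended fragment after "world", while following the logical positions restores the intended text. A parser that cannot follow such a mapping would see the fragments out of order.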

best wishes,
Rainer


Re: StringIndexOutOfBoundException while parsing msword

Posted by TKDD <my...@gmail.com>.

Hi, Rainer,

I heard about the 2-byte character problem, but when I handle the
Chinese doc in the way below, it works well:

/*
  WordExtractor wextractor = new WordExtractor((ByteArrayInputStream)input);
  return wextractor.getTextFromPieces();
*/

So I rewrote the extractText method in Nutch this way. It's OK now.

According to your comment, HWPF still has problems dealing with
2-byte characters. I would really like to know whether they are going to
cause problems in my Chinese Word parser, even though it seems to work well now.

By the way, what are fast-saved Word files? Thanks.

best regards,

TKDD

Rainer Schwarze wrote:
> Documents with 2-byte characters (that's how Chinese characters are
> probably stored) are not handled correctly by HWPF.
> Many places in the code need to be adjusted to make that work well.
>
> One more thing you need to consider: HWPF cannot handle "fast saved"
> Word files. If the documents you need to parse are "fast saved", this
> adds an extra level of complexity.
>
> Which information from the Word files do you need to parse?
>
> Best wishes,
> Rainer
>   


Re: StringIndexOutOfBoundException while parsing msword

Posted by Rainer Schwarze <rs...@admadic.de>.

Hi,

At 02:04 17.11.2006, TKDD wrote:
[...]
>But it throws a StringIndexOutOfBoundsException when the doc (only one page)
>is written in Chinese characters.

Documents with 2-byte characters (that's how Chinese characters are
probably stored) are not handled correctly by HWPF.
Many places in the code need to be adjusted to make that work well.
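
To illustrate the kind of mismatch involved, here is a minimal Java sketch. It is not HWPF's actual code: the PieceOffsetSketch class and its methods are hypothetical, and it assumes the Chinese text is stored at two bytes per character (UTF-16LE). The exact arithmetic that goes negative inside TextPiece.substring() is not shown in this thread, so the sketch only demonstrates the general failure: an offset counted in bytes, used directly as a character index, falls outside the string.

```java
import java.nio.charset.StandardCharsets;

class PieceOffsetSketch {
    // Hypothetical helper: maps a byte offset within a piece to a char index.
    static int toCharIndex(int byteOffsetInPiece, boolean twoBytesPerChar) {
        return twoBytesPerChar ? byteOffsetInPiece / 2 : byteOffsetInPiece;
    }

    // Returns the text of a run whose bounds are given as byte offsets
    // relative to the start of the piece.
    static String runText(String pieceText, int startByte, int endByte,
                          boolean twoBytesPerChar) {
        return pieceText.substring(
                toCharIndex(startByte, twoBytesPerChar),
                toCharIndex(endByte, twoBytesPerChar));
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as escapes to avoid encoding issues.
        String pieceText = "\u4e2d\u6587\u6587\u6863";
        int byteLength = pieceText.getBytes(StandardCharsets.UTF_16LE).length;

        // A run covering the second half of the piece, expressed in bytes.
        int startByte = byteLength / 2;
        int endByte = byteLength;

        // Correct: byte offsets are halved before indexing the string.
        System.out.println(runText(pieceText, startByte, endByte, true).length());

        // Wrong: byte offsets are treated as if each char took one byte.
        try {
            runText(pieceText, startByte, endByte, false);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("byte offsets used as char indices: " + e.getMessage());
        }
    }
}
```

With one byte per character, as in typical English documents, the two kinds of offset coincide, which would be consistent with English files parsing fine while Chinese files fail.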

One more thing you need to consider: HWPF cannot handle "fast saved"
Word files. If the documents you need to parse are "fast saved", this
adds an extra level of complexity.

Which information from the Word files do you need to parse?

Best wishes,
Rainer
