Posted to dev@poi.apache.org by bu...@apache.org on 2009/07/08 11:53:37 UTC

DO NOT REPLY [Bug 47496] New: Strange MS Word file reading behavior

https://issues.apache.org/bugzilla/show_bug.cgi?id=47496

           Summary: Strange MS Word file reading behavior
           Product: POI
           Version: 3.5-dev
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: critical
          Priority: P1
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: andremoniy@gmail.com
                CC: andremoniy@gmail.com


I am seeing some very strange behavior. I have some .doc files in Russian
encoding that contain almost nothing but tables. When reading them, HWPF
returns the first half of the document correctly (each logical element in its
own paragraph and character run), but the second half of the document comes
back as a single paragraph containing a single character run. One effect of
this is that TableIterator works incorrectly and returns only about half of
the document's tables.
Debugging shows that there are some strange jumps in the start<->end values
when reading the Plex of CPs. Here is a printout of the debug info (produced by
code lines manually injected into a recompiled PlexOfCps class):
1. Creating TextPieceTable (in ComplexFileTable analyzing):
-----------------------------
start = 16474 size=448 sizeOfStruct=8
-----------------------------
Start -> 0 to end <-256
Start -> 256 to end <-1280
Start -> 1280 to end <-2048
Start -> 2048 to end <-3072
Start -> 3072 to end <-3840
Start -> 3840 to end <-4864
...
Start -> 25856 to end <-26368
Start -> 26368 to end <-27136
Start -> 27136 to end <-27648
Start -> 27648 to end <-28928
Start -> 28928 to end <-29184
Start -> 29184 to end <-58063 <--- !!! HERE !!!

2. Creating PAPBinTable:
-----------------------------
start = 7117 size=5020 sizeOfStruct=4
-----------------------------
Start -> 2048 to end <-2338
Start -> 2338 to end <-2546
Start -> 2546 to end <-2556
...
Start -> 59556 to end <-59694
Start -> 59694 to end <-59708
Start -> 59708 to end <-60402
Start -> 60402 to end <-264814  <--- !!! HERE !!!
Start -> 264814 to end <-264828
Start -> 264828 to end <-265600
Start -> 265600 to end <-265604
...
Start -> 320214 to end <-321000
Start -> 321000 to end <-321936
Start -> 321936 to end <-321950
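
A jump like the ones marked above can also be spotted mechanically instead of
by reading the printout. The following is only a minimal sketch: the class
PlexJumpCheck, the method findJumps and the factor of 10 are assumptions made
for illustration and are not part of POI. It uses only the entries that are
actually listed in the printouts (the elided "..." rows are not filled in),
and a flagged entry is merely suspicious, not proven wrong.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI: flags start/end entries whose span is
// much larger than any span seen before them.
class PlexJumpCheck {

    /** Returns the indexes of entries longer than factor * the largest earlier span. */
    static List<Integer> findJumps(int[][] entries, int factor) {
        List<Integer> suspicious = new ArrayList<>();
        int maxSeen = 0;
        for (int i = 0; i < entries.length; i++) {
            int length = entries[i][1] - entries[i][0];
            if (maxSeen > 0 && length > (long) factor * maxSeen) {
                suspicious.add(i);          // do not let the outlier raise the baseline
            } else {
                maxSeen = Math.max(maxSeen, length);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Tail of the TextPieceTable printout from the report.
        int[][] textPieces = {
            {25856, 26368}, {26368, 27136}, {27136, 27648},
            {27648, 28928}, {28928, 29184}, {29184, 58063}
        };
        // The factor 10 is an arbitrary threshold chosen for this example.
        for (int i : findJumps(textPieces, 10)) {
            System.out.println("Suspicious entry: "
                    + textPieces[i][0] + " -> " + textPieces[i][1]);
        }
    }
}
```

Comparing against the largest span seen so far keeps the check independent of
absolute offsets, so the same method can be pointed at the PAPBinTable entries.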


Unfortunately, I can't attach the document files because they contain private
information.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 47496] Strange MS Word file reading behavior

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47496





--- Comment #1 from weizi <we...@hotmail.com>  2009-07-24 02:02:54 PST ---
Created an attachment (id=24031)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24031)
number of paragraph is 16

The number of paragraphs in this doc is 16, but HWPF reports 14:
HWPFDocument daDoc = new HWPFDocument(new FileInputStream("test.doc"));
Range wordRange = daDoc.getRange();
int count = wordRange.numParagraphs(); // returns 14
A property claimed to start before zero, at -256! Resetting it to zero, and
hoping for the best
papformatteddiskpag_70=-256 -> 54 = true
A property claimed to start before zero, at -256! Resetting it to zero, and
hoping for the best
papformatteddiskpag_70=54 -> 66 = true
papformatteddiskpag_70=66 -> 67 = true
papformatteddiskpag_70=67 -> 77 = true
papformatteddiskpag_70=77 -> 81 = true
papformatteddiskpag_70=81 -> 95 = true
papformatteddiskpag_70=95 -> 113 = true
papformatteddiskpag_70=113 -> 145 = true
papformatteddiskpag_70=145 -> 173 = true
papformatteddiskpag_70=173 -> 198 = true
papformatteddiskpag_70=198 -> 217 = true
papformatteddiskpag_70=217 -> 230 = true
papformatteddiskpag_70=230 -> 243 = true
papformatteddiskpag_70=243 -> 1052 = true   --here
papformatteddiskpag_70=1052 -> 1078 = true
papformatteddiskpag_70=1078 -> 1117 = true
The findRange method in Range takes a List parameter rpl, and the size of rpl
is 16. rpl[13]._cpStart = 243 and rpl[13]._cpEnd = 1052, while
range._end = 336. Because rpl[13]._cpEnd (1052) > range._end (336), 14 is
returned.
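
The arithmetic above can be replayed with a small model. This is a sketch
under stated assumptions, not POI's actual findRange code: the class
FindRangeSketch and the method countNodesUpTo are hypothetical names, and the
loop only models "stop at the first node whose end reaches the range end". The
start/end pairs are taken from the debug output above, with the first start of
-256 reset to 0 as the log message describes.

```java
// Simplified model of the scan described above; NOT POI's actual findRange code.
class FindRangeSketch {

    /** Counts nodes from index 0 up to and including the first node whose end reaches rangeEnd. */
    static int countNodesUpTo(int[][] nodes, int rangeEnd) {
        int y = 0;
        while (nodes[y][1] < rangeEnd && y < nodes.length - 1) {
            y++;
        }
        return y + 1;
    }

    public static void main(String[] args) {
        // {cpStart, cpEnd} pairs from the printout; 16 entries in total.
        int[][] paragraphs = {
            {0, 54}, {54, 66}, {66, 67}, {67, 77}, {77, 81}, {81, 95},
            {95, 113}, {113, 145}, {145, 173}, {173, 198}, {198, 217},
            {217, 230}, {230, 243}, {243, 1052}, {1052, 1078}, {1078, 1117}
        };
        // With a range end of 336 the scan stops at index 13 (243 -> 1052).
        System.out.println(countNodesUpTo(paragraphs, 336)); // prints 14
    }
}
```

In this model the oversized entry 243 -> 1052 swallows the range end, so the
two entries after it are never counted.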



DO NOT REPLY [Bug 47496] Strange MS Word file reading behavior

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47496

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |LATER

--- Comment #4 from Nick Burch <ni...@alfresco.com> 2011-02-25 16:53:49 EST ---
I believe that the HWPF Unicode-related fixes in the last 18 months should have
fixed these problems. Please re-open the bug if you're still hitting the issues
with a recent nightly / 3.8 beta 1 (which is due out soon).



DO NOT REPLY [Bug 47496] Strange MS Word file reading behavior

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47496

--- Comment #3 from inthendsun@gmail.com 2009-09-20 17:56:04 PDT ---
Does anybody know how to fix the number of paragraphs?



DO NOT REPLY [Bug 47496] Strange MS Word file reading behavior

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=47496



--- Comment #2 from weizi <we...@hotmail.com> 2009-08-27 01:18:26 PDT ---
The reason is that the TextPiece addresses in the TextPieceTable form a
logical sequence, whereas the addresses in the PAPBinTable and CHPBinTable are
physical addresses, which are not always sequential. As a result, the
PAPBinTable and CHPBinTable addresses do not correspond to the TextPiece
positions.
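
To illustrate the difference between the two kinds of address, here is a
minimal sketch of translating a physical file offset into a logical character
position through a list of text pieces. Everything in it is an assumption made
for illustration: the PieceMapper and Piece classes, their fields, and the
piece layout are made up, and this is not how POI itself is implemented.

```java
// Hypothetical model, not POI code: maps physical file offsets to logical
// character positions (CPs) using a list of text pieces.
class PieceMapper {

    /** A text piece: a logical CP range plus the place where its bytes are stored. */
    static final class Piece {
        final int cpStart;
        final int cpEnd;
        final int fileOffset;
        final int bytesPerChar;

        Piece(int cpStart, int cpEnd, int fileOffset, int bytesPerChar) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fileOffset = fileOffset;
            this.bytesPerChar = bytesPerChar;
        }
    }

    /** Returns the logical CP for a physical byte offset, or -1 if no piece covers it. */
    static int toCharPosition(Piece[] pieces, int fileOffset) {
        for (Piece p : pieces) {
            int byteLength = (p.cpEnd - p.cpStart) * p.bytesPerChar;
            if (fileOffset >= p.fileOffset && fileOffset < p.fileOffset + byteLength) {
                return p.cpStart + (fileOffset - p.fileOffset) / p.bytesPerChar;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up layout: the second logical piece is stored EARLIER in the
        // file than the first, so physical order differs from logical order.
        Piece[] pieces = {
            new Piece(0, 100, 5000, 1),   // CPs 0..99, 1 byte per character
            new Piece(100, 150, 2048, 2)  // CPs 100..149, 2 bytes per character
        };
        System.out.println(toCharPosition(pieces, 5010)); // prints 10
        System.out.println(toCharPosition(pieces, 2060)); // prints 106
    }
}
```

In this made-up layout the lower physical offset (2060) maps to the higher
logical position (106), which is why physical start/end values cannot be
compared directly with text piece positions.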
