Posted to dev@poi.apache.org by bu...@apache.org on 2009/10/28 15:00:03 UTC

DO NOT REPLY [Bug 48075] New: Broken paragraph to text mapping in some documents

https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

           Summary: Broken paragraph to text mapping in some documents
           Product: POI
           Version: 3.5-dev
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: max.valjanski@gmail.com


WordExtractor.getParagraphText() extracts incomplete and broken text data from
the attached document. However, WordExtractor.getTextFromPieces() extracts the
complete, correct text (the same as in MS Office).

It seems that there is a problem in the paragraph to text mapping.

The problem occurs in a few documents from the same source; text extraction
from many other documents works fine.

POI version poi-3.6-beta1-20091002 (svn trunk)

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

--- Comment #2 from Maxim Valyanskiy <ma...@gmail.com> 2010-08-04 07:52:38 EDT ---
The paragraph offsets (FC) in the PAPX of this file are 2048 bytes larger than
the offsets of the real character data in the text pieces. Hm.



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

Maxim Valyanskiy <ma...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|3.5-dev                     |3.6-dev



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

--- Comment #6 from Sergey Vladimirov <vl...@gmail.com> 2011-07-12 10:40:03 UTC ---
Maxim,

No, it doesn't look like a quick-saved file:

[FIB]
...
         .fComplex                 = false
...
[/FIB]

Although it was quick-saved 15 times, it is currently marked as a fully saved
file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is
no SPRM quick-save data.
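
For reference, the two FIB values mentioned above (fComplex and the quick-save
counter) can be decoded from the 16-bit flags word of the FIB header. The
following is a minimal sketch, assuming the layout from the published Word
binary format description: the flags word is stored little-endian at byte
offset 10 of the FIB, fComplex is bit 0x0004 and the quick-save counter
occupies bits 0x00F0. The class name FibFlags, its methods and the sample
value 0x00F0 are hypothetical; the real flags word of the attached file is not
quoted in this thread.

```java
/** Sketch only: decodes two fields of the FIB flags word (assumed layout). */
final class FibFlags {
    // Assumed bit masks within the 16-bit flags word of the FIB header.
    static final int F_COMPLEX_MASK = 0x0004;
    static final int C_QUICK_SAVES_MASK = 0x00F0;
    // Assumed byte offset of the flags word inside the FIB.
    static final int FLAGS_OFFSET = 10;

    /** Reads the little-endian 16-bit flags word from the start of a FIB. */
    static int readFlags(byte[] fib) {
        return (fib[FLAGS_OFFSET] & 0xFF) | ((fib[FLAGS_OFFSET + 1] & 0xFF) << 8);
    }

    /** True if the file is flagged as complex (quick-saved) format. */
    static boolean isComplex(int flags) {
        return (flags & F_COMPLEX_MASK) != 0;
    }

    /** Number of quick saves recorded in the flags word (0..15). */
    static int quickSaves(int flags) {
        return (flags & C_QUICK_SAVES_MASK) >>> 4;
    }

    public static void main(String[] args) {
        // Hypothetical flags value: counter at 15, fComplex cleared.
        int flags = 0x00F0;
        System.out.println("fComplex    = " + isComplex(flags));
        System.out.println("cQuickSaves = " + quickSaves(flags));
    }
}
```

Since the counter is only 4 bits wide in this layout, 15 is the largest value
it can hold.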



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

--- Comment #1 from Maxim Valyanskiy <ma...@gmail.com> 2009-10-28 07:01:05 UTC ---
Created an attachment (id=24433)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24433)
document



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

--- Comment #4 from Sergey Vladimirov <vl...@gmail.com> 2011-07-11 16:58:17 UTC ---
This file seems very wrong to me. Neither OpenOffice nor LibreOffice can even
show it correctly.

In more detail, it has 2 TextPieces:

TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))

but all CHPX refer to the second text piece:

* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)

as well as PAPX:
* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)

so it is just a bad file, AFAIK.

Apart from that, there is a table without a single row or cell, i.e. there is
a PAPX with inTable=true, but no end-of-cell marks.
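
The byte-to-character arithmetic behind the listing above can be sketched as
follows. This is a simplified model and not POI's implementation: the class
PieceMapper and its methods are hypothetical names, and it assumes that each
unicode piece stores 2 bytes per character starting at the "pos" value of its
PieceDescriptor. Under that assumption the start offsets quoted above map back
to the listed character positions, while the end offsets 4478, 11776 and 12102
lie outside the byte ranges of both pieces.

```java
/** Sketch only: maps file byte offsets (FC) to character positions (CP). */
final class PieceMapper {

    /** Simplified text piece: a character range plus its byte offset in the file. */
    static final class Piece {
        final int cpStart;
        final int cpEnd;
        final int fcStart;
        final boolean unicode;

        Piece(int cpStart, int cpEnd, int fcStart, boolean unicode) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1; // assumption: unicode pieces use 2 bytes per char
        }

        /** Exclusive end of the piece's byte range. */
        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar();
        }
    }

    /** The two pieces as listed in the comment above. */
    static Piece[] piecesFromComment() {
        return new Piece[] {
            new Piece(0, 1199, 2048, true),
            new Piece(1199, 2377, 4608, true),
        };
    }

    /** Returns the character position for a byte offset, or -1 if no piece covers it. */
    static int fcToCp(Piece[] pieces, int fc) {
        for (Piece p : pieces) {
            if (fc >= p.fcStart && fc < p.fcEnd()) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Piece[] pieces = piecesFromComment();
        // Byte offsets taken from the CHPX/PAPX listing above.
        int[] offsets = {4096, 4122, 4418, 4478, 6494, 11776, 12102};
        for (int fc : offsets) {
            System.out.println("FC " + fc + " -> CP " + fcToCp(pieces, fc));
        }
    }
}
```

A result of -1 marks an offset that no text piece covers, which is one simple
way to flag CHPX or PAPX ranges that extend past the actual character data.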



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

Maxim Valyanskiy <ma...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #3 from Maxim Valyanskiy <ma...@gmail.com> 2010-08-04 08:45:04 EDT ---
Fixed by workaround in r982238



DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075

--- Comment #5 from Maxim Valyanskiy <ma...@gmail.com> 2011-07-11 19:43:03 UTC ---
Sergey, could it be an "autosaved" file? I have seen some strange format
violations in such files.
