Posted to dev@poi.apache.org by bu...@apache.org on 2009/10/28 15:00:03 UTC
DO NOT REPLY [Bug 48075] New: Broken paragraph to text mapping in
some documents
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
Summary: Broken paragraph to text mapping in some documents
Product: POI
Version: 3.5-dev
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
AssignedTo: dev@poi.apache.org
ReportedBy: max.valjanski@gmail.com
WordExtractor.getParagraphText() extracts incomplete and broken text from the
attached document. However, WordExtractor.getTextFromPieces() extracts the
complete, correct text (the same as MS Office shows).
It seems that there is a problem in the paragraph-to-text mapping.
The problem occurs in a few documents from the same source; text extraction
from many other documents works fine.
POI version poi-3.6-beta1-20091002 (svn trunk)
--
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
--- Comment #2 from Maxim Valyanskiy <ma...@gmail.com> 2010-08-04 07:52:38 EDT ---
The paragraph offsets (FC) in the PAPX of this file are 2048 bytes larger than
the offsets of the real character data in the text pieces. Hm.
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
Maxim Valyanskiy <ma...@gmail.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Version|3.5-dev |3.6-dev
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
--- Comment #6 from Sergey Vladimirov <vl...@gmail.com> 2011-07-12 10:40:03 UTC ---
Maxim,
No, it doesn't look quick-saved:
[FIB]
...
.fComplex = false
...
[/FIB]
Although it was quick-saved 15 times, it is currently marked as a fully saved
file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is
no SPRM quick-save data.
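For readers who want to check the same two FIB fields by hand, here is a minimal sketch. It assumes the Word 97 binary layout as I understand it: a 16-bit little-endian flags word at byte offset 0x0A of the FIB, with fComplex at mask 0x0004 and cQuickSaves at mask 0x00F0. The class and method names (FibFlags, readFlags, isComplex, quickSaves) are made up for illustration and are not POI API.

```java
public class FibFlags {
    // Assumed bit masks inside the FIB flags word (Word 97 layout).
    static final int F_COMPLEX = 0x0004;
    static final int QUICK_SAVES_MASK = 0x00F0;

    /** Reads the 16-bit little-endian flags word, assumed to sit at offset 0x0A. */
    static int readFlags(byte[] fib) {
        return (fib[0x0A] & 0xFF) | ((fib[0x0B] & 0xFF) << 8);
    }

    /** True if the fComplex bit is set in the given flags word. */
    static boolean isComplex(int flags) {
        return (flags & F_COMPLEX) != 0;
    }

    /** Value of the 4-bit cQuickSaves counter in the given flags word. */
    static int quickSaves(int flags) {
        return (flags & QUICK_SAVES_MASK) >>> 4;
    }

    public static void main(String[] args) {
        // Hypothetical flags value: quick-save counter at 15, fComplex clear.
        int flags = 0x00F0;
        System.out.println("fComplex = " + isComplex(flags));
        System.out.println("cQuickSaves = " + quickSaves(flags));
    }
}
```

With the hypothetical value 0x00F0 the counter reads 15 while fComplex is clear, which is the combination described above.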
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
--- Comment #1 from Maxim Valyanskiy <ma...@gmail.com> 2009-10-28 07:01:05 UTC ---
Created an attachment (id=24433)
--> (https://issues.apache.org/bugzilla/attachment.cgi?id=24433)
document
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
--- Comment #4 from Sergey Vladimirov <vl...@gmail.com> 2011-07-11 16:58:17 UTC ---
This file seems very wrong to me. Neither OpenOffice nor LibreOffice can even
display it correctly.
In more detail, it has 2 TextPieces:
TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))
but all CHPX refer to the second text piece:
* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)
as well as PAPX:
* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)
so it is just a bad file, AFAIK.
Apart from that, there is a table without a single row or cell, i.e. there is a
PAPX with inTable=true, but no end-of-cell marks.
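The offset arithmetic behind the numbers above can be sketched as follows. This is only a model, assuming each unicode piece stores 2 bytes per character contiguously from its "pos"; PieceOffsets, Piece, samplePieces and cpToFc are illustrative names, not POI classes.

```java
import java.util.Arrays;
import java.util.List;

public class PieceOffsets {
    /** A text piece: characters [cpStart, cpEnd) stored from byte offset fcStart. */
    static final class Piece {
        final int cpStart;
        final int cpEnd;
        final int fcStart;
        final boolean unicode;

        Piece(int cpStart, int cpEnd, int fcStart, boolean unicode) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.unicode = unicode;
        }

        /** Assumption: unicode pieces use 2 bytes per character, others 1. */
        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        /** Byte offset just past the last character of this piece. */
        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar();
        }
    }

    /** The two pieces listed in the comment above. */
    static List<Piece> samplePieces() {
        return Arrays.asList(
                new Piece(0, 1199, 2048, true),
                new Piece(1199, 2377, 4608, true));
    }

    /** Byte offset of character position cp, or -1 if no piece contains it. */
    static int cpToFc(List<Piece> pieces, int cp) {
        for (Piece p : pieces) {
            if (cp >= p.cpStart && cp < p.cpEnd) {
                return p.fcStart + (cp - p.cpStart) * p.bytesPerChar();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Piece> pieces = samplePieces();
        for (int cp : new int[] {1024, 1037, 1185, 2142}) {
            System.out.println("cp " + cp + " -> fc " + cpToFc(pieces, cp));
        }
        for (Piece p : pieces) {
            System.out.println("piece bytes " + p.fcStart + " to " + p.fcEnd());
        }
    }
}
```

Under this model the start offsets come out as 4096, 4122, 4418 and 6494, matching the byte values listed above. The pieces themselves end at bytes 4446 and 6964, which is less than the end offsets 4478, 11776 and 12102 listed for the last PAPX and CHPX entries.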
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
Maxim Valyanskiy <ma...@gmail.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
--- Comment #3 from Maxim Valyanskiy <ma...@gmail.com> 2010-08-04 08:45:04 EDT ---
Fixed by workaround in r982238
DO NOT REPLY [Bug 48075] Broken paragraph to text mapping in some
documents
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=48075
--- Comment #5 from Maxim Valyanskiy <ma...@gmail.com> 2011-07-11 19:43:03 UTC ---
Sergey, could it be an "autosaved" file? I have seen some strange format
violations in such files.