Posted to dev@poi.apache.org by bu...@apache.org on 2012/09/02 16:53:00 UTC
[Bug 53816] New: Extracted word count is incorrect
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
Priority: P2
Bug ID: 53816
Assignee: dev@poi.apache.org
Summary: Extracted word count is incorrect
Severity: normal
Classification: Unclassified
OS: Linux
Reporter: lucene@mikemccandless.com
Hardware: PC
Status: NEW
Version: 3.9-dev
Component: HPSF
Product: POI
Created attachment 29316
--> https://issues.apache.org/bugzilla/attachment.cgi?id=29316&action=edit
Word document showing incorrect PID_WORDCOUNT=11
I have a Word doc (attached) that has 6 words, plus an embedded PDF document
(not sure that's relevant). When I view the word count with Word it correctly
says 6. But when I run org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor
the word count incorrectly says 11:
1 = 1252
PID_TITLE =
PID_SUBJECT =
PID_AUTHOR = IBMer
PID_KEYWORDS =
PID_TEMPLATE = Normal.dot
PID_LASTAUTHOR = IBMer
PID_REVNUMBER = 3
PID_APPNAME = Microsoft Office Word
PID_EDITTIME = Sun Dec 31 19:03:00 EST 1600
PID_CREATE_DTM = Tue Jul 17 07:16:00 EDT 2012
PID_LASTSAVE_DTM = Mon Jul 23 07:21:00 EDT 2012
PID_PAGECOUNT = 1
PID_WORDCOUNT = 11
PID_CHARCOUNT = 55
PID_SECURITY = 0
PID_CODEPAGE = 1252
PID_COMPANY = IBM
PID_LINECOUNT = 1
PID_PARCOUNT = 1
17 = 65
23 = 730895
PID_SCALE = false
PID_LINKSDIRTY = false
19 = false
22 = false
PID_DOCPARTS =
--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
--- Comment #2 from mikemccand <lu...@mikemccandless.com> ---
PID_EDITTIME and PID_CREATE_DTM also seem to be wrong, at least when I compare
this to the Document Properties via Word.
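A possible explanation for the odd PID_EDITTIME above, offered as a sketch rather than a confirmed diagnosis: these properties are stored as FILETIME values (100-nanosecond ticks counted from 1601-01-01 UTC), and the edit time is, as far as I understand the format, an elapsed duration rather than a calendar date. "Sun Dec 31 19:03:00 EST 1600" is 1601-01-01 00:03:00 UTC, i.e. three minutes past the FILETIME epoch, which would be consistent with three minutes of editing. The class name, method names, and the raw tick value below are hypothetical and chosen only for illustration; nothing here calls POI.

```java
import java.time.Duration;
import java.time.Instant;

public class EditTimeSketch {
    // FILETIME counts 100-nanosecond ticks since 1601-01-01T00:00:00Z.
    static final Instant FILETIME_EPOCH = Instant.parse("1601-01-01T00:00:00Z");
    static final long TICKS_PER_SECOND = 10_000_000L;

    // Interpret the ticks as a point in time (what the dump above appears to do).
    static Instant asTimestamp(long ticks) {
        return FILETIME_EPOCH
                .plusSeconds(ticks / TICKS_PER_SECOND)
                .plusNanos((ticks % TICKS_PER_SECOND) * 100);
    }

    // Interpret the same ticks as an elapsed duration (assumed meaning of edit time).
    static Duration asDuration(long ticks) {
        return Duration.ofSeconds(ticks / TICKS_PER_SECOND,
                (ticks % TICKS_PER_SECOND) * 100);
    }

    public static void main(String[] args) {
        // Hypothetical raw value: three minutes expressed in 100 ns ticks.
        long ticks = 3 * 60 * TICKS_PER_SECOND;
        System.out.println(asTimestamp(ticks)); // 1601-01-01T00:03:00Z
        System.out.println(asDuration(ticks));  // PT3M
    }
}
```

If that reading is right, the value itself may be fine and only its display as a date is misleading. It does not explain the PID_CREATE_DTM or word count discrepancies.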
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
Nick Burch <ap...@gagravarr.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |NEEDINFO
--- Comment #5 from Nick Burch <ap...@gagravarr.org> ---
If you load the file in Word and do a Save As, does that fix what POI sees?
How about just opening it in Word and doing a plain Save (no Save As)?
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
--- Comment #1 from mikemccand <lu...@mikemccandless.com> ---
I also have a Word document (unfortunately can't share), which doesn't have an
embedded document, that has 3 pages yet POI shows PID_PAGECOUNT=1.
Are there known cases where the properties will not be extracted correctly?
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
--- Comment #6 from mikemccand <lu...@mikemccandless.com> ---
First I tried "Save As..." to a new file, and then POI reports PID_WORDCOUNT =
13 (still wrong: it should be 6, but curiously it is now wrong "differently",
13 vs. 11 before).
Then I tried "Save" and then POI also reports PID_WORDCOUNT = 13 (still wrong).
I made another change (add space then remove it), saved again, and POI still
says PID_WORDCOUNT = 13.
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
--- Comment #4 from mikemccand <lu...@mikemccandless.com> ---
But what confuses me is when I display the document properties in Word, they
are correct. It's as if POI is somehow pulling from a different (stale) set of
properties stored in the Word doc or something...
[Bug 53816] Extracted word count is incorrect
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816
--- Comment #3 from Nick Burch <ap...@gagravarr.org> ---
POI will give you exactly what is stored in the file, without any changes. If
Word happens to store duff data, there's not a lot we can do about it :(