Posted to dev@poi.apache.org by bu...@apache.org on 2012/09/02 16:53:00 UTC

[Bug 53816] New: Extracted word count is incorrect

https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

          Priority: P2
            Bug ID: 53816
          Assignee: dev@poi.apache.org
           Summary: Extracted word count is incorrect
          Severity: normal
    Classification: Unclassified
                OS: Linux
          Reporter: lucene@mikemccandless.com
          Hardware: PC
            Status: NEW
           Version: 3.9-dev
         Component: HPSF
           Product: POI

Created attachment 29316
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=29316&action=edit
Word document showing incorrect PID_WORDCOUNT=11

I have a Word doc (attached) that has 6 words, plus an embedded PDF document
(not sure that's relevant).  When I view the word count with Word it correctly
says 6.  But when I run org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor
the word count incorrectly says 11:

1 = 1252
PID_TITLE = 
PID_SUBJECT = 
PID_AUTHOR = IBMer
PID_KEYWORDS = 
PID_TEMPLATE = Normal.dot
PID_LASTAUTHOR = IBMer
PID_REVNUMBER = 3
PID_APPNAME = Microsoft Office Word
PID_EDITTIME = Sun Dec 31 19:03:00 EST 1600
PID_CREATE_DTM = Tue Jul 17 07:16:00 EDT 2012
PID_LASTSAVE_DTM = Mon Jul 23 07:21:00 EDT 2012
PID_PAGECOUNT = 1
PID_WORDCOUNT = 11
PID_CHARCOUNT = 55
PID_SECURITY = 0
PID_CODEPAGE = 1252
PID_COMPANY = IBM
PID_LINECOUNT = 1
PID_PARCOUNT = 1
17 = 65
23 = 730895
PID_SCALE = false
PID_LINKSDIRTY = false
19 = false
22 = false
PID_DOCPARTS =

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

--- Comment #2 from mikemccand <lu...@mikemccandless.com> ---
PID_EDITTIME and PID_CREATE_DTM also seem to be wrong, at least when I compare
this to the Document Properties via Word.
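
A possible reading of the PID_EDITTIME value above, offered as an assumption
rather than a confirmed diagnosis: in the OLE2 summary information stream the
edit time is stored as a FILETIME, but it represents a duration (100-nanosecond
units of total editing time), not a calendar date. Printed as a date, it lands
just after the FILETIME epoch, 1601-01-01 00:00 UTC. The sketch below converts
the printed date back into a duration. The class and method names are
hypothetical, and it assumes "EST" in the printed value means a fixed UTC-5
offset.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class EditTimeSketch {
    // FILETIME values count from 1601-01-01T00:00:00Z.
    static final Instant FILETIME_EPOCH =
            LocalDateTime.of(1601, 1, 1, 0, 0, 0).toInstant(ZoneOffset.UTC);

    // Interpret a date that was printed from an edit-time FILETIME
    // as the elapsed time since the FILETIME epoch.
    static Duration editTimeFromDate(LocalDateTime printed, ZoneOffset offset) {
        Instant asInstant = printed.toInstant(offset);
        return Duration.between(FILETIME_EPOCH, asInstant);
    }

    public static void main(String[] args) {
        // "Sun Dec 31 19:03:00 EST 1600", assuming EST is UTC-5.
        LocalDateTime printed = LocalDateTime.of(1600, 12, 31, 19, 3, 0);
        Duration editTime = editTimeFromDate(printed, ZoneOffset.ofHours(-5));
        System.out.println("Edit time in minutes: " + editTime.toMinutes());
    }
}
```

Under those assumptions the printed value corresponds to 3 minutes of editing
time, which would be the figure to compare against Word's total editing time.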



[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Nick Burch <ap...@gagravarr.org> ---
If you load the file in Word and do a Save As, does that fix what POI sees?
How about just opening it in Word and doing a plain Save (no Save As)?



[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

--- Comment #1 from mikemccand <lu...@mikemccandless.com> ---
I also have another Word document (which unfortunately I can't share) with no
embedded document; it has 3 pages, yet POI shows PID_PAGECOUNT=1.

Are there known cases where the properties will not be extracted correctly?



[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

--- Comment #6 from mikemccand <lu...@mikemccandless.com> ---
First I tried "Save As..." to a new file, and then POI reports PID_WORDCOUNT =
13. That is still wrong (it should be 6), but curiously it is now wrong
"differently" (13 vs. 11 before).

Then I tried a plain "Save", and POI again reports PID_WORDCOUNT = 13 (still
wrong). I made another change (added a space, then removed it), saved again,
and POI still says PID_WORDCOUNT = 13.



[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

--- Comment #4 from mikemccand <lu...@mikemccandless.com> ---
But what confuses me is when I display the document properties in Word, they
are correct.  It's as if POI is somehow pulling from a different (stale) set of
properties stored in the Word doc or something...



[Bug 53816] Extracted word count is incorrect

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=53816

--- Comment #3 from Nick Burch <ap...@gagravarr.org> ---
POI will give you exactly what is stored in the file, without any changes. If
Word happens to store duff data, there's not a lot we can do about it :(
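
If the stored PID_WORDCOUNT cannot be trusted, one workaround is to recount
from the document's body text instead of reading the property. The sketch
below is only an illustration: it assumes the body text has already been
extracted by some other means, it splits on whitespace (which will not match
Word's own counting rules in every case), and the class name, method name, and
sample string are all hypothetical.

```java
class WordCountSketch {
    // Count whitespace-separated tokens; null or blank text counts as zero.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Placeholder six-word body, not the text of the attached document.
        String body = "one two three four five six";
        System.out.println("Recounted words: " + countWords(body));
    }
}
```

The recounted value can then be compared with the stored property to flag
files whose summary information looks stale.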
