Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2015/01/19 23:50:37 UTC

[jira] [Comment Edited] (TIKA-1523) metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0

    [ https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283116#comment-14283116 ] 

Uwe Schindler edited comment on TIKA-1523 at 1/19/15 10:50 PM:
---------------------------------------------------------------

Yes. It extracts just the metadata, using the COM interface of the Windows quick view component (you don't even need Word installed for that). So I think this is an issue with this old version of Word.

In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows "1". But here, the metadata as saved in the file is simply "1" or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.

This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file "read only" (which it does because it's downloaded from the internet), it shows "" in the page column. See the 2nd screenshot. So this is clearly a bug in this file, not Tika's or POI's issue.


was (Author: thetaphi):
Yes. It extracts just the metadata. So I think this is an issue with this old version of Word.

In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows "1". But here, the metadata as saved in the file is simply "1" or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.

This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file "read only" (which it does because it's downloaded from the internet), it shows "" in the page column. See the 2nd screenshot. So this is clearly a bug in this file, not Tika's or POI's issue.

> metadata extractor gets the wrong number of pages of some documents Microsoft Word 9.0
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-1523
>                 URL: https://issues.apache.org/jira/browse/TIKA-1523
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.7
>         Environment: Ubuntu
>            Reporter: Yamileydis Veranes
>            Assignee: Konstantin Gribov
>         Attachments: Sigmund Freud.doc, screenshot-1.png, screenshot-2.png
>
>
> When I extract the metadata from a Microsoft Word 9.0 document that has 10 pages, the extractor reports that it has only 1 page.


