Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/02/26 14:33:00 UTC

[jira] [Closed] (TIKA-2589) Wrong page count detection (docx from dotm template)

     [ https://issues.apache.org/jira/browse/TIKA-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-2589.
-----------------------------
    Resolution: Not A Problem

Thank you for opening this issue.

MSWord calculates page counts dynamically and, IMHO, rarely stores the actual page count for a document; rather, it typically stores "1", which is incorrect.  If you rename your file with a .zip extension, unzip it, and look in docProps/app.xml, you'll see:

{noformat}
<Pages>1</Pages><Words>127171</Words><Characters>724878</Characters>
{noformat}

It is beyond the scope of Tika to calculate page counts dynamically, so we rely on whatever value MSWord stored in the document.
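
If you'd rather check the stored value without unzipping by hand, here is a minimal sketch that reads the <Pages> element straight out of docProps/app.xml using only JDK classes. The class name StoredPageCount and its read method are made up for illustration; this is not Tika's own code path, just a way to see what MSWord wrote into the file.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Tika): reports the page count that
// MSWord stored in the document, without recalculating anything.
final class StoredPageCount {

    /** Returns the text of <Pages> in docProps/app.xml, if the entry and element exist. */
    static Optional<String> read(InputStream docx) throws Exception {
        // A .docx file is a zip archive; walk its entries until app.xml is found.
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/app.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                // Refuse DOCTYPE declarations, since the file may be untrusted.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                // The zip stream reports end-of-input at the end of the current entry.
                Document doc = factory.newDocumentBuilder().parse(zip);
                NodeList pages = doc.getElementsByTagName("Pages");
                if (pages.getLength() == 0) {
                    return Optional.empty();
                }
                return Optional.of(pages.item(0).getTextContent().trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StoredPageCount path_to_file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(read(in).orElse("(no <Pages> element stored)"));
        }
    }
}
```

For the attached file this should report the same stored value that Tika surfaces as xmpTPg:NPages, since both come from what is written in app.xml.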

> Wrong page count detection (docx from dotm template)
> ----------------------------------------------------
>
>                 Key: TIKA-2589
>                 URL: https://issues.apache.org/jira/browse/TIKA-2589
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.17
>         Environment: $ java -version
> java version "1.8.0_161"
> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
> OS Version: 6.1.7601 Service Pack 1 Build 7601
>            Reporter: Leonid Korsakov
>            Priority: Major
>         Attachments: 262 страницы.docx
>
>
> I have a docx file created from a dotm template. When I call 
> {code:java}
> java -jar tika-app.jar -m path_to_file
> {code}
> I see xmpTPg:NPages: 1, but the docx file contains 262 pages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)