Posted to user@tika.apache.org by "Ensor, Neal" <En...@osti.gov> on 2010/07/22 17:22:30 UTC

PDF and MS Word Metadata question: page counts

Just a curiosity: I'm currently using Tika 0.7 for some simple text extraction, and noticed that I can't access page counts for either PDF or Word documents.

I know the information is available via the underlying library calls (e.g., PDFBox), and it appears it should be available via the extended information in the MS Office parser, but I don't see it in the metadata of any documents I tried. My question is: is there some reason why page counts are omitted?

I hacked my local copy of PDFParser to provide the count via the PDDocument.getNumberOfPages() call, but was wondering whether I missed something somewhere, or whether there might be a reason not to provide this information. For Word documents, since the parser looks like it should already provide the count but doesn't, I guess I'm out of luck there. For my purposes, though, I'd like at least the parsed PDF metadata to include the page count if possible. Thanks!
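For illustration, here is a minimal sketch of the shape of that kind of change. It is not the actual patch. To stay self-contained it uses stand-ins: the PagedDocument interface is hypothetical and represents only the PDDocument.getNumberOfPages() call, a plain Map represents Tika's Metadata object, and the "Page-Count" key is an assumed name rather than one Tika defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PageCountSketch {

    /** Hypothetical stand-in for the one PDDocument method used here. */
    interface PagedDocument {
        int getNumberOfPages();
    }

    /** Assumed metadata key for illustration; not a name Tika defines. */
    static final String PAGE_COUNT_KEY = "Page-Count";

    /** Copies the document's page count into the metadata map as a string. */
    static void addPageCount(PagedDocument document, Map<String, String> metadata) {
        metadata.put(PAGE_COUNT_KEY, Integer.toString(document.getNumberOfPages()));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Pretend the parsed PDF has 12 pages.
        addPageCount(() -> 12, metadata);
        System.out.println(metadata);
    }
}
```

In a real parser the same step would presumably run while the PDDocument is still open, with the value stored through the Metadata object's set(name, value) method instead of a Map.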

Neal Ensor
ensorn@osti.gov

Re: PDF and MS Word Metadata question: page counts

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jul 22, 2010 at 5:22 PM, Ensor, Neal <En...@osti.gov> wrote:
> I know the information is available via underlying library calls (e.g., PDF box) and
> appears it should be available via extended information in the MS Office parser,
> but I don't see it in the metadata of any documents I tried.  My question is, was
> there some reason why page counts are omitted?

The only reason is that nobody has yet gotten around to adding that
feature to Tika. :-)

> I hacked my local copy of PDFParser to provide such via the
> PDDocument.getNumberOfPages() call,  but was wondering if I missed something
> somewhere or there might be a reason to not provide such information.

It would be great if you wanted to share your changes by posting them
as an improvement request in
https://issues.apache.org/jira/browse/TIKA.

BR,

Jukka Zitting