
Posted to user@tika.apache.org by Thinus Prinsloo <th...@pinkmatter.co.za> on 2012/05/29 15:30:43 UTC

Parse metadata only

Hey all - I hope this is the right place to ask.  Feel free to point me
somewhere else if needed.

I would like to parse only the meta-data of a massive number of PDF files.
I do not want to extract the text (not yet, anyway), only get meta-data
information such as "Creation-Date", etc.  Is it possible for Tika to
provide the meta-data without doing a parse on the whole document (with a
content handler, etc.)?

Thanks!

Regards,

Thinus

Re: Parse metadata only

Posted by Nick Burch <ni...@alfresco.com>.

On Tue, 29 May 2012, Thinus Prinsloo wrote:
> I would like to parse only the meta-data of a massive number of PDF
> files. I do not want to extract the text (not yet, anyway), only get
> meta-data information such as "Creation-Date", etc.  Is it possible for
> Tika to provide the meta-data without doing a parse on the whole
> document (with a content handler, etc.)?

At the moment, that's not possible. Most file formats don't keep all of
their metadata in an entirely separate place, so you end up having to
process almost all of the file anyway. (There has been talk about
implementing this in the past, but that problem has largely meant it
hasn't been tackled.)

If you don't want the text, you can just pass in a content handler that
ignores everything.
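
As a rough sketch of that suggestion (not an official recipe): SAX's
org.xml.sax.helpers.DefaultHandler has no-op implementations of every
callback, so it can serve as a handler that ignores everything. The class
name MetadataOnly and the helper extractMetadata are made up for this
example, and it assumes the Tika parser modules are on the classpath. The
file is still parsed in full; only the text events are thrown away.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the name is not part of Tika.
public class MetadataOnly {

    // Parses the file and returns whatever metadata the parser collected.
    // DefaultHandler silently discards all SAX events, so no text is kept.
    public static Metadata extractMetadata(Path file) throws Exception {
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, new DefaultHandler(), metadata, new ParseContext());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: MetadataOnly <file>");
            return;
        }
        Metadata metadata = extractMetadata(Paths.get(args[0]));
        // Which keys appear (e.g. "Creation-Date") depends on the file
        // format and on the Tika version in use.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

For a large batch of PDFs, the same extractMetadata call could be run once
per file; reusing a single parser instance across files would be a
reasonable refinement.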

Nick