Posted to user@tika.apache.org by Thinus Prinsloo <th...@pinkmatter.co.za> on 2012/05/29 15:30:43 UTC
Parse metadata only
Hey all - I hope this is the right place to ask. Feel free to point me
somewhere else if needed.
I would like to parse only the meta-data of a large number of PDF files.
I do not want to extract the text, not yet anyway, only get meta-data
information such as "Creation-Date", etc. Is it possible for Tika to
provide the meta-data without doing a parse of the whole document (with a
content handler, etc.)?
Thanks!
Regards,
Thinus
Re: Parse metadata only
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 29 May 2012, Thinus Prinsloo wrote:
> I would like to parse only the meta-data of a large number of PDF
> files. I do not want to extract the text, not yet anyway, only get
> meta-data information such as "Creation-Date", etc. Is it possible for
> Tika to provide the meta-data without doing a parse of the whole
> document (with a content handler, etc.)?
At the moment, that's not possible. Most file formats don't keep all their
metadata in an entirely separate place, so you end up having to process
almost all of the file anyway. (There has been talk about implementing
this in the past, but that problem has largely meant it hasn't been
tackled.)
If you don't want the text, you can just pass in a content handler that
ignores everything.
Nick
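For illustration, a minimal sketch of the "content handler that ignores everything" idea from the reply above. Tika's parsers deliver text as SAX events, and the JDK's org.xml.sax.helpers.DefaultHandler implements every callback as a no-op, so it can serve as such a handler. The class name DiscardingHandlerSketch and the helper discardingHandler() are hypothetical. The Tika call appears only as a comment because it needs the Tika jars on the classpath; the runnable part uses the JDK's own SAX parser as a stand-in event source.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class DiscardingHandlerSketch {

    /** DefaultHandler implements every SAX callback as a no-op, so no text is kept. */
    static DefaultHandler discardingHandler() {
        return new DefaultHandler();
    }

    public static void main(String[] args) throws Exception {
        DefaultHandler handler = discardingHandler();

        // With Tika on the classpath, the handler would be handed to the parser
        // roughly like this (assumption: not compiled as part of this sketch):
        //
        //   Metadata metadata = new Metadata();
        //   new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
        //   String created = metadata.get("Creation-Date");
        //
        // The document is still parsed in full; only the text events are dropped.

        // Stand-in for the XHTML events a parser would emit for a document body.
        byte[] xhtml = "<html><body><p>body text</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml), handler);

        System.out.println("parsed; no text retained");
    }
}
```

This does not avoid the cost of reading the file, as the reply points out; it only avoids building up the extracted text in memory.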