Posted to user@tika.apache.org by Samuel Desseaux <sa...@ecp.fr> on 2013/10/23 14:53:31 UTC

extract metadata of pdf files with tika

Hi,

I'm new to Tika and would need some help.

I have many PDF files from which I would like to extract metadata, in order 
to produce an XML file (one that respects Dublin Core).

I've followed these links 
http://www.hascode.com/2012/12/content-detection-metadata-and-content-extraction-with-apache-tika/#Extracting_Metadata_from_a_PDF_using_a_concrete_Parser 
and http://tika.apache.org/0.8/api/org/apache/tika/metadata/Metadata.html

Do I have to write a program with Tika to do it?

How could I do that?

Best regards

Samuel

Re: extract metadata of pdf files with tika

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 23 Oct 2013, Samuel Desseaux wrote:
>> What language are you writing the rest of your solution in? How are you 
>> planning to transform and filter the metadata to get your xml?
>
> With Java, I think.

The simplest way to get started then would be to use the Tika facade helper 
class:
http://tika.apache.org/1.4/api/org/apache/tika/Tika.html

That provides a simple way to call Tika and get back text + metadata. 
Later, you might want to call the parser(s) directly, but the above should 
get you going very quickly.
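
For the transform-and-filter step asked about above, here is a minimal sketch 
using only the JDK. It assumes the metadata has already been copied into a 
Map of name/value pairs (for instance by looping over Metadata.names() and 
Metadata.get(name) after a Tika call), and that Dublin Core entries use a 
"dc:" key prefix. The class name DublinCoreXml and the "metadata" wrapper 
element are hypothetical choices for illustration, not anything Tika defines.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class DublinCoreXml {
    // Namespace of the Dublin Core Metadata Element Set 1.1.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /**
     * Keeps only entries whose key starts with "dc:" (an assumption about
     * how the keys are named) and writes each one as a dc-namespaced element.
     */
    static String toXml(Map<String, String> metadata) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("metadata");      // hypothetical wrapper element
        xml.writeNamespace("dc", DC_NS);        // declare the dc prefix once
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith("dc:")) {
                continue;                       // filter out non-Dublin-Core keys
            }
            xml.writeStartElement("dc", key.substring(3), DC_NS);
            xml.writeCharacters(entry.getValue()); // escapes &, < and >
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample values standing in for what a parser might return.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", "Example report");
        metadata.put("dc:creator", "A. Author");
        metadata.put("Content-Type", "application/pdf");
        System.out.println(toXml(metadata));
    }
}
```

Keys without the "dc:" prefix (Content-Type here) are dropped. If the 
metadata you get back uses different key names, a small lookup table from 
those names to Dublin Core element names would replace the prefix check.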

Nick

Re: extract metadata of pdf files with tika

Posted by Samuel Desseaux <sa...@ecp.fr>.

On 23/10/2013 15:12, Nick Burch wrote:
> On Wed, 23 Oct 2013, Samuel Desseaux wrote:
>> I have many PDF files from which I would like to extract metadata, in 
>> order to produce an XML file (one that respects Dublin Core).
>>
>> Do I have to write a program with Tika to do it?
>
> What language are you writing the rest of your solution in? How are 
> you planning to transform and filter the metadata to get your xml?

With Java, I think.
As for your second question, I have no idea for the moment.

Samuel

Re: extract metadata of pdf files with tika

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 23 Oct 2013, Samuel Desseaux wrote:
> I have many PDF files from which I would like to extract metadata, in order to 
> produce an XML file (one that respects Dublin Core).
>
> Do I have to write a program with Tika to do it?

What language are you writing the rest of your solution in? How are you 
planning to transform and filter the metadata to get your xml?

Nick