Posted to user@tika.apache.org by Samuel Desseaux <sa...@ecp.fr> on 2013/10/23 14:53:31 UTC
extract metadata of pdf files with tika
Hi,
I'm fairly new to Tika and need some help.
I have many PDF files from which I would like to extract metadata, in order
to produce an XML file (which respects Dublin Core).
I've followed these links
http://www.hascode.com/2012/12/content-detection-metadata-and-content-extraction-with-apache-tika/#Extracting_Metadata_from_a_PDF_using_a_concrete_Parser
and http://tika.apache.org/0.8/api/org/apache/tika/metadata/Metadata.html
Do I have to write a program with Tika to do it?
How could I do that?
Best regards
Samuel
Re: extract metadata of pdf files with tika
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Oct 2013, Samuel Desseaux wrote:
>> What language are you writing the rest of your solution in? How are you
>> planning to transform and filter the metadata to get your xml?
>
> With Java, I think.
The simplest way to get started, then, would be to use the Tika facade helper
class:
http://tika.apache.org/1.4/api/org/apache/tika/Tika.html
That provides a simple way to call Tika and get back text + metadata.
Later, you might want to call the parser(s) directly, but the above should
get you going very quickly.
Nick
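Once the facade has returned the metadata, writing it out as XML needs only the JDK. What follows is a minimal sketch under several assumptions, not a method confirmed in this thread: the metadata has already been copied into a Map, the Dublin Core entries use keys starting with "dc:" (such as "dc:title"), and the root element name "metadata" and the class name DublinCoreWriter are hypothetical. The Tika calls are shown only in a comment because they need Tika on the classpath.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: serializes "dc:"-prefixed metadata entries as XML.
final class DublinCoreWriter {
    // Namespace of the Dublin Core element set, version 1.1.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Assumed glue for filling the map from Tika (not compiled here):
    //   Metadata md = new Metadata();
    //   new Tika().parseToString(inputStream, md);
    //   for (String name : md.names()) map.put(name, md.get(name));
    // Note that md.get(name) returns only the first value of a multi-valued field.
    static String toDublinCoreXml(Map<String, String> metadata) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("metadata"); // assumed root element name
            xml.writeNamespace("dc", DC_NS);
            for (Map.Entry<String, String> entry : metadata.entrySet()) {
                String key = entry.getKey();
                String value = entry.getValue();
                // Keep only Dublin Core entries; everything else is filtered out.
                if (key.startsWith("dc:") && key.length() > 3 && value != null) {
                    xml.writeStartElement("dc", key.substring(3), DC_NS);
                    xml.writeCharacters(value); // escapes &, < and > for us
                    xml.writeEndElement();
                }
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write Dublin Core XML", e);
        }
        return out.toString();
    }
}
```

Calling DublinCoreWriter.toDublinCoreXml(map) once per PDF gives one XML string per file; entries without the "dc:" prefix, such as Content-Type, are dropped. The sketch assumes the part of each key after "dc:" is a valid XML element name.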
Re: extract metadata of pdf files with tika
Posted by Samuel Desseaux <sa...@ecp.fr>.
On 23/10/2013 15:12, Nick Burch wrote:
> On Wed, 23 Oct 2013, Samuel Desseaux wrote:
>> I have many PDF files from which I would like to extract metadata, in
>> order to produce an XML file (which respects Dublin Core).
>>
>> Do I have to write a program with Tika to do it?
>
> What language are you writing the rest of your solution in? How are
> you planning to transform and filter the metadata to get your xml?
With Java, I think.
As for your second question, I have no idea for the moment.
Samuel
Re: extract metadata of pdf files with tika
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Oct 2013, Samuel Desseaux wrote:
> I have many PDF files from which I would like to extract metadata, in order to
> produce an XML file (which respects Dublin Core).
>
> Do I have to write a program with Tika to do it?
What language are you writing the rest of your solution in? How are you
planning to transform and filter the metadata to get your xml?
Nick