Posted to user@tika.apache.org by Can Duruk <ca...@duruk.net> on 2014/10/09 02:59:08 UTC

Customizing Metadata Keys

Hi all,

My question is regarding setting the metadata keys coming from the parsers
to my own keys.

For my application, I am using Tika to extract the metadata for a bunch of
files. I am using the embedded HTTP server which I modified for my needs to
return JSON instead of CSV. (Hoping to submit that as a patch soon.)

However, the keys in the JSON are all in different formats and I need them
to conform to my own requirements.

For example, in this redacted sample, this is what I get:

{
  "meta:author": "Maxim Valyanskiy",
}


However, what I need is this:

{
  "my_author_key": "Maxim Valyanskiy",
}

I have a bunch (several dozen) of these modifications I need to make to
the metadata keys in various places.

What is the best way to approach this problem? I've thought about extending
each of the parsers to do this, but that seems a bit too decentralized.
Ideally it'd be something I can manage in a single file.

Thanks a lot in advance.
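
The "single file" wish above could be met with an ordinary Java properties file that lists one "Tika key = my key" pair per line. The following is only a minimal sketch under that assumption: the class name KeyMappingFile and its methods are hypothetical, not part of Tika, and the file content is inlined as a string so the example runs on its own. Note that a colon inside a key has to be escaped in properties syntax.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: keeps all key renames in one properties-format file. */
class KeyMappingFile {

    /** Loads "tikaKey = myKey" pairs from a properties-format source. */
    static Properties load(Reader source) {
        Properties mapping = new Properties();
        try {
            mapping.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Returns the replacement key, or the original key if none is configured. */
    static String rename(Properties mapping, String tikaKey) {
        return mapping.getProperty(tikaKey, tikaKey);
    }

    public static void main(String[] args) {
        // Stand-in for the mapping file; in practice this would be a Reader
        // over a real file. The ':' in "meta:author" is escaped because ':'
        // otherwise ends the key in properties syntax.
        String fileContent = "meta\\:author = my_author_key\n";
        Properties mapping = load(new StringReader(fileContent));
        System.out.println(rename(mapping, "meta:author"));
    }
}
```

Keys that have no entry in the file pass through unchanged, so the file only needs to list the keys that actually have to be renamed.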

Re: Customizing Metadata Keys

Posted by Can Duruk <ca...@duruk.net>.
> I agree with Nick’s recommendation on post-parsing key mapping, and I’d
> like to put in a plug for the RecursiveParserWrapper, which may be of use
> for you.  I’ve been intending to add that to the app commandline and to
> server…how are you handling embedded document metadata?  Would the wrapper
> be of any use or do you not have any embedded docs in your doc set?

I haven't needed recursive parsing yet (currently extracting mostly photos
and videos) but I'm sure it will happen at some point. It's not urgent but
I am casting my vote to have it in at some point :)

> I’ve also been meaning to dump counts of metadata keys from the govdocs1
> corpus, would that be of any use, or do you already know the keys that you
> care about?

I have a list of keys that I *need* to support but I can definitely use the
dump for further analysis, informing future direction. Again, no urgency
though.


RE: Customizing Metadata Keys

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to put in a plug for the RecursiveParserWrapper, which may be of use for you.  I’ve been intending to add that to the app commandline and to server…how are you handling embedded document metadata?  Would the wrapper be of any use or do you not have any embedded docs in your doc set?

I’ve also been meaning to dump counts of metadata keys from the govdocs1 corpus, would that be of any use, or do you already know the keys that you care about?

Cheers,

         Tim

Re: Customizing Metadata Keys

Posted by Can Duruk <ca...@duruk.net>.
> I'd suggest you do the mapping from Tika keys to your keys in the server.
> All the parsers should return consistent keys, so the "output" side is the
> best place to map.

That seems to be the now-obvious solution, thanks for the suggestion.

> Perhaps a re-mapping downstream ContentHandler
> that takes in the Metadata object and will reformat
> the <meta name=.. section of the XHTML?

I've tried to find a way to add a step late in the pipeline, but I'm not
super familiar with the Tika codebase, so I got lost a bit. Any pointers
(examples / tutorials) you could guide me towards? Chapters in the Tika
book? I want to explore this if the server idea doesn't pan out.


Re: Customizing Metadata Keys

Posted by Chris Mattmann <ch...@gmail.com>.
Perhaps a re-mapping downstream ContentHandler
that takes in the Metadata object and will reformat
the <meta name=.. section of the XHTML?


------------------------
Chris Mattmann
chris.mattmann@gmail.com
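
The downstream-handler idea above can be sketched with plain SAX classes from the JDK. This is only an illustration, not Tika code: MetaNameRemapper is a hypothetical name, and it assumes the metadata shows up as <meta name="..." content="..."/> elements in the XHTML event stream. The filter rewrites the name attribute when a mapping exists and forwards every other event untouched.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter that renames the "name" attribute of meta elements. */
class MetaNameRemapper extends XMLFilterImpl {
    private final Map<String, String> keyMap;

    MetaNameRemapper(ContentHandler downstream, Map<String, String> keyMap) {
        setContentHandler(downstream);
        this.keyMap = keyMap;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("meta".equals(localName) || "meta".equals(qName)) {
            int i = atts.getIndex("name");
            String mapped = i < 0 ? null : keyMap.get(atts.getValue(i));
            if (mapped != null) {
                // Copy before modifying: the incoming Attributes may be reused by the caller.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(i, mapped);
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    public static void main(String[] args) throws SAXException {
        // Downstream handler that just prints what it receives.
        ContentHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println(qName + " name=" + atts.getValue("name"));
            }
        };
        MetaNameRemapper remapper =
                new MetaNameRemapper(printer, Map.of("meta:author", "my_author_key"));

        // Simulate the single meta element from the example in the question.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", "meta:author");
        atts.addAttribute("", "content", "content", "CDATA", "Maxim Valyanskiy");
        remapper.startElement("http://www.w3.org/1999/xhtml", "meta", "meta", atts);
    }
}
```

With Tika, such a filter would presumably be handed to the parser as its ContentHandler, wrapping whichever handler serializes the XHTML. It only changes the XHTML output, so a JSON response built directly from the Metadata object would still need the mapping applied on the output side.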







Re: Customizing Metadata Keys

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 8 Oct 2014, Can Duruk wrote:
> My question is regarding setting the metadata keys coming from the parsers
> to my own keys.
>
> For my application, I am using Tika to extract the metadata for a bunch of
> files. I am using the embedded HTTP server which I modified for my needs to
> return JSON instead of CSV. (Hoping to submit that as a patch soon)
>
> However, the keys in the JSON are all in different formats and I need them
> to conform to my own requirements.

I'd suggest you do the mapping from Tika keys to your keys in the server. 
All the parsers should return consistent keys, so the "output" side is the 
best place to map. Trying to do it in each parser would be much more work. 
Just put the mapping in between where you call the parser and where you
output.

Nick
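
A minimal sketch of that output-side mapping, under some assumptions: the extracted metadata is represented here as a plain Map so the example runs without Tika on the classpath (in the server it would be filled from the Metadata object, for instance via its names() and get(name) methods), and MetadataKeyMapper and its table are hypothetical names. Only the "meta:author" to "my_author_key" pair comes from the original question; the Content-Type value is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: renames extracted metadata keys using one lookup table. */
class MetadataKeyMapper {

    // Single place to manage the renames: Tika key -> application key.
    static final Map<String, String> KEY_MAP = Map.of(
            "meta:author", "my_author_key");

    /**
     * Returns a copy of the extracted metadata with mapped keys renamed.
     * Keys without a mapping are passed through unchanged.
     */
    static Map<String, String> remap(Map<String, String> extracted) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            out.put(KEY_MAP.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for values copied out of the parser's metadata after parsing.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("meta:author", "Maxim Valyanskiy");
        extracted.put("Content-Type", "image/jpeg"); // illustrative value
        System.out.println(remap(extracted));
    }
}
```

Because the renaming happens after parsing and before serialization, none of the parsers need to be touched, and the JSON writer only ever sees the application's own keys.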