Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/05/28 19:32:10 UTC

[DISCUSS] Centralizing JSON handling of Metadata

All,

  Nick recommended I put the question to the dev list for discussion.  It might be useful to centralize our JSON handling of Metadata.  We are currently using different libraries and doing different things in CLI and in tika-server.

 1) Do we want to centralize JSON handling of Metadata?

 2) If so, where?  Core?  I share Nick's hesitance to add a dependency to core.  OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications.

 3) Wherever it goes, what package do we want to put it in?  I like Nick's recommendations, with a slight preference for the second (oat.utils.json).

Thank you!

          Best,

                  Tim

-----Original Message-----
From: Nick Burch (JIRA) [mailto:jira@apache.org] 
Sent: Wednesday, May 28, 2014 12:41 PM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata


    [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011287#comment-14011287 ] 

Nick Burch commented on TIKA-1311:
----------------------------------

If we put it into core, we'd need to add another dependency (to GSON) which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to try to have a very minimal set of deps, unlike the other modules)

Package wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json

> Centralize JSON handling of Metadata
> ------------------------------------
>
>                 Key: TIKA-1311
>                 URL: https://issues.apache.org/jira/browse/TIKA-1311
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> When JSON was initially added to Tika CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core.  On a recent bug fix (TIKA-1291), the same recommendation was repeated, noting in particular that we now handle JSON/Metadata differently in CLI and server.
> Let's centralize JSON handling in core and use GSON.  We should add a serializer and a deserializer so that users don't have to reinvent that wheel.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
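
For readers who want a concrete picture of the serializer half of the proposal, here is a minimal sketch. It is not the Tika or GSON implementation: it uses only the JDK, a plain Map<String, List<String>> stands in for Tika's Metadata, and the class name MetadataJsonSketch and both method names are hypothetical. The choice to write single-valued keys as strings and multi-valued keys as arrays is also an assumption, made only to show one decision a shared serializer would settle for both CLI and server.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a Map<String, List<String>> plays the role of Metadata.
public class MetadataJsonSketch {

    // Wrap a value in double quotes and escape the characters that JSON
    // does not allow unescaped inside a string literal.
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Assumed convention: one value -> JSON string, anything else -> JSON array.
    public static String toJson(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstKey = true;
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            if (!firstKey) {
                sb.append(',');
            }
            firstKey = false;
            sb.append(quote(e.getKey())).append(':');
            List<String> values = e.getValue();
            if (values.size() == 1) {
                sb.append(quote(values.get(0)));
            } else {
                sb.append('[');
                for (int i = 0; i < values.size(); i++) {
                    if (i > 0) {
                        sb.append(',');
                    }
                    sb.append(quote(values.get(i)));
                }
                sb.append(']');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", Arrays.asList("application/pdf"));
        metadata.put("dc:creator", Arrays.asList("A. Author", "B. Author"));
        System.out.println(toJson(metadata));
    }
}
```

A real implementation would more likely delegate escaping and parsing to GSON or Jackson, and would need the matching deserializer; the point here is only that one shared class fixes the output shape in one place.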

RE: [DISCUSS] Centralizing JSON handling of Metadata

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 28 May 2014, Ray Gauss II wrote:
> However, that sort of modularization is probably a broader discussion 
> than what we need for this particular issue, so between those two I’d 
> vote for tika-serialization.

Tika-CLI and Tika-Server will likely want to depend on all of the 
serialisation methods. So, I'd suggest we go for a single component for 
now, tika-seriali{s,z}ation seems good to me. Later on, we can always 
split that into something like:

   tika-serialisation
     (no code)
     depends on:
       tika-serialisation-json
       tika-serialisation-mongo
       tika-serialisation-blah

If and when there's a strong enough use case for the splitting!

Nick

RE: [DISCUSS] Centralizing JSON handling of Metadata

Posted by Ray Gauss II <ra...@alfresco.com>.
I’ve used Jackson a bit but I don’t have a strong preference either.

I’m generally a fan of splitting things up into very small projects to keep the dependency hierarchy as clean as possible.  In this example, if we decided to do a direct serialization to, say, a Mongo DBObject in the future the json project wouldn’t need to bring in Mongo dependencies.  Apache Camel does a good job of segmenting things [1].

However, that sort of modularization is probably a broader discussion than what we need for this particular issue, so between those two I’d vote for tika-serialization.

Regards,

Ray


[1] https://git-wip-us.apache.org/repos/asf?p=camel.git;a=tree;f=components;h=1132bd1bb98a446aec97d5c7bc4d032276a65d83;hb=HEAD
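
The dependency argument above can be illustrated with a small sketch. Everything in it is hypothetical (the interface MetadataSerializer, the placeholder PlainTextSerializer, the find helper), and a Map<String, List<String>> again stands in for Metadata. The idea it tries to show is that callers depend only on a format-neutral contract, while each format's implementation, and any third-party library it needs, could live in its own module.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SerializationModulesSketch {

    // Format-neutral contract. In a multi-module layout this would sit in a
    // module with no third-party dependencies.
    public interface MetadataSerializer {
        String formatName();
        String serialize(Map<String, List<String>> metadata);
    }

    // Placeholder format standing in for a JSON or Mongo implementation.
    // Each real implementation would bring its own dependencies in its own module.
    public static class PlainTextSerializer implements MetadataSerializer {
        @Override
        public String formatName() {
            return "text";
        }

        @Override
        public String serialize(Map<String, List<String>> metadata) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
                for (String value : e.getValue()) {
                    sb.append(e.getKey()).append(": ").append(value).append('\n');
                }
            }
            return sb.toString();
        }
    }

    // What an application such as a CLI might do: pick a serializer by name
    // from whichever implementations are on its classpath.
    public static MetadataSerializer find(List<MetadataSerializer> available, String formatName) {
        for (MetadataSerializer s : available) {
            if (s.formatName().equals(formatName)) {
                return s;
            }
        }
        throw new IllegalArgumentException("No serializer for format: " + formatName);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("dc:creator", Arrays.asList("A. Author", "B. Author"));
        List<MetadataSerializer> available =
                Arrays.<MetadataSerializer>asList(new PlainTextSerializer());
        System.out.print(find(available, "text").serialize(metadata));
    }
}
```

With a single tika-serialization module, the interface and all implementations would simply share one artifact; the same contract would still allow a later split per format.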




RE: [DISCUSS] Centralizing JSON handling of Metadata

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Ray! 

In almost reverse order, I've been using Jackson for this already, but I used GSON in TIKA-1291 because that's what CLI was already using.  In GSON's favor, the jar is a bit smaller, but I have no real preference or reason to pick one over the other.  I'm not a json-blackbelt (or, I guess that would be blckbelt), so I'm happy to go with either.

A new compilation unit makes sense, though I'm wondering whether we want to be that specific. Perhaps tika-serialization? Or maybe just tika-utils?

Package name looks good to me. 

Thanks, again!

Best,

        Tim



Re: [DISCUSS] Centralizing JSON handling of Metadata

Posted by Ray Gauss II <ra...@alfresco.com>.
Hi Tim,

1) Sounds good to me.

2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp.  Perhaps something like tika-serialization-json, to indicate that other formats may follow the same pattern?

3) Similar to above, perhaps org.apache.tika.metadata.serialization.json?

Just curious, any particular reason for GSON over Jackson?

Regards,

Ray

