Posted to dev@streams.apache.org by Steve Blackmon <sb...@apache.org> on 2016/04/21 19:13:12 UTC

Source and Resource generation from jsonschemas

tl;dr We should build a suite of maven plugins to generate new categories of source and resource artifacts. For starters, we need our own jsonschema-to-Java-POJO plugin.

For a while I’ve been working on stories to add the ability to generate new types of sources and resources from jsonschemas, including the activity streams schemas maintained by the project.

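STREAMS-389 (Support generation of scala source from jsonschemas): https://issues.apache.org/jira/browse/STREAMS-389

STREAMS-398 (Support generation of hive table definitions from jsonschema): https://issues.apache.org/jira/browse/STREAMS-398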

I've gotten pretty deep into this and believe strongly at this point that diversifying the types of artifacts our project can generate from schemas will add a powerful and valuable set of use cases.  There’s a lot of work being done in spark and flink to enable, simplify, and optimize working with data when quality POJOs and scala case classes are available on the class path.

There are a number of other popular big data technologies where having an explicit definition of object structure makes working with data easier (hadoop, pig, elasticsearch, kafka, just to name a few).  Making it simple to generate those artifacts using CLIs or maven plugins, whether from in-house schemas, from schemas mixed in from streams providers and processors, or from schemas linked externally on the web, could be the killer app streams has been missing.

To really pursue this it makes sense that we would build up core utilities for resolving and managing the object types defined and referenced across groups of schemas and external dependencies.  To date we've relied entirely on org.jsonschema2pojo:jsonschema2pojo-core and org.jsonschema2pojo:jsonschema2pojo-maven-plugin to handle this conversion of schemas to POJOs.  I think we need to bring that core capability in-house to have full control of its behavior and output.
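
As a rough illustration of what "resolving object types across groups of schemas" might involve, here is a minimal sketch in Java. It assumes a heavily simplified schema model (an optional parent id standing in for "extends", plus a map of property name to jsonschema type). The class SchemaResolver and its methods are hypothetical names invented for this example; they are not part of streams or jsonschema2pojo, and a real utility would also need to parse JSON, follow $ref links, and detect cycles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; SchemaResolver and its methods are illustrative names,
// not part of Apache Streams or jsonschema2pojo.
public class SchemaResolver {

    // Simplified stand-in for a parsed jsonschema: an optional parent id
    // (the "extends" target) and a map of property name -> jsonschema type.
    static final class Schema {
        final String parentId;
        final Map<String, String> properties;

        Schema(String parentId, Map<String, String> properties) {
            this.parentId = parentId;
            this.properties = properties;
        }
    }

    private final Map<String, Schema> registry = new LinkedHashMap<>();

    // Registers a schema; properties are given as name, type, name, type, ...
    public void register(String id, String parentId, String... nameTypePairs) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameTypePairs.length; i += 2) {
            props.put(nameTypePairs[i], nameTypePairs[i + 1]);
        }
        registry.put(id, new Schema(parentId, props));
    }

    // Walks the "extends" chain from the root ancestor down, so a child's
    // property definition replaces the parent's. No cycle detection here.
    public Map<String, String> resolve(String id) {
        Schema schema = registry.get(id);
        if (schema == null) {
            throw new IllegalArgumentException("Unknown schema: " + id);
        }
        Map<String, String> resolved;
        if (schema.parentId == null) {
            resolved = new LinkedHashMap<>();
        } else {
            resolved = resolve(schema.parentId);
        }
        resolved.putAll(schema.properties);
        return resolved;
    }

    public static void main(String[] args) {
        SchemaResolver resolver = new SchemaResolver();
        resolver.register("object.json", null, "id", "string", "published", "string");
        resolver.register("activity.json", "object.json", "verb", "string", "published", "integer");
        // The child inherits "id" and overrides the type of "published".
        System.out.println(resolver.resolve("activity.json"));
    }
}
```

The point of the sketch is only that once the full property set of each type is resolved in one place, any number of generators (POJOs, case classes, table definitions) could consume the same result.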

Questions for the list:

Does this challenge resonate with you / your organization?

Do you have any concern about shifting project attention toward plugins and tools for data definition?

Are you comfortable / uncomfortable with seeing the core streams POJOs used throughout our providers and processors change as part of this effort?

Steve Blackmon

sblackmon@apache.org

Re: Source and Resource generation from jsonschemas

Posted by Steve Blackmon <sb...@apache.org>.
Hey Ryan,

All of the projects mentioned in that thread are for serializing / deserializing JSON to/from case classes that you’ve already built by hand, or for accessing JSON directly without spec’ing out case classes at all. 

I’m proposing a maven plugin that inspects all of the jsonschemas in a module and whatever schemas they extend, and generates traits and case classes into which JSON can be loaded / unloaded.  These classes would be natively compatible with spark sql, play, and other frameworks that are optimized for operating on instances of case classes.
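
To make the idea concrete, here is a minimal sketch, written in Java since a maven plugin would most likely be, of rendering a scala case class from an already-resolved set of schema properties. CaseClassGenerator, its render method, and the type mapping (for example integer to Long) are assumptions made up for this example, not the behavior of any existing plugin. It ignores nested objects, arrays, "required", and escaping of scala reserved words.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch; CaseClassGenerator is an illustrative name and the
// type mapping is an assumption, not the behavior of any existing plugin.
public class CaseClassGenerator {

    // Maps a jsonschema primitive type to a Scala type, defaulting to Any.
    static String scalaType(String jsonType) {
        switch (jsonType) {
            case "string":  return "String";
            case "integer": return "Long";
            case "number":  return "Double";
            case "boolean": return "Boolean";
            default:        return "Any";
        }
    }

    // Renders one case class. Every field is wrapped in Option because this
    // sketch treats all properties as optional (it does not read "required").
    public static String render(String className, String traitName,
                                Map<String, String> properties) {
        StringJoiner fields = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> p : properties.entrySet()) {
            fields.add(p.getKey() + ": Option[" + scalaType(p.getValue()) + "] = None");
        }
        String ext = traitName == null ? "" : " extends " + traitName;
        return "case class " + className + fields + ext;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("id", "string");
        props.put("verb", "string");
        // ActivityLike is a hypothetical trait the generator would also emit.
        System.out.println(render("Activity", "ActivityLike", props));
    }
}
```

A real plugin would write the rendered source into a generated-sources directory during the build so that it compiles alongside hand-written code.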

Also, we’d be able to generate org.apache.streams.scala.json as a complement to the existing org.apache.streams.pojo.json off the activity streams POJOs and use them to work with activity streams data in those frameworks, without the compute/memory overhead and code ugliness of constantly converting between scala primitives/arrays/maps and java primitives/arrays/maps.

If you run across any Apache-licensed libraries out there that tackle these problems, I’d love to have a look at them.

Steve Blackmon

sblackmon@apache.org

On Mon, Apr 25, 2016 at 11:29 AM Ryan Ebanks <ry...@gmail.com> wrote:

> I think being able to generate case classes from json schema is valuable.
> However there are already projects that attempt to do this.  See this stack
> overflow question/answer.
> http://stackoverflow.com/questions/23531065/scala-parse-json-directly-into-a-case-class
>
> What will streams do that will be better/different than these projects?


Re: Source and Resource generation from jsonschemas

Posted by Ryan Ebanks <ry...@gmail.com>.
I think being able to generate case classes from json schema is valuable.
However there are already projects that attempt to do this.  See this stack
overflow question/answer.
http://stackoverflow.com/questions/23531065/scala-parse-json-directly-into-a-case-class

What will streams do that will be better/different than these projects?

On Thu, Apr 21, 2016 at 12:13 PM, Steve Blackmon <sb...@apache.org>
wrote:

> tl;dr We should build a suite of maven plugins to generate new categories
> of source and resource artifacts. For starters, we need our own
> jsonschema-to-Java-POJO plugin.
>
> For a while I’ve been working on stories to add the ability to generate
> new types of sources and resources from jsonschemas, including the activity
> streams schemas maintained by the project.
>
>
>    1. [image: New Feature] STREAMS-389
>    Support generation of scala source from jsonschemas
>    <https://issues.apache.org/jira/browse/STREAMS-389>
>
>
>    1. [image: New Feature] STREAMS-398
>    Support generation of hive table definitions from jsonschema
>    <https://issues.apache.org/jira/browse/STREAMS-398>
>
>
> I've gotten pretty deep into this and believe strongly at this point that
> diversifying the type of artifacts our project can generate off schemas
> will add a powerful and valuable set of use cases.  There’s a lot of
> work being done in spark and flink to enable, simplify, and optimize
> working with data when quality POJOs and scala case classes are available
> on the class path.
>
> There are a series of other popular big data technologies where having an
> explicit definition of object structure makes working with data easier
> (hadoop, pig, elasticsearch, kafka, just to name a few).  Making it simple
> to generate those artifacts using CLIs or maven plugins off in-house
> schemas, mixing in schemas from streams providers and processors, or linked
> externally on the web could be the killer app streams has been missing.
>
> To really pursue this it makes sense that we would build up core utilities
> for resolving and managing the object types defined and referenced across
> groups of schemas and external dependencies.  To date we've relied entirely
> on org.jsonschema2pojo:jsonschema2pojo-core and
> org.jsonschema2pojo:jsonschema2pojo-maven-plugin to handle this conversion of
> schemas to POJOs.  I think we need to bring that core capability in-house
> to have full control of its behavior and output.
>
> Questions for the list:
> Does this challenge resonate with you / your organization?
> Do you have any concern about shifting project attention toward plugins
> and tools for data definition?
> Are you comfortable / uncomfortable with seeing the core streams POJOs
> used throughout our providers and processors change as part of this effort?
>
> Steve Blackmon
> sblackmon@apache.org
>