Posted to dev@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2024/03/02 15:18:00 UTC

[DISCUSS] New schema repository idea (with proof of concept)

I've had this project on the back burner for a while and wanted to share it
with the team. It's a schema repository implementation that is designed to
take a JAR file with POJOs and use Jackson's schema generation API to
generate Avro schemas from those on startup. It also uses (via Jackson)
Avro annotations to help specify particular implementation details where
necessary. The code can be found at the link below. I haven't worked on it
lately, but it should easily run on NiFi 1.25:

https://github.com/MikeThomsen/nifi-pojo-schema-repository-bundle

I am planning to get the repo ready for a PR unless someone raises reasons
why including it might be a poor fit. I think for a lot of teams this might
be a killer feature because it would allow them to use Avro with existing
enterprise POJOs without having to write the schemas by hand.

Thoughts?

Re: [DISCUSS] New schema repository idea (with proof of concept)

Posted by Mark Payne <ma...@hotmail.com>.
Thanks for the color, Mike. I agree with David on this one. The solution you propose sounds quite complicated to use, and largely outside the scope of the vast majority of NiFi users. While there are certainly Java developers using NiFi, a large majority of the user base are in data engineering roles and are not likely to have POJOs. So while I can appreciate the desire to contribute something like this, I feel like the burden of maintaining it would outweigh the benefit.

I think what would make more sense for this situation would be to have some sort of utility outside of NiFi that could take a POJO and generate Avro schemas. Those schemas could then be imported directly into NiFi via the existing Avro Schema Registry or another schema registry that NiFi integrates with, if that makes sense for the user.
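
As a rough illustration of the kind of standalone utility described above, the following Java sketch walks a POJO's fields with plain reflection and prints an Avro record schema as JSON. It is a stand-in built on assumptions, not the approach from the proof-of-concept repository: that code uses Jackson's schema generation API, whereas this version uses only the JDK, handles only String and a few primitive types, and ignores nesting, nullability, and Avro annotations. The names PojoToAvroSketch and Customer are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PojoToAvroSketch {

    // Hypothetical POJO standing in for an existing model class.
    static class Customer {
        private String name;
        private int age;
        private boolean active;
    }

    // The only Java types this sketch maps to Avro primitive type names.
    private static final Map<Class<?>, String> AVRO_TYPES = Map.of(
            String.class, "string",
            int.class, "int",
            long.class, "long",
            float.class, "float",
            double.class, "double",
            boolean.class, "boolean");

    // Builds an Avro record schema (as a JSON string) from the declared fields.
    static String toAvroSchema(Class<?> pojo) {
        List<String> fields = new ArrayList<>();
        for (Field field : pojo.getDeclaredFields()) {
            // Skip static and compiler-generated fields; they are not part of the data model.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            String avroType = AVRO_TYPES.get(field.getType());
            if (avroType == null) {
                throw new IllegalArgumentException(
                        "Unsupported field type in this sketch: " + field.getType().getName());
            }
            fields.add("{\"name\":\"" + field.getName() + "\",\"type\":\"" + avroType + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + pojo.getSimpleName()
                + "\",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        // Print the schema so it can be saved and imported into a schema registry.
        System.out.println(toAvroSchema(Customer.class));
    }
}
```

The unsupported-type exception is deliberate: it surfaces the edge cases in class definitions early instead of silently emitting an incomplete schema. A real utility would more likely delegate the mapping to an existing library than hand-roll it like this.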

Thanks
-Mark


> On Mar 7, 2024, at 5:48 PM, Mike Thomsen <mi...@gmail.com> wrote:
> 
> Sure. The JAR is not actually scanned. You have to use dynamic properties
> to map a schema name to a particular fully qualified class. I would assume
> that most people using it are just taking a JAR that was generated or
> written by hand to be a client library with only (or mainly) POJOs
> representing the model. You can see a simple example here:
> 
> https://github.com/MikeThomsen/nifi-pojo-schema-repository-bundle/tree/main/test-pojos/src/main/java/org/apache/nifi/pojo/complex
> 
> Nothing fancy per se. It's just POJOs with some standard annotations from
> the Avro lib.
> 
> I would imagine this repo would be a big help for teams that can't/won't
> commit to a contract first design but have to work with NiFi and other big
> data systems.
> 
> On Thu, Mar 7, 2024 at 5:39 PM David Handermann <ex...@apache.org>
> wrote:
> 
>> Mike,
>> 
>> Thanks for the reply. I agree that file and property-based registries
>> are useful, so the main question seems to be a compiled-code-derived
>> registry as you have described.
>> 
>> It seems that the general use case could still be supported through
>> file-backed registry, but without requiring the dynamic class loading
>> associated with a custom JAR.
>> 
>> Loading code from a JAR also presents greater security risks than
>> loading schema files, so if this were to be supported, it would
>> require additional permission restrictions.
>> 
>> To help think through this a bit more, can you describe the use case a
>> bit more? How would someone prepare a JAR for referencing in this
>> proposed registry?
>> 
>> Regards,
>> David Handermann
>> 
>> On Thu, Mar 7, 2024 at 4:30 PM Mike Thomsen <mi...@gmail.com>
>> wrote:
>>> 
>>> You raise some good points, but I think there's still ample room for
>>> file-based schema registries within NiFi. With regard to the edge cases
>>> with schema generation, I think an argument can also be made for "not
>>> letting the perfect be the enemy of the good."
>>> 
>>> On Wed, Mar 6, 2024 at 9:34 AM David Handermann <exceptionfactory@apache.org>
>>> wrote:
>>> 
>>>> Mike,
>>>> 
>>>> Thanks for raising this question, and providing the example repository.
>>>> 
>>>> Although it sounds like a POJO-based repository could be useful in
>>>> some cases, it does not seem like something that should be included
>>>> for community support.
>>>> 
>>>> Part of the value of a Schema Registry is a shared location for data
>>>> description. Although supporting property or file-based Schema
>>>> Registries is useful in NiFi itself, the general pattern is
>>>> externalized storage and maintenance of schema definitions.
>>>> 
>>>> From another angle, this could be similar to code-first versus
>>>> contract-first API development. Each approach has its positives and
>>>> negatives. When it comes to a Schema Registry, however, it seems like
>>>> the contract needs to be defined outside of code.
>>>> 
>>>> Introspecting JAR files also raises questions about what to include or
>>>> exclude, and how to handle edge cases for certain class definitions.
>>>> This seems like the more significant problem. For this reason, it
>>>> seems better to rely on external operations to produce Avro schema
>>>> definitions, rather than supporting that in NiFi itself.
>>>> 
>>>> Those are my initial thoughts, perhaps others can provide additional
>>>> perspective.
>>>> 
>>>> Regards,
>>>> David Handermann
>>>> 
>>>> On Sat, Mar 2, 2024 at 9:18 AM Mike Thomsen <mi...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> I've had this project on the back burner for a while and wanted to share it
>>>>> with the team. It's a schema repository implementation that is designed to
>>>>> take a JAR file with POJOs and use Jackson's schema generation API to
>>>>> generate Avro schemas from those on startup. It also uses (via Jackson)
>>>>> Avro annotations to help specify particular implementation details where
>>>>> necessary. The code can be found here. Haven't worked on it lately, but it
>>>>> should easily run on 1.25:
>>>>> 
>>>>> https://github.com/MikeThomsen/nifi-pojo-schema-repository-bundle
>>>>> 
>>>>> I am planning to get the repo ready for a PR unless someone raises reasons
>>>>> why including it might be a poor fit. I think for a lot of teams this might
>>>>> be a killer feature because it would allow them to use Avro with existing
>>>>> enterprise POJOs and stuff like that without having to write them by hand.
>>>>> 
>>>>> Thoughts?
>>>> 
>> 


Re: [DISCUSS] New schema repository idea (with proof of concept)

Posted by Mike Thomsen <mi...@gmail.com>.
Sure. The JAR is not actually scanned. You have to use dynamic properties
to map a schema name to a particular fully qualified class. I would assume
that most people using it are just taking a JAR that was generated or
written by hand to be a client library with only (or mainly) POJOs
representing the model. You can see a simple example here:

https://github.com/MikeThomsen/nifi-pojo-schema-repository-bundle/tree/main/test-pojos/src/main/java/org/apache/nifi/pojo/complex

Nothing fancy per se. It's just POJOs with some standard annotations from
the Avro lib.

I would imagine this repo would be a big help for teams that can't or won't
commit to a contract-first design but still have to work with NiFi and other
big data systems.
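
The mapping described above, a schema name pointing at a fully qualified class name, might be resolved along these lines. This is a minimal JDK-only sketch and not code from the repository: the class SchemaNameResolverSketch, its resolve method, and the sample property are hypothetical, and the demo uses the system class loader with a JDK class where a real registry would presumably use a class loader built over the user's JAR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaNameResolverSketch {

    // Resolves each "schema name -> fully qualified class name" property to a Class,
    // failing fast when a configured class cannot be found.
    static Map<String, Class<?>> resolve(Map<String, String> properties, ClassLoader loader) {
        Map<String, Class<?>> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            try {
                // initialize=false loads the class without running its static initializers.
                resolved.put(entry.getKey(), Class.forName(entry.getValue(), false, loader));
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException(
                        "Schema '" + entry.getKey() + "' maps to unknown class " + entry.getValue(), e);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical property: schema name "event" backed by a JDK class for demo purposes.
        Map<String, String> properties = Map.of("event", "java.util.Date");
        System.out.println(resolve(properties, ClassLoader.getSystemClassLoader()));
    }
}
```

Because only the explicitly named classes are looked up, nothing else in the JAR needs to be scanned, which matches the point that the JAR is not introspected as a whole.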

Re: [DISCUSS] New schema repository idea (with proof of concept)

Posted by David Handermann <ex...@apache.org>.
Mike,

Thanks for the reply. I agree that file and property-based registries
are useful, so the main question seems to be whether to support a
registry derived from compiled code, as you have described.

It seems that the general use case could still be supported through
a file-backed registry, but without requiring the dynamic class loading
associated with a custom JAR.

Loading code from a JAR also presents greater security risks than
loading schema files, so if this were to be supported, it would
require additional permission restrictions.

To help think through this a bit more, can you describe the use case in
more detail? How would someone prepare a JAR for referencing in this
proposed registry?

Regards,
David Handermann

Re: [DISCUSS] New schema repository idea (with proof of concept)

Posted by Mike Thomsen <mi...@gmail.com>.
You raise some good points, but I think there's still ample room for
file-based schema registries within NiFi. With regard to the edge cases
with schema generation, I think an argument can also be made for "not
letting the perfect be the enemy of the good."

Re: [DISCUSS] New schema repository idea (with proof of concept)

Posted by David Handermann <ex...@apache.org>.
Mike,

Thanks for raising this question, and providing the example repository.

Although it sounds like a POJO-based repository could be useful in
some cases, it does not seem like something that should be included
for community support.

Part of the value of a Schema Registry is a shared location for data
description. Although supporting property or file-based Schema
Registries is useful in NiFi itself, the general pattern is
externalized storage and maintenance of schema definitions.

From another angle, this could be similar to code-first versus
contract-first API development. Each approach has its positives and
negatives. When it comes to a Schema Registry, however, it seems like
the contract needs to be defined outside of code.

Introspecting JAR files also raises questions about what to include or
exclude, and how to handle edge cases for certain class definitions.
This seems like the more significant problem. For this reason, it
seems better to rely on external operations to produce Avro schema
definitions, rather than supporting that in NiFi itself.

Those are my initial thoughts; perhaps others can provide additional
perspective.

Regards,
David Handermann
