Posted to user@avro.apache.org by Ryan Skraba <ry...@skraba.com> on 2019/07/15 13:30:00 UTC

Should a Schema be serializable in Java?

Hello!

I'm looking for any discussion or reference explaining why the Schema
object isn't serializable -- I'm pretty sure this must have been discussed
before, but the keywords +avro +serializable +schema return MANY results in
every search I tried: JIRA, Stack Overflow, the mailing list, and the web.

In particular, I was at a demo today where we were asked why Schemas needed
to be passed as strings to run in distributed tasks.  I remember running
into this problem years ago with MapReduce, and again in Spark, and again
in Beam...

Is there any downside to making Schema implement java.io.Serializable?
The only thing I can think of is that the schema _should not_ be serialized
with the data, and making it non-serializable loosely enforces this (at the
cost of continually writing different flavours of "Avro holder" for when
you really do want to serialize it).
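To illustrate, a typical "holder" looks something like the following sketch, with java.nio.file.Path standing in for org.apache.avro.Schema (both are non-serializable classes that can be faithfully rebuilt from their string form; class and field names here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Sketch of the "holder" workaround: wrap the non-serializable object in a
 * Serializable class that round-trips it through its string representation.
 */
public class PathHolder implements Serializable {
  private static final long serialVersionUID = 1L;

  // The wrapped object is not Serializable, so it must be transient.
  private transient Path path;

  public PathHolder(Path path) {
    this.path = path;
  }

  public Path get() {
    return path;
  }

  // Serialize the string form instead of the object itself
  // (for a real Schema this would be schema.toString(), the JSON form).
  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeUTF(path.toString());
  }

  // Re-parse the string on the other side
  // (for a real Schema: new Schema.Parser().parse(json)).
  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    path = Paths.get(in.readUTF());
  }

  /** Round-trip demo through an in-memory byte stream. */
  public static void main(String[] args) throws Exception {
    PathHolder before = new PathHolder(Paths.get("schemas/example.avsc"));
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(before);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      PathHolder after = (PathHolder) in.readObject();
      System.out.println(after.get().equals(before.get())); // true
    }
  }
}
```

Every framework that ships a Schema to workers ends up rewriting some variant of this wrapper, which is exactly the boilerplate the question is about.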

Willing to create a JIRA and work on the implementation, of course!

All my best, Ryan

Re: Should a Schema be serializable in Java?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thanks again Ryan,

I've merged to master and the 1.9 branch, so it should be in Avro 1.9.1.
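For reference, the merged change delegates Java serialization to the schema's JSON string form, as discussed in the thread below. The general writeReplace/readResolve proxy pattern it relies on can be sketched like this, with a stand-in class rather than Avro's actual Schema (all names here are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

/**
 * Stand-in class (NOT Avro's real code) showing the serialization-proxy
 * pattern: Java serialization writes only the JSON string, which is
 * re-parsed on deserialization.
 */
public class MiniSchema implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String json; // plays the role of Schema.toString()

  public MiniSchema(String json) {
    this.json = json;
  }

  @Override
  public String toString() {
    return json;
  }

  // Called by the serialization machinery: write the proxy, not this object.
  private Object writeReplace() throws ObjectStreamException {
    return new Proxy(json);
  }

  private static final class Proxy implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String json;

    Proxy(String json) {
      this.json = json;
    }

    // Called on deserialization: rebuild the object from its string form
    // (the real patch would do new Schema.Parser().parse(json) here).
    private Object readResolve() throws ObjectStreamException {
      return new MiniSchema(json);
    }
  }

  /** Round-trip demo. */
  public static void main(String[] args) throws Exception {
    MiniSchema before = new MiniSchema("{\"type\":\"string\"}");
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(before);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      System.out.println(in.readObject()); // {"type":"string"}
    }
  }
}
```

Note that the restored object is reconstructed by re-parsing the string, not byte-copied, which is what keeps the wire format independent of the object's internal fields.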

Cheers, Fokko Driesprong


On Thu, Jul 18, 2019 at 15:18, Zoltan Farkas <zo...@yahoo.com> wrote:

> LGTM
>
> On Jul 18, 2019, at 8:24 AM, Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> Thank you Ryan, I have a few comments on Github. Looks good to me.
>
> Cheers, Fokko
>
> On Thu, Jul 18, 2019 at 11:58, Ryan Skraba <ry...@skraba.com> wrote:
>
>> Hello!  I'm motivated to see this happen :D
>>
>> +Zoltan, the original author.  I created a PR against apache/avro master
>> here: https://github.com/apache/avro/pull/589
>>
>> I cherry-picked the commit from your fork, and reapplied
>> spotless/checkstyle.  I hope this is the correct way to preserve authorship
>> and that I'm not stepping on any toes!
>>
>> Can someone take a look at the above PR?
>>
>> Best regards,
>>
>> Ryan
>>
>> On Tue, Jul 16, 2019 at 11:58 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> Yes, it is probably overkill to warn, given the examples you mention.
>>> Also, your argument for reusing the mature (and battle-tested)
>>> combination of Schema.Parser + String serialization makes sense.
>>>
>>> Adding this to 1.9.1 will be an extra selling point for projects
>>> wanting to migrate to the latest version of Avro, so it sounds good to
>>> me, but you should add it to master and then we can cherry-pick it from
>>> there.
>>>
>>>
>>> On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba <ry...@skraba.com> wrote:
>>> >
>>> > Hello!  Thanks for the reference to AVRO-1852 -- it's exactly what I
>>> was looking for.
>>> >
>>> > I agree that Java serialization shouldn't be used for anything
>>> cross-platform, or (in my opinion) used for any data persistence at all.
>>> Especially not for an Avro container file or sending binary data through a
>>> messaging system...
>>> >
>>> > But Java serialization is definitely useful and used for sending
>>> instances of "distributed work" implemented in Java from node to node in a
>>> cluster.  I'm not too worried about existing connectors -- we can see that
>>> each framework has "solved" the problem one at a time.  In addition to
>>> Flink, there's
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
>>> and
>>> https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
>>> .
>>> >
>>> > Specifically, I see the advantage for user-defined distributed
>>> functions that happen to carry along an Avro Schema -- and I can personally
>>> say that I've encountered this a lot in our code!
>>> >
>>> > That being said, I think it's probably overkill to warn the user about
>>> the perils of Java serialization (not being cross-language and requiring
>>> consistent JDKs and libraries across JVMs).  If an error occurs for one of
>>> those reasons, there's a larger problem for the dev to address, and it's
>>> just as likely to occur for any Java library in the job if the environment
>>> is bad.  Related, we've encountered similar issues with logical types
>>> existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
>>> the solution is "make sure you don't do that".  (Looking at you, guava and
>>> jackson!)
>>> >
>>> > The patch in question delegates serialization to the string form of
>>> the schema, so it's basically doing what all of the above Avro "holders"
>>> are doing -- I wouldn't object to having a sample schema available that
>>> fully exercises what a schema can hold, but I also think that Schema.Parser
>>> (used underneath) is currently pretty well tested and mature!
>>> >
>>> > Do you think this could be a candidate for 1.9.1 as a minor
>>> improvement?  I can't think of any reason that this wouldn't be backwards
>>> compatible.
>>> >
>>> > Ryan
>>> >
>>> > side note: I wrote java.lang.Serializable earlier, which probably
>>> didn't help my search for prior discussion... :/
>>> >
>>> > On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ie...@gmail.com>
>>> wrote:
>>> >>
>>> >> This is a good idea, even if it has some issues that we should
>>> >> probably document and warn users about:
>>> >>
>>> >> 1. Java-based serialization is really practical for JVM-based systems,
>>> >> but we should probably add a warning or documentation, because Java
>>> >> serialization is not deterministic between JVMs, so this could be a
>>> >> source of issues (usually companies use the same version of the JVM,
>>> >> so this is less critical, but it can still happen, especially now with
>>> >> all the different versions of Java and OpenJDK-based flavors).
>>> >>
>>> >> 2. It is not cross-language compatible; the String-based
>>> >> representation (or even an Avro-based representation of Schema) can be
>>> >> used in every language.
>>> >>
>>> >> Even so, I think it is worth making Schema Serializable just for ease
>>> >> of use. Is the plan to fully serialize it, or just to convert it to a
>>> >> String and serialize the String, as done in the issue Doug mentioned?
>>> >> If we take the first approach, we need to test that (de)serialization
>>> >> works correctly with a Schema that has elements of the full
>>> >> specification. Does anyone know if we already have a test schema that
>>> >> covers the full schema specification, so we can reuse it?
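For illustration only (this is not an existing Avro test resource; every name below is invented), a test schema exercising most of the specification -- all primitives, nullable unions, enum, array, map, fixed, a logical type, field aliases, and recursion -- might look like:

```json
{
  "type": "record",
  "name": "Everything",
  "namespace": "org.example",
  "doc": "Illustrative schema touching most spec features.",
  "fields": [
    {"name": "aBoolean", "type": "boolean"},
    {"name": "anInt", "type": "int"},
    {"name": "aLong", "type": "long"},
    {"name": "aFloat", "type": "float"},
    {"name": "aDouble", "type": "double"},
    {"name": "someBytes", "type": "bytes"},
    {"name": "aString", "type": "string"},
    {"name": "maybeString", "type": ["null", "string"], "default": null},
    {"name": "anEnum",
     "type": {"type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS"]}},
    {"name": "anArray", "type": {"type": "array", "items": "long"}},
    {"name": "aMap", "type": {"type": "map", "values": "string"}},
    {"name": "aFixed", "type": {"type": "fixed", "name": "Md5", "size": 16}},
    {"name": "aTimestamp",
     "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "recursive", "type": ["null", "Everything"],
     "default": null, "aliases": ["self"]}
  ]
}
```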
>>> >>
>>> >> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>> >> >
>>> >> > Correct me if I'm wrong here, but as far as I understand, the way
>>> >> > to serialize the schema is using Avro, as it is part of the file.
>>> >> > To avoid confusion there should be one way of serializing.
>>> >> >
>>> >> > However, I'm not sure if this is worth the hassle of not simply
>>> >> > implementing Serializable. In Flink, too, there is a rather
>>> >> > suboptimal implementation:
>>> >> >
>>> >> > https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
>>> >> > It converts the schema to JSON and back while distributing it to
>>> >> > the executors.
>>> >> >
>>> >> > Cheers, Fokko
>>> >> >
>>> >> > On Mon, Jul 15, 2019 at 23:03, Doug Cutting <cutting@gmail.com> wrote:
>>> >> >
>>> >> > > I can't think of a reason Schema should not implement
>>> Serializable.
>>> >> > >
>>> >> > > There's actually already an issue & patch for this:
>>> >> > >
>>> >> > > https://issues.apache.org/jira/browse/AVRO-1852
>>> >> > >
>>> >> > > Doug
>>> >> > >
>>> >> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com>
>>> wrote:
>>> >> > >
>>> >> > > > +dev@avro.apache.org
>>> >> > > >
>>> >> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com>
>>> wrote:
>>> >> > > > >
>>> >> > > > > Hello!
>>> >> > > > >
>>> >> > > > > I'm looking for any discussion or reference why the Schema
>>> object isn't
>>> >> > > > serializable -- I'm pretty sure this must have already been
>>> discussed
>>> >> > > (but
>>> >> > > > the keywords +avro +serializable +schema have MANY results in
>>> all the
>>> >> > > > searches I did: JIRA, stack overflow, mailing list, web)
>>> >> > > > >
>>> >> > > > > In particular, I was at a demo today where we were asked why
>>> Schemas
>>> >> > > > needed to be passed as strings to run in distributed tasks.  I
>>> remember
>>> >> > > > running into this problem years ago with MapReduce, and again
>>> in Spark,
>>> >> > > and
>>> >> > > > again in Beam...
>>> >> > > > >
>>> >> > > > > Is there any downside to making a Schema implement
>>> >> > > > java.lang.Serializable?  The only thing I can think of is that
>>> the schema
>>> >> > > > _should not_ be serialized with the data, and making it
>>> non-serializable
>>> >> > > > loosely enforces this (at the cost of continually writing
>>> different
>>> >> > > > flavours of "Avro holders" for when you really do want to
>>> serialize it).
>>> >> > > > >
>>> >> > > > > Willing to create a JIRA and work on the implementation, of
>>> course!
>>> >> > > > >
>>> >> > > > > All my best, Ryan
>>> >> > > >
>>> >> > >
>>>
>>
>

Re: Should a Schema be serializable in Java?

Posted by Zoltan Farkas <zo...@yahoo.com>.
LGTM



Re: Should a Schema be serializable in Java?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thank you Ryan, I have a few comments on Github. Looks good to me.

Cheers, Fokko


Re: Should a Schema be serializable in Java?

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!  I'm motivated to see this happen :D

+Zoltan, the original author.  I created a PR against apache/avro master
here: https://github.com/apache/avro/pull/589

I cherry-picked the commit from your fork, and reapplied
spotless/checkstyle.  I hope this is the correct way to preserve authorship
and that I'm not stepping on any toes!
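On preserving authorship: cherry-picking keeps the original Author field of a commit while recording whoever applies it as the committer. A quick self-contained demonstration (the temp repo, branch, and author names are all made up for the example):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
# A maintainer commit to start the history.
git -c user.name=Maintainer -c user.email=m@example.com \
    commit -q --allow-empty -m "initial commit"
base=$(git symbolic-ref --short HEAD)  # 'master' or 'main', depending on git version
# A contributor's patch on a side branch.
git checkout -q -b contributor-fork
git -c user.name=Zoltan -c user.email=z@example.com \
    commit -q --allow-empty -m "Make Schema serializable"
patch=$(git rev-parse HEAD)
# The maintainer cherry-picks it back onto the base branch.
git checkout -q "$base"
git -c user.name=Maintainer -c user.email=m@example.com \
    cherry-pick --allow-empty "$patch" >/dev/null
# Author is preserved; committer records who applied it.
git log -1 --format='author=%an committer=%cn'
# prints: author=Zoltan committer=Maintainer
```

So cherry-picking a commit from a fork is indeed a reasonable way to keep credit with the original author.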

Can someone take a look at the above PR?

Best regards,

Ryan


Re: Should a Schema be serializable in Java?

Posted by Ismaël Mejía <ie...@gmail.com>.
Yes, probably it is overkill to warn given the examples you mention.
Also, your argument for reusing the mature (and battle-tested)
combination of Schema.Parser + String serialization makes sense.

Adding this to 1.9.1 will be an extra selling point for projects
wanting to migrate to the latest version of Avro so it sounds good to
me but you should add it to master and then we can cherry pick it from
there.


On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba <ry...@skraba.com> wrote:
>
> Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was looking for.
>
> I agree that Java serialization shouldn't be used for anything cross-platform, or (in my opinion) used for any data persistence at all.  Especially not for an Avro container file or sending binary data through a messaging system...
>
> But Java serialization is definitely useful and used for sending instances of "distributed work" implemented in Java from node to node in a cluster.  I'm not too worried about existing connectors -- we can see that each framework has "solved" the problem one at a time.  In addition to Flink, there's https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29 and https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35.
>
> Specifically, I see the advantage for user-defined distributed functions that happen to carry along an Avro Schema -- and I can personally say that I've encountered this a lot in our code!
>
> That being said, I think it's probably overkill to warn the user about the perils of Java serialization (not being cross-language and requiring consistent JDKs and libraries across JVMs).  If an error occurs for one of those reasons, there's a larger problem for the dev to address, and it's just as likely to occur for any Java library in the job if the environment is bad.  Related, we've encountered similar issues with logical types existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster... the solution is "make sure you don't do that".  (Looking at you, guava and jackson!)
>
> The patch in question delegates serialization to the string form of the schema, so it's basically doing what all of the above Avro "holders" are doing -- I wouldn't object to having a sample schema available that fully exercises what a schema can hold, but I also think that Schema.Parser (used underneath) is currently pretty well tested and mature!
>
> Do you think this could be a candidate for 1.9.1 as a minor improvement?  I can't think of any reason that this wouldn't be backwards compatible.
>
> Ryan
>
> side note: I wrote java.lang.Serializable earlier, which probably didn't help my search for prior discussion... :/
>
> On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>> This is a good idea even if it may have some issues that we should
>> probably document and warn users about:
>>
>> 1. Java based serialization is really practical for JVM based systems,
>> but we should probably add a warning or documentation because Java
>> serialization is not deterministic between JVMs so this could be a
>> source for issues (usually companies use the same version of the JVM
>> so this is less critical, but this still can happen specially now with
>> all the different versions of Java and OpenJDK based flavors).
>>
>> 2. This is not cross language compatible, the String based
>> representation (or even an Avro based representation of Schema) can be
>> used in every language.
>>
>> Even with these I think just for ease of use it is worth to make
>> Schema Serializable. Is the plan to fully serialize it, or just to
>> make it a String and serialize the String as done in the issue Doug
>> mentioned?
>> If we take the first approach we need to properly test with a Schema
>> that has elements of the full specification that (de)-serialization
>> works correctly. Does anyone know if we have already a test schema
>> that covers the full ‘schema’ specification to reuse it if so?
>>
>> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>> >
>> > Correct me if I'm wrong here. But as far as I understood the way of
>> > serializing the schema is using Avro, as it is part of the file. To avoid
>> > confusion there should be one way of serializing.
>> >
>> > However, I'm not sure if this is worth the hassle of not simply
>> > implementing serializable. Also Flink there is a rather far from optimal
>> > implementation:
>> > https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
>> > This converts it to JSON and back while distributing the schema to the
>> > executors.
>> >
>> > Cheers, Fokko
>> >
>> > Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cu...@gmail.com>:
>> >
>> > > I can't think of a reason Schema should not implement Serializable.
>> > >
>> > > There's actually already an issue & patch for this:
>> > >
>> > > https://issues.apache.org/jira/browse/AVRO-1852
>> > >
>> > > Doug
>> > >
>> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com> wrote:
>> > >
>> > > > +dev@avro.apache.org
>> > > >
>> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
>> > > > >
>> > > > > Hello!
>> > > > >
>> > > > > I'm looking for any discussion or reference why the Schema object isn't
>> > > > serializable -- I'm pretty sure this must have already been discussed
>> > > (but
>> > > > the keywords +avro +serializable +schema have MANY results in all the
>> > > > searches I did: JIRA, stack overflow, mailing list, web)
>> > > > >
>> > > > > In particular, I was at a demo today where we were asked why Schemas
>> > > > needed to be passed as strings to run in distributed tasks.  I remember
>> > > > running into this problem years ago with MapReduce, and again in Spark,
>> > > and
>> > > > again in Beam...
>> > > > >
>> > > > > Is there any downside to making a Schema implement
>> > > > java.lang.Serializable?  The only thing I can think of is that the schema
>> > > > _should not_ be serialized with the data, and making it non-serializable
>> > > > loosely enforces this (at the cost of continually writing different
>> > > > flavours of "Avro holders" for when you really do want to serialize it).
>> > > > >
>> > > > > Willing to create a JIRA and work on the implementation, of course!
>> > > > >
>> > > > > All my best, Ryan
>> > > >
>> > >

Re: Should a Schema be serializable in Java?

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
looking for.

I agree that Java serialization shouldn't be used for anything
cross-platform, or (in my opinion) used for any *data* persistence at all.
Especially not for an Avro container file or sending binary data through a
messaging system...

But Java serialization is definitely useful and used for sending instances
of "distributed work" implemented in Java from node to node in a cluster.
I'm not too worried about existing connectors -- we can see that each
framework has "solved" the problem one at a time.  In addition to Flink,
there's
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
 and
https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
.

Specifically, I see the advantage for user-defined distributed functions
that happen to carry along an Avro Schema -- and I can personally say that
I've encountered this a lot in our code!

That being said, I think it's probably overkill to warn the user about the
perils of Java serialization (not being cross-language and requiring
consistent JDKs and libraries across JVMs).  If an error occurs for one of
those reasons, there's a larger problem for the dev to address, and it's
just as likely to occur for any Java library in the job if the environment
is bad.  Related, we've encountered similar issues with logical types
existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
the solution is "make sure you don't do that".  (Looking at you, guava and
jackson!)

The patch in question delegates serialization to the string form of the
schema, so it's basically doing what all of the above Avro "holders" are
doing -- I wouldn't object to having a sample schema available that fully
exercises what a schema can hold, but I also think that Schema.Parser (used
underneath) is currently pretty well tested and mature!
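
[Editor's note: a rough, self-contained sketch of that string-delegation idea,
using the serialization-proxy idiom. `SchemaLike` and `Proxy` are hypothetical
stand-ins for org.apache.avro.Schema and the actual AVRO-1852 patch, shown only
to illustrate the pattern:]

```java
import java.io.*;

// Hypothetical stand-in for org.apache.avro.Schema; only here to show the
// serialization-proxy idiom that delegates to the JSON string form.
final class SchemaLike implements Serializable {
    private final String json;

    private SchemaLike(String json) { this.json = json; }

    // Stands in for: new Schema.Parser().parse(json)
    static SchemaLike parse(String json) { return new SchemaLike(json); }

    @Override public boolean equals(Object o) {
        return o instanceof SchemaLike && ((SchemaLike) o).json.equals(json);
    }
    @Override public int hashCode() { return json.hashCode(); }

    // Java serialization writes the compact proxy, not the object graph.
    private Object writeReplace() { return new Proxy(json); }

    private static final class Proxy implements Serializable {
        private final String json;
        Proxy(String json) { this.json = json; }
        // Re-parse on the receiving JVM.
        private Object readResolve() { return SchemaLike.parse(json); }
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        SchemaLike schema = SchemaLike.parse("{\"type\":\"string\"}");

        // Round-trip through Java serialization, as a cluster scheduler would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(schema);
        }
        SchemaLike copy;
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (SchemaLike) ois.readObject();
        }
        System.out.println(copy.equals(schema));
    }
}
```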

Do you think this could be a candidate for 1.9.1 as a minor improvement?  I
can't think of any reason that this wouldn't be backwards compatible.

Ryan

side note: I wrote java.lang.Serializable earlier, which probably didn't
help my search for prior discussion... :/

On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ie...@gmail.com> wrote:

> This is a good idea even if it may have some issues that we should
> probably document and warn users about:
>
> 1. Java based serialization is really practical for JVM based systems,
> but we should probably add a warning or documentation because Java
> serialization is not deterministic between JVMs so this could be a
> source for issues (usually companies use the same version of the JVM
> so this is less critical, but this still can happen specially now with
> all the different versions of Java and OpenJDK based flavors).
>
> 2. This is not cross language compatible, the String based
> representation (or even an Avro based representation of Schema) can be
> used in every language.
>
> Even with these I think just for ease of use it is worth to make
> Schema Serializable. Is the plan to fully serialize it, or just to
> make it a String and serialize the String as done in the issue Doug
> mentioned?
> If we take the first approach we need to properly test with a Schema
> that has elements of the full specification that (de)-serialization
> works correctly. Does anyone know if we have already a test schema
> that covers the full ‘schema’ specification to reuse it if so?
>
> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
> >
> > Correct me if I'm wrong here. But as far as I understood the way of
> > serializing the schema is using Avro, as it is part of the file. To avoid
> > confusion there should be one way of serializing.
> >
> > However, I'm not sure if this is worth the hassle of not simply
> > implementing serializable. Also Flink there is a rather far from optimal
> > implementation:
> >
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> > This converts it to JSON and back while distributing the schema to the
> > executors.
> >
> > Cheers, Fokko
> >
> > Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cu...@gmail.com>:
> >
> > > I can't think of a reason Schema should not implement Serializable.
> > >
> > > There's actually already an issue & patch for this:
> > >
> > > https://issues.apache.org/jira/browse/AVRO-1852
> > >
> > > Doug
> > >
> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com>
> wrote:
> > >
> > > > +dev@avro.apache.org
> > > >
> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > I'm looking for any discussion or reference why the Schema object
> isn't
> > > > serializable -- I'm pretty sure this must have already been discussed
> > > (but
> > > > the keywords +avro +serializable +schema have MANY results in all the
> > > > searches I did: JIRA, stack overflow, mailing list, web)
> > > > >
> > > > > In particular, I was at a demo today where we were asked why
> Schemas
> > > > needed to be passed as strings to run in distributed tasks.  I
> remember
> > > > running into this problem years ago with MapReduce, and again in
> Spark,
> > > and
> > > > again in Beam...
> > > > >
> > > > > Is there any downside to making a Schema implement
> > > > java.lang.Serializable?  The only thing I can think of is that the
> schema
> > > > _should not_ be serialized with the data, and making it
> non-serializable
> > > > loosely enforces this (at the cost of continually writing different
> > > > flavours of "Avro holders" for when you really do want to serialize
> it).
> > > > >
> > > > > Willing to create a JIRA and work on the implementation, of course!
> > > > >
> > > > > All my best, Ryan
> > > >
> > >
>

Re: Should a Schema be serializable in Java?

Posted by Ismaël Mejía <ie...@gmail.com>.
This is a good idea even if it may have some issues that we should
probably document and warn users about:

1. Java based serialization is really practical for JVM based systems,
but we should probably add a warning or documentation because Java
serialization is not deterministic between JVMs so this could be a
source for issues (usually companies use the same version of the JVM
> so this is less critical, but this still can happen especially now with
all the different versions of Java and OpenJDK based flavors).

2. This is not cross language compatible, the String based
representation (or even an Avro based representation of Schema) can be
used in every language.

> Even with these I think, just for ease of use, it is worth making
Schema Serializable. Is the plan to fully serialize it, or just to
make it a String and serialize the String as done in the issue Doug
mentioned?
If we take the first approach we need to properly test with a Schema
that has elements of the full specification that (de)-serialization
works correctly. Does anyone know if we have already a test schema
that covers the full ‘schema’ specification to reuse it if so?
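
[Editor's note: a hypothetical test schema touching most branches of the
specification might look roughly like the one below; this is a sketch, not an
existing Avro test fixture:]

```java
public class Main {
    // Hypothetical full-coverage schema: record, enum, fixed, array, map,
    // union with default, and a logical type on a primitive.
    static final String SCHEMA_JSON = String.join("\n",
        "{\"type\": \"record\", \"name\": \"Everything\", \"namespace\": \"example\",",
        " \"fields\": [",
        "   {\"name\": \"id\", \"type\": \"long\"},",
        "   {\"name\": \"color\", \"type\": {\"type\": \"enum\", \"name\": \"Color\", \"symbols\": [\"RED\", \"GREEN\"]}},",
        "   {\"name\": \"hash\", \"type\": {\"type\": \"fixed\", \"name\": \"MD5\", \"size\": 16}},",
        "   {\"name\": \"tags\", \"type\": {\"type\": \"array\", \"items\": \"string\"}},",
        "   {\"name\": \"props\", \"type\": {\"type\": \"map\", \"values\": \"bytes\"}},",
        "   {\"name\": \"maybe\", \"type\": [\"null\", \"double\"], \"default\": null},",
        "   {\"name\": \"ts\", \"type\": {\"type\": \"long\", \"logicalType\": \"timestamp-millis\"}}",
        " ]}");

    public static void main(String[] args) {
        // With Avro on the classpath this would round-trip through:
        //   Schema s = new Schema.Parser().parse(SCHEMA_JSON);
        System.out.println(SCHEMA_JSON.contains("timestamp-millis"));
    }
}
```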

On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
> Correct me if I'm wrong here. But as far as I understood the way of
> serializing the schema is using Avro, as it is part of the file. To avoid
> confusion there should be one way of serializing.
>
> However, I'm not sure if this is worth the hassle of not simply
> implementing serializable. Also Flink there is a rather far from optimal
> implementation:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> This converts it to JSON and back while distributing the schema to the
> executors.
>
> Cheers, Fokko
>
> Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cu...@gmail.com>:
>
> > I can't think of a reason Schema should not implement Serializable.
> >
> > There's actually already an issue & patch for this:
> >
> > https://issues.apache.org/jira/browse/AVRO-1852
> >
> > Doug
> >
> > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com> wrote:
> >
> > > +dev@avro.apache.org
> > >
> > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
> > > >
> > > > Hello!
> > > >
> > > > I'm looking for any discussion or reference why the Schema object isn't
> > > serializable -- I'm pretty sure this must have already been discussed
> > (but
> > > the keywords +avro +serializable +schema have MANY results in all the
> > > searches I did: JIRA, stack overflow, mailing list, web)
> > > >
> > > > In particular, I was at a demo today where we were asked why Schemas
> > > needed to be passed as strings to run in distributed tasks.  I remember
> > > running into this problem years ago with MapReduce, and again in Spark,
> > and
> > > again in Beam...
> > > >
> > > > Is there any downside to making a Schema implement
> > > java.lang.Serializable?  The only thing I can think of is that the schema
> > > _should not_ be serialized with the data, and making it non-serializable
> > > loosely enforces this (at the cost of continually writing different
> > > flavours of "Avro holders" for when you really do want to serialize it).
> > > >
> > > > Willing to create a JIRA and work on the implementation, of course!
> > > >
> > > > All my best, Ryan
> > >
> >

Re: Should a Schema be serializable in Java?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Correct me if I'm wrong here. But as far as I understood the way of
serializing the schema is using Avro, as it is part of the file. To avoid
confusion there should be one way of serializing.

However, I'm not sure if this is worth the hassle of not simply
implementing Serializable. Also, in Flink there is a rather suboptimal
implementation:
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
This converts it to JSON and back while distributing the schema to the
executors.
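
[Editor's note: the per-framework "holder" workaround mentioned here boils
down to shipping only the JSON string and re-parsing lazily on the worker.
A minimal sketch under that assumption; `SchemaHolder` and its `parse()`
stand-in are hypothetical names, not Flink API:]

```java
import java.io.*;

// Minimal "Avro holder" sketch: the parsed schema is transient, so only the
// JSON string travels through Java serialization.
class SchemaHolder implements Serializable {
    private final String schemaJson;
    private transient String parsedSchema; // would be org.apache.avro.Schema

    SchemaHolder(String schemaJson) { this.schemaJson = schemaJson; }

    // Stand-in for: new Schema.Parser().parse(schemaJson)
    private static String parse(String json) { return json.trim(); }

    String schema() {
        if (parsedSchema == null) {  // null again after deserialization
            parsedSchema = parse(schemaJson);
        }
        return parsedSchema;
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        SchemaHolder holder = new SchemaHolder("{\"type\":\"string\"}");

        // Round-trip, as when distributing work to executors.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(holder);
        }
        SchemaHolder copy;
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (SchemaHolder) ois.readObject();
        }
        System.out.println(copy.schema().equals(holder.schema()));
    }
}
```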

Cheers, Fokko

Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cu...@gmail.com>:

> I can't think of a reason Schema should not implement Serializable.
>
> There's actually already an issue & patch for this:
>
> https://issues.apache.org/jira/browse/AVRO-1852
>
> Doug
>
> On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
> > +dev@avro.apache.org
> >
> > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
> > >
> > > Hello!
> > >
> > > I'm looking for any discussion or reference why the Schema object isn't
> > serializable -- I'm pretty sure this must have already been discussed
> (but
> > the keywords +avro +serializable +schema have MANY results in all the
> > searches I did: JIRA, stack overflow, mailing list, web)
> > >
> > > In particular, I was at a demo today where we were asked why Schemas
> > needed to be passed as strings to run in distributed tasks.  I remember
> > running into this problem years ago with MapReduce, and again in Spark,
> and
> > again in Beam...
> > >
> > > Is there any downside to making a Schema implement
> > java.lang.Serializable?  The only thing I can think of is that the schema
> > _should not_ be serialized with the data, and making it non-serializable
> > loosely enforces this (at the cost of continually writing different
> > flavours of "Avro holders" for when you really do want to serialize it).
> > >
> > > Willing to create a JIRA and work on the implementation, of course!
> > >
> > > All my best, Ryan
> >
>

Re: Should a Schema be serializable in Java?

Posted by Doug Cutting <cu...@gmail.com>.
I can't think of a reason Schema should not implement Serializable.

There's actually already an issue & patch for this:

https://issues.apache.org/jira/browse/AVRO-1852

Doug

On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ie...@gmail.com> wrote:

> +dev@avro.apache.org
>
> On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
> >
> > Hello!
> >
> > I'm looking for any discussion or reference why the Schema object isn't
> serializable -- I'm pretty sure this must have already been discussed (but
> the keywords +avro +serializable +schema have MANY results in all the
> searches I did: JIRA, stack overflow, mailing list, web)
> >
> > In particular, I was at a demo today where we were asked why Schemas
> needed to be passed as strings to run in distributed tasks.  I remember
> running into this problem years ago with MapReduce, and again in Spark, and
> again in Beam...
> >
> > Is there any downside to making a Schema implement
> java.lang.Serializable?  The only thing I can think of is that the schema
> _should not_ be serialized with the data, and making it non-serializable
> loosely enforces this (at the cost of continually writing different
> flavours of "Avro holders" for when you really do want to serialize it).
> >
> > Willing to create a JIRA and work on the implementation, of course!
> >
> > All my best, Ryan
>
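[Editor's note: the "Avro holders" workaround Ryan mentions usually looks something like the sketch below: a Serializable wrapper that carries only the schema's JSON text and re-parses it lazily on the receiving side. The thread itself ships no code, so this uses a hypothetical stand-in `FakeSchema` class in place of org.apache.avro.Schema (which is not Serializable before AVRO-1852); the pattern is the same.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// The "holder" pattern: a Serializable wrapper that carries the schema
// only as its JSON text and rebuilds the object after deserialization.
class SchemaHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String schemaJson;
    private transient FakeSchema schema; // never serialized; rebuilt lazily

    SchemaHolder(FakeSchema schema) {
        this.schemaJson = schema.toString();
        this.schema = schema;
    }

    FakeSchema get() {
        if (schema == null) { // true after deserialization (transient field)
            schema = FakeSchema.parse(schemaJson);
        }
        return schema;
    }

    public static void main(String[] args) throws Exception {
        SchemaHolder before = new SchemaHolder(new FakeSchema("{\"type\":\"string\"}"));

        // Round-trip through Java serialization, as Spark/Beam/MapReduce would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(before);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            SchemaHolder after = (SchemaHolder) in.readObject();
            System.out.println(after.get()); // prints {"type":"string"}
        }
    }
}

// Hypothetical stand-in for org.apache.avro.Schema: it round-trips through
// its JSON text, much as Schema.Parser / Schema.toString() would.
class FakeSchema {
    private final String json;
    FakeSchema(String json) { this.json = json; }
    static FakeSchema parse(String json) { return new FakeSchema(json); }
    @Override public String toString() { return json; }
}
```

The cost of this approach is exactly what Ryan describes: every framework ends up writing its own flavour of holder.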

Re: Should a Schema be serializable in Java?

Posted by Ismaël Mejía <ie...@gmail.com>.
+dev@avro.apache.org

On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <ry...@skraba.com> wrote:
>
> Hello!
>
> I'm looking for any discussion or reference why the Schema object isn't serializable -- I'm pretty sure this must have already been discussed (but the keywords +avro +serializable +schema have MANY results in all the searches I did: JIRA, stack overflow, mailing list, web)
>
> In particular, I was at a demo today where we were asked why Schemas needed to be passed as strings to run in distributed tasks.  I remember running into this problem years ago with MapReduce, and again in Spark, and again in Beam...
>
> Is there any downside to making a Schema implement java.lang.Serializable?  The only thing I can think of is that the schema _should not_ be serialized with the data, and making it non-serializable loosely enforces this (at the cost of continually writing different flavours of "Avro holders" for when you really do want to serialize it).
>
> Willing to create a JIRA and work on the implementation, of course!
>
> All my best, Ryan
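[Editor's note: making the schema class itself Serializable, as AVRO-1852 proposes, is commonly done with a serialization proxy, so that only the JSON text ever crosses the wire and readers never deserialize internal parsed state. This is a minimal sketch of that technique using a hypothetical stand-in `MiniSchema` class, not the actual Avro patch.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in schema class; a real Schema would hold the parsed structure
// and regenerate its canonical JSON on demand.
class MiniSchema implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String json;
    MiniSchema(String json) { this.json = json; }
    String toJson() { return json; }

    // Serialize via a proxy that carries only the JSON text, keeping the
    // wire format small and independent of internal representation...
    private Object writeReplace() { return new SerializableProxy(json); }

    private static class SerializableProxy implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String json;
        SerializableProxy(String json) { this.json = json; }
        // ...and re-parse on the receiving side.
        private Object readResolve() { return new MiniSchema(json); }
    }

    public static void main(String[] args) throws Exception {
        MiniSchema before = new MiniSchema("{\"type\":\"long\"}");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(before);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            MiniSchema after = (MiniSchema) in.readObject();
            System.out.println(after.toJson()); // prints {"type":"long"}
        }
    }
}
```

With this in place, distributed runtimes can ship the schema object directly, while the concern in Ryan's message still holds: the schema should travel with the task, not with the data.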
