Posted to user@avro.apache.org by Austin Cawley-Edwards <au...@gmail.com> on 2019/11/27 23:05:33 UTC

Non-Deterministic avsc Directory Compilation

Hi,

We're trying to use the `compile {src dir} {output dir}` command in
`avro-tools` and finding that there are some non-deterministic
behaviors between systems, depending on how the OS sorts files.

Example:
schemas/Component.avsc
  - defines Component record type in the namespace `com.test`

schemas/Parent.avsc
  - defines a Parent record, in the same `com.test` namespace, with a
field of type `com.test.Component`


With the command, `java -jar avro-tools-1.9.1.jar compile schemas/
out-dir/`, some systems compile the directory in the order Component,
Parent while others compile in the order Parent, Component. The latter
fails as Component has not been defined when it is referenced by
Parent.

We have also tried writing the schemas in Avro IDL, importing the
dependency types, converting them to avsc, and then compiling the
entire directory, but that fails because each generated avsc file
embeds its own copy of the "Component" type wherever it is used.


Is there a way to deterministically compile a directory? Or compile
directly from IDL to java?


OS:
Linux 857aaf92e059 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12
10:36:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Avro:
version 1.9.1



Thank you!
Austin Cawley-Edwards

Re: Non-Deterministic avsc Directory Compilation

Posted by Austin Cawley-Edwards <au...@gmail.com>.
Thank you all for the great info! I've opened up a ticket here:
https://issues.apache.org/jira/browse/AVRO-2644

In the meantime, I've passed the files in explicitly, which works well.

Thanks again,
Austin

On Thu, Nov 28, 2019 at 10:43 AM Lee Hambley <le...@gmail.com> wrote:

> `info sort` (which mostly describes the POSIX `sort` command) says that
> LC_COLLATE is what affects sort order on POSIX-like systems. LC_ALL, when
> set, overrides it, and LANG is only the fallback used when neither LC_ALL
> nor LC_COLLATE is set.
>
> "C" is, as Michael says, the "safest" and most predictable locale: it
> sorts by the raw byte values, with no numeric-aware ordering of numbers
> (so "10" sorts before "9"), and uppercase ASCII letters sort before their
> lowercase counterparts (because they have smaller byte values).
>
> Hope this helps
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Thu, 28 Nov 2019 at 15:56, Michael A. Smith <mi...@smith-li.com>
> wrote:
>
>> On unixish systems it probably depends on the locale, as in LANG and
>> LC_COLLATE. In my experience, the least surprising behavior comes with
>> LANG=C, except when you're dealing with file names containing a lot of
>> non-ascii text.
>>
>> On Thu, Nov 28, 2019 at 04:29 Ryan Skraba <ry...@skraba.com> wrote:
>>
>>> Effectively, the schemas are added in the order that the file system
>>> lists files:
>>> https://github.com/apache/avro/blob/f310ac8db5ab962a49d448f41b7b953488cdb033/lang/java/tools/src/main/java/org/apache/avro/tool/SpecificCompilerTool.java#L149
>>>
>>> As you observed, this depends on the operating system and/or
>>> filesystem... I've experienced this in the past (with an unrelated
>>> tool that generated a classpath from a list of JARS, and seeing an
>>> unreliable order on Windows vs. linux).
>>>
>>> Just reading the code, it should be deterministic if you explicitly
>>> list the avsc files (or at least the "problem" file)  with the
>>> required order:
>>>
>>> java -jar avro-tools-1.9.1.jar compile schemas/Component.avsc schemas/Parent.avsc out-dir/
>>>
>>> or
>>>
>>> java -jar avro-tools-1.9.1.jar compile schemas/Component.avsc schemas/
>>> out-dir/
>>>
>>> Would it be possible to give this workaround a try?
>>>
>>> I took a quick look at the avro-maven-plugin; it doesn't use
>>> listFiles() directly to discover files, but uses FileSetManager from
>>> the maven project.  I'm hoping they've taken this into account!
>>>
>>> Thanks for the well-described, well-defined email!  It would make an
>>> excellent bug report :D  https://issues.apache.org/jira/browse/AVRO
>>>
>>> Ryan
>>>
>>>
>>> On Thu, Nov 28, 2019 at 12:05 AM Austin Cawley-Edwards
>>> <au...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We're trying to use the `compile {src dir} {output dir}` command in
>>> > `avro-tools` and finding that there are some non-deterministic
>>> > behaviors between systems, depending on how the OS sorts files.
>>> >
>>> > Example:
>>> > schemas/Component.avsc
>>> >   - defines Component record type in the namespace `com.test`
>>> >
>>> > schemas/Parent.avsc
>>> >   - defines a Parent record,  in the same `com.test` namespace, with a
>>> > field of type `com.test.Component`
>>> >
>>> >
>>> > With the command, `java -jar avro-tools-1.9.1.jar compile schemas/
>>> > out-dir/`, some systems compile the directory in the order Component,
>>> > Parent while others compile in the order Parent, Component. The latter
>>> > fails as Component has not been defined when it is referenced by
>>> > Parent.
>>> >
>>> > We have also tried using the IDL and importing the dependency types,
>>> > and then converting them to avsc, and finally compiling the entire
>>> > directory, but that fails as the generated avsc files embed/ duplicate
>>> > the "Component" types each time it is used.
>>> >
>>> >
>>> > Is there a way to deterministically compile a directory? Or compile
>>> > directly from IDL to java?
>>> >
>>> >
>>> > OS:
>>> > Linux 857aaf92e059 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12
>>> > 10:36:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>> >
>>> > Avro:
>>> > version 1.9.1
>>> >
>>> >
>>> >
>>> > Thank you!
>>> > Austin Cawley-Edwards
>>>
>>

Re: Non-Deterministic avsc Directory Compilation

Posted by Lee Hambley <le...@gmail.com>.
`info sort` (which mostly describes the POSIX `sort` command) says that
LC_COLLATE is what affects sort order on POSIX-like systems. LC_ALL, when
set, overrides it, and LANG is only the fallback used when neither LC_ALL
nor LC_COLLATE is set.

"C" is, as Michael says, the "safest" and most predictable locale: it
sorts by the raw byte values, with no numeric-aware ordering of numbers
(so "10" sorts before "9"), and uppercase ASCII letters sort before their
lowercase counterparts (because they have smaller byte values).
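The byte-order behaviour of the "C" locale is easy to check with the
standard `sort` command. This is only a small sketch; the file names are
made up for the demonstration. Digits come first, then uppercase ASCII
letters, then lowercase ones, and "10" lands before "9" because the
comparison is byte by byte:

```shell
# Sort a few made-up names in plain byte order (the "C" locale).
# LC_ALL takes precedence over LC_COLLATE and LANG for this one command.
sorted=$(printf '%s\n' parent.avsc Component.avsc 10.avsc 9.avsc | LC_ALL=C sort)

# Prints, one per line: 10.avsc 9.avsc Component.avsc parent.avsc
printf '%s\n' "$sorted"
```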

Hope this helps

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667



Re: Non-Deterministic avsc Directory Compilation

Posted by "Michael A. Smith" <mi...@smith-li.com>.
On unixish systems it probably depends on the locale, as in LANG and
LC_COLLATE. In my experience, the least surprising behavior comes with
LANG=C, except when you're dealing with file names containing a lot of
non-ascii text.
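The locale certainly affects shell glob expansion: POSIX shells sort the
matches of a pattern such as `schemas/*.avsc` using the current
collation. A minimal sketch, assuming the alphabetical order of the file
names happens to match the dependency order (true for Component/Parent
here, but not in general). The temporary directory and empty placeholder
files are only for illustration, and the last line prints the avro-tools
command from this thread rather than running it:

```shell
# Throwaway directory holding the two example schema files (empty placeholders).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/schemas"
touch "$demo_dir/schemas/Parent.avsc" "$demo_dir/schemas/Component.avsc"

# Pin the collation so the glob below expands in plain byte order everywhere.
LC_ALL=C
export LC_ALL

# The shell sorts glob matches itself, whatever order the directory lists them in.
cd "$demo_dir"
set -- schemas/*.avsc
ordered_files="$*"

# Print the command instead of running it; drop the `echo` to compile for real.
echo java -jar avro-tools-1.9.1.jar compile "$@" out-dir/
```

If the dependency order is not alphabetical, this does not help and the
files still need to be listed by hand.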


Re: Non-Deterministic avsc Directory Compilation

Posted by Ryan Skraba <ry...@skraba.com>.
Effectively, the schemas are added in the order that the file system
lists files: https://github.com/apache/avro/blob/f310ac8db5ab962a49d448f41b7b953488cdb033/lang/java/tools/src/main/java/org/apache/avro/tool/SpecificCompilerTool.java#L149

As you observed, this depends on the operating system and/or
filesystem... I've experienced this in the past (with an unrelated
tool that generated a classpath from a list of JARs, and saw an
unreliable order on Windows vs. Linux).

Just reading the code, it should be deterministic if you explicitly
list the avsc files (or at least the "problem" file) in the
required order:

java -jar avro-tools-1.9.1.jar compile schemas/Component.avsc schemas/Parent.avsc out-dir/

or

java -jar avro-tools-1.9.1.jar compile schemas/Component.avsc schemas/ out-dir/

Would it be possible to give this workaround a try?
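For a build script, the same workaround could be kept as a hand-written,
dependency-ordered list. This is only a sketch: the variable names are
made up, it assumes the paths contain no whitespace, and it prints the
command line used in this thread instead of executing it:

```shell
# Dependency order is spelled out by hand: referenced types come first.
ordered_schemas="schemas/Component.avsc schemas/Parent.avsc"

# Assemble the full command line from the ordered list.
compile_cmd="java -jar avro-tools-1.9.1.jar compile $ordered_schemas out-dir/"

# Print the command for review; run it with `sh -c "$compile_cmd"` when ready.
echo "$compile_cmd"
```

New schemas then have to be added to the list in the right position,
which is tedious but does not depend on how any system lists files.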

I took a quick look at the avro-maven-plugin; it doesn't use
listFiles() directly to discover files, but uses FileSetManager from
the Maven project. I'm hoping they've taken this into account!

Thanks for the well-described, well-defined email!  It would make an
excellent bug report :D  https://issues.apache.org/jira/browse/AVRO

Ryan

