You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@avro.apache.org by David Jeske <da...@gmail.com> on 2010/12/02 16:30:00 UTC

questions about sort-orders

I like the inclusion of sort-order in avro, to enable different machines to
sort and exchange. I have a few suggestions to clarify the documentation.
Please correct any assumptions I've made that are incorrect...

It seems that sorts are not stable across schema versions. I think I
understand why this makes sense inside the schema philosophy, yet I think
the documentation could clear up a couple of the subtlties a bit more. For
example, it says "*data items may only be compared if they have identical
schemas*". If I supply a source schema which avro can map into my target
schema, I would think it could load and compare things in my target schema.
Is this correct? It might be clarified.

Also, the comment "*this permits data written by one system to be
efficiently sorted by another system*", could callout that data items sorted
in one schema may not be in the proper order if during read they are mapped
to a new version of the schema. In fact, it might be useful for Avro to be
able to tell me when it does the source->target schema mapping, whether both
schemas sorted in the same order (if it doesn't already).

Lastly, it says "*Note also that Avro binary-encoded data can be efficiently
ordered without deserializing it to objects.*" What does this mean exactly?
 This might be mis-interpreted as saying one can lexicographically sort the
binary-encoding without asking Avro to deserialize it, and it'll be in a
proper order. However, this seems obviously not true from the number
formats. Perhaps it would be clearer to say "Avro can efficiently make
sort-comparisons on binary-encoded data without allocating deserialization
objects."

Did I properly understand those sort-related subtlties?

Re: questions about sort-orders

Posted by Philip Zeyliger <ph...@cloudera.com>.

> Lastly, it says "Note also that Avro binary-encoded data can be efficiently
> ordered without deserializing it to objects." What does this mean exactly?

This is hinting at an implementation detail, though one of historical interest.

There exists code, in Java, to compare two Avro objects based on their
byte[] representations.  This code happens to not create any objects;
rather, it deals with bytes directly, and thus it's "efficient".

This is of historical interest because of Avro's intended use in
Hadoop MapReduce.  Hadoop's sorting instantiates the key objects to do
the sorting, but there's a way to specify a "binary comparator" which
tells Hadoop to instantiate a class that just has a compare(byte[],
byte[]) method instead of a compare(Object, Object) method.  So, the
spec is suggesting that this is possible and that there's an
implementation of it.

You are right that plain ol' byte comparison does not sort Avro
objects correctly.  (This is kind of a bummer, in my opinion.  It
makes Avro objects not something that's useful for HBase keys.)

-- Philip

Re: questions about sort-orders

Posted by Scott Carey <sc...@richrelevance.com>.

On Dec 2, 2010, at 8:54 AM, David Jeske wrote:

One other issue I forgot to ask about...  it says:

int, long, float and double data is ordered by ascending numeric value.

This seems really convenient, but how do you plan to achieve this? You might add a comment about your method.

For example, sorting a long (64bit int) and double (64 bit float) by converting the long to double will produce invalid sort results for longs which overflow the double bits and truncate. It seems pulling off this feat would require detecting when the conversion to double will overflow a double and then convert to an unlimited precision format for the comparison. Is this what you are doing?


Each is individually sorted, doubles aren't compared to longs.   A field is either a double, or a long.  It is not both unless it is a Union, in which case the sort order is defined by the order of the union branches, then the individual types.

Re: questions about sort-orders

Posted by David Jeske <da...@gmail.com>.

One other issue I forgot to ask about...  it says:

int, long, float and double data is ordered by ascending numeric value.


This seems really convenient, but how do you plan to achieve this? You might
add a comment about your method.

For example, sorting a long (64bit int) and double (64 bit float) by
converting the long to double will produce invalid sort results for longs
which overflow the double bits and truncate. It seems pulling off this feat
would require detecting when the conversion to double will overflow a double
and then convert to an unlimited precision format for the comparison. Is
this what you are doing?

Re: questions about sort-orders

Posted by Scott Carey <sc...@richrelevance.com>.

On Dec 2, 2010, at 7:30 AM, David Jeske wrote:

I like the inclusion of sort-order in avro, to enable different machines to sort and exchange. I have a few suggestions to clarify the documentation. Please correct any assumptions I've made that are incorrect...

It seems that sorts are not stable across schema versions. I think I understand why this makes sense inside the schema philosophy, yet I think the documentation could clear up a couple of the subtlties a bit more. For example, it says "data items may only be compared if they have identical schemas". If I supply a source schema which avro can map into my target schema, I would think it could load and compare things in my target schema. Is this correct? It might be clarified.

There is some need for clarification. As I understand it, things are sorted in the order of the reader's schema, but I may be wrong. If the schema changes, the sort order can change. There is no getting around that. Usually as a schema evolves some things that were formerly different become equal, and some things that were equal become different. Typically, the new schema's definition of order and equivalence is all that matters, so a sort will be consistent, but unstable, with respect to the new schema. But some schema changes will break that (such as changing a field from ascending to descending order, or changing the order that fields are compared).

Also, the comment "this permits data written by one system to be efficiently sorted by another system", could callout that data items sorted in one schema may not be in the proper order if during read they are mapped to a new version of the schema. In fact, it might be useful for Avro to be able to tell me when it does the source->target schema mapping, whether both schemas sorted in the same order (if it doesn't already).

It would be useful to provide whether the reader/writer schema resolution altered the sort order or not. I don't think we do this. The answer to that question is not as simple as a yes/no answer however. The sort order when migrated from an old schema to a new one may change completely, or it may remain consistent but be unstable from the POV of the new schema, or be both consistent and stable with respect to sorts using the prior schema.

Lastly, it says "Note also that Avro binary-encoded data can be efficiently ordered without deserializing it to objects." What does this mean exactly? This might be mis-interpreted as saying one can lexicographically sort the binary-encoding without asking Avro to deserialize it, and it'll be in a proper order. However, this seems obviously not true from the number formats. Perhaps it would be clearer to say "Avro can efficiently make sort-comparisons on binary-encoded data without allocating deserialization objects."

Did I properly understand those sort-related subtlties?

Yes, perhaps we should say "Avro can efficiently make sort-comparisons on binary data without full deserialization" or something similar.

Re: questions about sort-orders

Posted by Patrick Linehan <pl...@plinehan.com>.

>
> Lastly, it says "Note also that Avro binary-encoded data can be efficiently
> ordered without deserializing it to objects." What does this mean exactly?
>  This might be mis-interpreted as saying one can lexicographically sort the
> binary-encoding without asking Avro to deserialize it, and it'll be in a
> proper order. However, this seems obviously not true from the number
> formats. Perhaps it would be clearer to say "Avro can efficiently make
> sort-comparisons on binary-encoded data without allocating deserialization
> objects."


i had the exact same question when first coming to avro, so perhaps it does
deserve clarification.

On Thu, Dec 2, 2010 at 7:30 AM, David Jeske <da...@gmail.com> wrote:

> I like the inclusion of sort-order in avro, to enable different machines to
> sort and exchange. I have a few suggestions to clarify the documentation.
> Please correct any assumptions I've made that are incorrect...
>
> It seems that sorts are not stable across schema versions. I think I
> understand why this makes sense inside the schema philosophy, yet I think
> the documentation could clear up a couple of the subtlties a bit more. For
> example, it says "*data items may only be compared if they have identical
> schemas*". If I supply a source schema which avro can map into my target
> schema, I would think it could load and compare things in my target schema.
> Is this correct? It might be clarified.
>
> Also, the comment "*this permits data written by one system to be
> efficiently sorted by another system*", could callout that data items
> sorted in one schema may not be in the proper order if during read they are
> mapped to a new version of the schema. In fact, it might be useful for Avro
> to be able to tell me when it does the source->target schema mapping,
> whether both schemas sorted in the same order (if it doesn't already).
>
> Lastly, it says "*Note also that Avro binary-encoded data can be
> efficiently ordered without deserializing it to objects.*" What does this
> mean exactly?  This might be mis-interpreted as saying one can
> lexicographically sort the binary-encoding without asking Avro to
> deserialize it, and it'll be in a proper order. However, this seems
> obviously not true from the number formats. Perhaps it would be clearer to
> say "Avro can efficiently make sort-comparisons on binary-encoded data
> without allocating deserialization objects."
>
> Did I properly understand those sort-related subtlties?
>
>
>