Posted to dev@avro.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2010/10/10 22:49:30 UTC

[jira] Created: (AVRO-679) Improved encodings for arrays

Improved encodings for arrays
-----------------------------

                 Key: AVRO-679
                 URL: https://issues.apache.org/jira/browse/AVRO-679
             Project: Avro
          Issue Type: New Feature
          Components: spec
            Reporter: Stu Hood
            Priority: Minor


There are better ways to encode arrays of varints [1] that are both faster to decode and more space-efficient than encoding each varint independently.

Extending the idea to other types of variable-length data, like 'bytes' and 'string', you could encode the entries of an array block as an array of lengths followed by contiguous byte/utf8 data.

[1] group varint encoding: slides 57-63 of http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
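
For concreteness, here is a minimal Java sketch of group varint decoding (illustrative only, not Avro code). It follows the layout in the slides: one tag byte whose four 2-bit fields give the byte length of each of the next four values, which follow in little-endian order; the exact bit ordering within the tag byte is an assumption here.

    // Group varint decoding sketch; not part of Avro. Tag-byte layout is
    // assumed: bits 2i..2i+1 hold (byteLength - 1) for value i.
    class GroupVarint {
      /** Decodes one group of four ints starting at pos; returns the next offset. */
      static int decodeGroup(byte[] buf, int pos, int[] out) {
        int tag = buf[pos++] & 0xFF;
        for (int i = 0; i < 4; i++) {
          int len = ((tag >>> (2 * i)) & 0x3) + 1;   // 1..4 bytes for value i
          int v = 0;
          for (int b = 0; b < len; b++) {
            v |= (buf[pos++] & 0xFF) << (8 * b);     // little-endian accumulation
          }
          out[i] = v;
        }
        return pos;
      }
    }

A production decoder would replace the inner loop with a 256-entry lookup table keyed on the tag byte, which is where the decode-speed win over byte-at-a-time varints comes from.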


[jira] Commented: (AVRO-679) Improved encodings for arrays

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920544#action_12920544 ] 

Doug Cutting commented on AVRO-679:
-----------------------------------

> encoding a block of <int,string,long> as a record array<int>, array<long>, array<string> might give a 3-6x increase in decode speed

That's what I meant by a schema transformation. You're transforming [<int,string,long>] to <[int],[string],[long]>. This might be done automatically by a layer that implements DatumReader and DatumWriter. The actual schema of the data file would be <[int],[string],[long]>, but application code would treat it as [<int,string,long>].
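
As a plain-Java illustration of what such a layer would do to the data (a sketch only; the field names are hypothetical, and a real implementation would live behind wrapping DatumWriter and DatumReader implementations):

    // The application sees an array of records, [<int,string,long>], but the
    // layer writes a record of parallel arrays, <[int],[string],[long]>, and
    // inverts the transposition on read. Field names are hypothetical.
    class Row { int count; String name; long stamp; }

    class Columns {
      int[] counts; String[] names; long[] stamps;

      static Columns fromRows(java.util.List<Row> rows) {
        Columns c = new Columns();
        int n = rows.size();
        c.counts = new int[n]; c.names = new String[n]; c.stamps = new long[n];
        for (int i = 0; i < n; i++) {
          Row r = rows.get(i);
          c.counts[i] = r.count; c.names[i] = r.name; c.stamps[i] = r.stamp;
        }
        return c;
      }
    }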


[jira] Commented: (AVRO-679) Improved encodings for arrays

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919875#action_12919875 ] 

Doug Cutting commented on AVRO-679:
-----------------------------------

Adding a new fundamental type or encoding is hard to do compatibly. Rather, I wonder whether this could be layered, as a library? One might automatically rewrite schemas and have a layer that transforms data structures accordingly? This could perhaps be done without copying data, as wrapping DatumWriter and DatumReader implementations.

Also related is columnar compression in a data file. In this case, a data file is a sequence of records whose schema might be re-written. For example, a file containing <string,long> pairs might be represented as a data file containing <int,string,long> records, where the int holds the number of characters shared with the previous string and the long holds the difference from the previous long. Schema properties could indicate which fields should be represented as differences. If random access is required, e.g., for mapreduce splitting, then the container (DataFileReader & DataFileWriter in Java) might have per-block callbacks.
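
A sketch of that <string,long> example (plain Java, writing to an in-memory list rather than a real data file; the record layout is as described above):

    // Each <string,long> pair is rewritten as <int,string,long>: the int is
    // the number of characters shared with the previous string, the string is
    // the remaining suffix, and the long is the difference from the previous
    // long. State resets at block boundaries so blocks stay independently
    // decodable, which is what preserves random access for splitting.
    class DeltaBlock {
      static int sharedPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
      }

      /** Rewrites parallel <string,long> columns into <int,string,long> rows. */
      static java.util.List<Object[]> encodeBlock(java.util.List<String> strings,
                                                  java.util.List<Long> longs) {
        java.util.List<Object[]> out = new java.util.ArrayList<Object[]>();
        String prevS = "";
        long prevL = 0;
        for (int i = 0; i < strings.size(); i++) {
          String s = strings.get(i);
          int shared = sharedPrefixLength(prevS, s);
          out.add(new Object[] { shared, s.substring(shared), longs.get(i) - prevL });
          prevS = s;
          prevL = longs.get(i);
        }
        return out;
      }
    }

The per-block callbacks would be the hook for resetting that state, so each block can be decoded without seeing the blocks before it.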


[jira] Commented: (AVRO-679) Improved encodings for arrays

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919886#action_12919886 ] 

Stu Hood commented on AVRO-679:
-------------------------------

> Adding a new fundamental type or encoding is hard to do compatibly.
Agreed, but this particular optimization is only possible with Avro's support, and it opens up a lot of other interesting possibilities. For instance, in your prefix encoding example, encoding a block of <int,string,long> as a record array<int>, array<long>, array<string> might give a 3-6x increase in decode speed (based on the numbers suggested in the link).

It is worth considering how the specification can evolve backwards-compatibly as well: perhaps the next revision of the specification could require a magical 'spec revision' number to be present in all schemas, and would treat a schema that is missing the rev number as a legacy format? This would allow readers and writers to communicate across spec revision boundaries by disabling optimizations/encodings that the other side does not support.
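
To make the negotiation concrete, a sketch (the "spec-rev" property name and the revision numbers are hypothetical, not anything in the spec):

    // Hypothetical negotiation: a schema with no revision number is treated
    // as the legacy format, and optimized encodings are disabled against it.
    import org.apache.avro.Schema;

    class SpecRevision {
      static final int GROUP_VARINT_REV = 2;  // made-up revision that adds it

      static boolean mayUseGroupVarint(Schema writerSchema) {
        String rev = writerSchema.getProp("spec-rev");  // null on legacy schemas
        int writerRev = (rev == null) ? 1 : Integer.parseInt(rev);
        return writerRev >= GROUP_VARINT_REV;
      }
    }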

> One might automatically rewrite schemas and have a layer that transforms datastructures accordingly?
Yea: there is probably room for a schema translation layer above Avro for things like RLE / prefix encoding, but I think it is a separate area of focus.


[jira] Commented: (AVRO-679) Improved encodings for arrays

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920620#action_12920620 ] 

Stu Hood commented on AVRO-679:
-------------------------------

> That's what I meant by a schema transformation.
As far as I know, there is no schema transformation that will let you dodge Avro's varint encoding and use group varint encoding instead; that is where I was suggesting you would get the encoding/decoding speed benefits from using multiple arrays.
