Posted to dev@avro.apache.org by Scott Carey <sc...@richrelevance.com> on 2010/03/25 18:30:23 UTC

Avro files and Pig/Hive

I have been doing some research into figuring out how to read (and later write) Avro container files in Pig and Hive.

This has brought up some interesting challenges.  Below are some of my thoughts on the situation so far.  I'm sure some Avro JIRA tickets will result eventually. 


* PIG
From my preliminary work, mapping Pig to Avro should be relatively easy since the main data types map to each other fairly cleanly.  Both have maps and arrays/bags, for example, and both require string keys for maps.
Making an arbitrary reader/writer will be a bit more of a challenge, but the redesigned load/store API in Pig 0.7 should be better (http://issues.apache.org/jira/browse/PIG-966, http://wiki.apache.org/pig/LoadStoreRedesignProposal).
I wish I had time to make sure their new proposal was sufficient to handle Avro files as cleanly and efficiently as possible before it gets into an official release.
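
As a rough illustration of how direct the mapping is, here is a sketch of an Avro-to-Pig type translation.  The Schema.Type and DataType constants are the real Avro and Pig ones; the helper itself (AvroPigTypes.toPigType) is made up for illustration, and the cases it refuses are exactly the problem areas discussed below:

    import org.apache.avro.Schema;
    import org.apache.pig.data.DataType;

    public class AvroPigTypes {
      // Sketch: map an Avro schema type to the closest Pig DataType constant.
      public static byte toPigType(Schema s) {
        switch (s.getType()) {
          case STRING:  return DataType.CHARARRAY;
          case BYTES:   return DataType.BYTEARRAY;
          case FIXED:   return DataType.BYTEARRAY;  // fixed-size bytes flatten to bytearray
          case INT:     return DataType.INTEGER;
          case LONG:    return DataType.LONG;
          case FLOAT:   return DataType.FLOAT;
          case DOUBLE:  return DataType.DOUBLE;
          case BOOLEAN: return DataType.BOOLEAN;
          case MAP:     return DataType.MAP;        // both sides require string keys
          case ARRAY:   return DataType.BAG;
          case RECORD:  return DataType.TUPLE;
          case NULL:    return DataType.NULL;
          default:      // ENUM, UNION: no clean Pig equivalent -- see below
            throw new IllegalArgumentException("no direct Pig type for " + s.getType());
        }
      }
    }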

Pig may require a lot of 'hidden' unions with null in the schemas if it is used to write generically, since every Pig field is nullable.  The use case best matches the Generic API now, but something else down the road may be better.
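
For instance, a generic writer would likely have to wrap every field schema along these lines (a sketch; nullable() is a hypothetical helper, but Schema.createUnion and Schema.create are the real Avro calls):

    import java.util.Arrays;
    import org.apache.avro.Schema;

    public class NullableSchemas {
      // Sketch: wrap a field's schema in a union with null so Pig nulls fit.
      public static Schema nullable(Schema fieldSchema) {
        return Schema.createUnion(
            Arrays.asList(Schema.create(Schema.Type.NULL), fieldSchema));
      }
    }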

* HIVE
The Hive type system maps to Avro almost completely.  Hive supports arrays, maps, and structs.  Its maps, however, can have any primitive type as a key (int, long, string, float, double).  Other than that, arrays are arrays and structs are records.  Avro files should perform better and be more compact than SequenceFiles.
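
A sketch of that mapping in terms of Hive DDL type names, covering primitives and containers only (toHiveType is a made-up helper; unions are deliberately unhandled -- see the next section):

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;

    public class HiveTypeNames {
      // Sketch: render an Avro schema as a Hive DDL type name.
      public static String toHiveType(Schema s) {
        switch (s.getType()) {
          case STRING:  return "string";
          case INT:     return "int";
          case LONG:    return "bigint";
          case FLOAT:   return "float";
          case DOUBLE:  return "double";
          case BOOLEAN: return "boolean";
          case ARRAY:   return "array<" + toHiveType(s.getElementType()) + ">";
          case MAP:     // Avro map keys are always strings; Hive also allows other primitives
            return "map<string," + toHiveType(s.getValueType()) + ">";
          case RECORD: {
            StringBuilder b = new StringBuilder("struct<");
            String sep = "";
            for (Field f : s.getFields()) {
              b.append(sep).append(f.name()).append(":").append(toHiveType(f.schema()));
              sep = ",";
            }
            return b.append(">").toString();
          }
          default:      // BYTES, FIXED, ENUM, UNION, NULL need special handling
            throw new IllegalArgumentException("no direct Hive type for " + s.getType());
        }
      }
    }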

** Unions are a challenge
Unions are a challenge in both.  Currently I am using Pig with a custom LoadFunc, and for each union I generate a field for each non-null branch plus a field that specifies which branch is used.  This is ... not a good long term solution.  For example, {"name": "myField", "type": ["string", "bytes"]} would generate three Pig fields: myFieldString, myFieldBytes, myFieldType.
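
A sketch of what that flattening looks like, using Avro's GenericData.resolveUnion to find the branch (flattenUnion is a made-up helper, and the three-field layout follows the example above):

    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.util.Utf8;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class UnionFlattener {
      // Sketch: flatten a ["string","bytes"] union value into
      // (myFieldString, myFieldBytes, myFieldType).
      public static Tuple flattenUnion(Schema union, Object datum) throws Exception {
        Tuple t = TupleFactory.getInstance().newTuple(3);
        int branch = GenericData.get().resolveUnion(union, datum);
        if (datum instanceof Utf8) {
          t.set(0, datum.toString());
        } else if (datum instanceof ByteBuffer) {
          ByteBuffer bb = ((ByteBuffer) datum).duplicate();
          byte[] bytes = new byte[bb.remaining()];
          bb.get(bytes);
          t.set(1, new DataByteArray(bytes));
        }
        t.set(2, branch);  // which branch was used
        return t;
      }
    }
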
In Hive, that hack could work and would be equally ugly; alternatively, a "table family" could be created for certain union types, with a table per branch.  In other cases a custom operation is needed.
   
Example 1, small 'leaf' union:  I have a field that is a union of a string and a fixed(16).  In my custom Pig script I just convert the bytes to hex and always use a string, generating one field.  I could also create one field as a variable-length bytes type and use the UTF-8 bytes of the string.  In my case the string is always more than 32 characters (a fixed(16) hex-encodes to exactly 32), so there are no collisions between the branches in either.  These custom field mappings cannot be done with a generic "read any Avro file in Pig/Hive" class.
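
A sketch of that collapse for the string/fixed(16) case (collapse is a made-up helper; GenericFixed and Utf8 are the real Avro runtime types):

    import org.apache.avro.generic.GenericFixed;
    import org.apache.avro.util.Utf8;

    public class LeafUnionCollapse {
      // Sketch: collapse a union of string and fixed(16) into one chararray field.
      public static String collapse(Object datum) {
        if (datum instanceof Utf8) return datum.toString();
        GenericFixed fixed = (GenericFixed) datum;
        StringBuilder hex = new StringBuilder(32);
        for (byte b : fixed.bytes()) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }
    }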

Example 2, large 'branch' union:  Some unions are unions of many larger, more complicated records.  In Pig this can map to a SPLIT (several record streams from one source), or in Hive to a 'table family', but neither can currently be done naturally or automatically -- a fully custom reader/writer is necessary for each schema that has such a 'branch union' in it.
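
One manual workaround is to demultiplex the union into one Avro file per branch, so each branch can back its own Pig relation or Hive table.  A sketch (the helpers and the per-branch file naming are assumptions, not an existing feature; DataFileWriter and resolveUnion are the real Avro APIs):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;

    public class BranchDemux {
      // Sketch: open one writer per union branch.
      public static DataFileWriter<Object>[] openBranchWriters(Schema union, File dir)
          throws Exception {
        int n = union.getTypes().size();
        @SuppressWarnings("unchecked")
        DataFileWriter<Object>[] writers = new DataFileWriter[n];
        for (int i = 0; i < n; i++) {
          Schema branch = union.getTypes().get(i);
          writers[i] = new DataFileWriter<Object>(new GenericDatumWriter<Object>(branch))
              .create(branch, new File(dir, branch.getFullName() + ".avro"));
        }
        return writers;
      }

      // Sketch: route each datum to the writer for its branch.
      public static void route(Schema union, Object datum,
                               DataFileWriter<Object>[] writers) throws Exception {
        writers[GenericData.get().resolveUnion(union, datum)].append(datum);
      }
    }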


Getting some sort of union-type features added to both would be beneficial, even if they are restricted in scope and only cover the more common use cases.

** Avro enhancements
Both the Specific and Generic APIs lead to extra object overhead here.  For example, in Pig one creates the Avro object, then reads its fields and copies them into a Pig tuple.  Lower-level readers are better -- ideally the Pig reader gets callbacks for each field it is interested in, in the order it expects (reader schema order), and fills out its own object.  I think some of our Decoders can operate that way.  A Pig feature that makes it easier to construct tuples out of order (writer schema order) would be useful too.
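
For illustration, a sketch of that kind of lower-level read path against Avro's Decoder API, hard-coding a (long, string) record shape for brevity (decodeDirect is a made-up helper; real code would drive the field loop from the schema):

    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class DirectDecode {
      // Sketch: decode a record of (long id, string name) straight into a
      // Pig tuple, with no intermediate GenericRecord allocation.
      public static Tuple decodeDirect(byte[] payload) throws Exception {
        BinaryDecoder in = DecoderFactory.get().binaryDecoder(payload, null);
        Tuple t = TupleFactory.getInstance().newTuple(2);
        t.set(0, in.readLong());                   // field 0: long
        t.set(1, in.readString(null).toString());  // field 1: string
        return t;
      }
    }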

Hive has a lot of projection features that could be served well by slightly different file formats (for example, the ability to skip variable-length fields faster -- a per-record map of field sizes, perhaps -- could be useful).
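
Avro's Decoder does already expose skip calls that help with projection at decode time; a sketch (skipTwoFields is a made-up helper around the real skipString/skipBytes calls):

    import java.io.IOException;
    import org.apache.avro.io.Decoder;

    public class Projection {
      // Sketch: skip past unwanted (string, bytes) fields to reach a
      // projected one, without materializing the skipped values.
      public static void skipTwoFields(Decoder in) throws IOException {
        in.skipString();  // still has to walk the length-prefixed data
        in.skipBytes();   // a per-record map of field sizes could jump instead
      }
    }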

Neither Pig nor Hive will support recursive schemas.  Is there a quick way to check if a schema is recursive?  In general, some features in Avro that make it easier to 'categorize' a schema would be beneficial.
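
A quick check is possible by walking the schema and tracking named types on the current path; a sketch of what such a method could look like (nothing like this exists in Avro yet -- that is the feature request):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;

    public class SchemaChecks {
      // Sketch: a schema is recursive if a record appears inside itself.
      public static boolean isRecursive(Schema s) {
        return isRecursive(s, new HashSet<String>());
      }

      private static boolean isRecursive(Schema s, Set<String> path) {
        switch (s.getType()) {
          case RECORD:
            if (!path.add(s.getFullName())) return true;  // already on this path
            for (Field f : s.getFields()) {
              if (isRecursive(f.schema(), path)) return true;
            }
            path.remove(s.getFullName());
            return false;
          case ARRAY: return isRecursive(s.getElementType(), path);
          case MAP:   return isRecursive(s.getValueType(), path);
          case UNION:
            for (Schema branch : s.getTypes()) {
              if (isRecursive(branch, path)) return true;
            }
            return false;
          default:    return false;  // primitives, enum, fixed
        }
      }
    }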

Re: Avro files and Pig/Hive

Posted by Doug Cutting <cu...@apache.org>.
Scott Carey wrote:
> ** Avro enhancements Both the Specific and Generic APIs lead to extra
> object overhead here.  For example, in Pig one creates the Avro object,
> then reads its fields and copies them into a Pig tuple.

Can't you instead subclass GenericDatum{Reader/Writer} and override 
methods so that it operates directly on Pig data?  That's the intent.
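
For example, a minimal sketch of that approach.  The hook-method names here (newRecord, addField) follow the 1.3-era GenericDatumReader and have shifted between Avro versions, so treat them as assumptions:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Sketch: a datum reader that builds Pig tuples directly, skipping
    // GenericRecord entirely.  Hook names assume the 1.3-era reader.
    public class PigDatumReader extends GenericDatumReader<Object> {
      public PigDatumReader(Schema schema) { super(schema); }

      @Override
      protected Object newRecord(Object old, Schema schema) {
        return TupleFactory.getInstance().newTuple(schema.getFields().size());
      }

      @Override
      protected void addField(Object record, String name, int pos, Object value) {
        try {
          ((Tuple) record).set(pos, value);
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    }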

> Neither Pig nor Hive will support recursive schemas.  Is there a quick
> way to check if a schema is recursive?  In general, some features in
> Avro that make it easier to 'categorize' a schema would be beneficial.

We could easily add Schema#isRecursive() and Schema#hasNonNullUnions() 
methods if those would be useful.

Doug