Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/07/10 18:19:14 UTC
[jira] Commented: (PIG-794) Use Avro serialization in Pig

    [ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700 ] 

Alan Gates commented on PIG-794:
--------------------------------

I agree with Doug's comments that it's better to use an API to build the schema, since that will give us compile-time checking.  I think it will also (hopefully) make the schema easier to figure out when reading the code, as it will avoid the need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's BinStorage to use Avro, including the writing of indicator bytes for types.  I do not have a deep knowledge of Avro.  But I had assumed that since it was a de/serialization framework with types, part of what it would provide was type recognition.  That is, can't this code rely on Avro to set the type for it?  Do we need to be writing those indicator bytes ourselves?  Perhaps this is the same comment that Doug is making about using GenericDatumReader and addField.
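
For context on the indicator-byte question: as far as the Avro specification describes it, a value written under a union schema is preceded by the index of the union branch it belongs to, which plays much the same role as a hand-written type indicator.  The sketch below hand-rolls that encoding for a hypothetical union of long and string.  It does not use the Avro library, the function names are made up for illustration, and it is only meant to show the idea, not how the attached patch works.

```python
def zigzag_varint(n):
    """Encode a signed integer the way Avro encodes int/long:
    zigzag mapping first, then a base-128 variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_union(value):
    """Encode a value under the hypothetical union schema ["long", "string"].

    The branch index is written first, then the value itself, so the
    reader learns the type from the stream without a separate indicator.
    """
    if isinstance(value, int):
        return zigzag_varint(0) + zigzag_varint(value)  # branch 0: long
    if isinstance(value, str):
        raw = value.encode("utf-8")
        # branch 1: string, written as a length followed by UTF-8 bytes
        return zigzag_varint(1) + zigzag_varint(len(raw)) + raw
    raise TypeError("value does not match any branch of the union")


print(encode_union(7))      # b'\x00\x0e'
print(encode_union("pig"))  # b'\x02\x06pig'
```

If Pig's types were mapped onto such a union, the leading branch index would carry the type information, and separate indicator bytes might not be needed.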

In response to Hong's comment, the sync marks are vulnerable as you point out.  But the loader needs some way to find a proper starting place when it's handed any block but the initial block of a file.  I wonder if we could create a new sync type.  It would always consist of a 100 byte marker (say the first 25 prime numbers, or the first 25 digits of pi or something).  We could then write a tuple with that sync type every 1000 records in the data.  A loader that doesn't start at position 0 could then seek to the first sync type it finds before it begins reading.  All loaders would read past the end of their assigned block until they see a sync type.
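
A rough sketch of that scheme, to make the reading rule concrete.  Everything here is an assumption for illustration: the one-byte tags, the length-prefixed record layout, the function names, and the choice to also write a sync at position 0 so that every loader can follow the same rule.  The marker is the first 25 primes packed as 4-byte integers, which gives the 100 bytes.

```python
import struct


def first_primes(n):
    """Return the first n primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


TAG_RECORD = b"R"  # hypothetical indicator for an ordinary record
TAG_SYNC = b"S"    # hypothetical indicator for the sync type
# One tag byte plus the 100-byte marker (25 primes * 4 bytes each).
SYNC = TAG_SYNC + struct.pack(">25i", *first_primes(25))


def write_stream(records, interval=1000):
    """Serialize records, writing a sync before every interval-th record."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i % interval == 0:
            out += SYNC
        out += TAG_RECORD + struct.pack(">i", len(rec)) + rec
    return bytes(out)


def read_split(data, start, end):
    """Read the records owned by the byte range [start, end).

    Seek to the first sync at or after start, then keep reading, possibly
    past end, until a sync that begins at or after end is reached.
    """
    pos = data.find(SYNC, start)
    if pos < 0 or pos >= end:
        return []  # no sync begins inside this split, so nothing to read
    records = []
    while pos < len(data):
        if data.startswith(SYNC, pos):
            if pos >= end:
                break  # this sync belongs to the next loader
            pos += len(SYNC)
            continue
        (length,) = struct.unpack_from(">i", data, pos + 1)
        body = pos + 1 + 4  # skip the tag byte and the length prefix
        records.append(data[body:body + length])
        pos = body + length
    return records
```

With this rule, each run of records between two syncs is read by exactly one loader, the one whose range contains the position where the leading sync begins.  The initial scan can still be fooled if record data happens to contain the full marker, which a 100 byte marker makes unlikely but not impossible.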

As for this being compatible with non-Pig apps, that isn't the purpose of this AvroStorage function.  This is for Pig to pass data between MR jobs for itself.  Having a tool-independent storage format is a bigger project, as it requires agreeing on things like sync marks, how to represent different Avro objects, etc.

> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>             Fix For: 0.2.0
>
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.
