Posted to dev@pig.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2010/08/31 10:04:58 UTC

[jira] Commented: (PIG-794) Use Avro serialization in Pig

    [ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551 ] 

Jeff Zhang commented on PIG-794:
--------------------------------

I ran some experiments with Avro; AvroStorage_2.patch contains the detailed implementation.

Here I use Avro as the data storage between map-reduce jobs, replacing InterStorage, which is already optimized compared to BinStorage.
I use a simple Pig script that is translated into 2 mapred jobs:
{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):
|| Storage || Time spent on job_1 || Output size of job_1 || Number of mapper tasks in job_2 || Time spent on job_2 || Total time spent on Pig script ||
| AvroStorage | 5min 57 sec | 7.97G | 120 | 16min 50 sec | 22min 47 sec |
| InterStorage | 4min 33 sec | 9.55G | 143 | 17min 17 sec | 21min 50 sec |

The experiment shows that AvroStorage has a more compact format than InterStorage (according to the output size of job_1), but has more serialization overhead (according to the time spent on job_1). I think job_2 takes less time with AvroStorage than with InterStorage because its input (the output of job_1) is much smaller with AvroStorage, so it needs fewer mapper tasks.
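
As a rough cross-check of the table, here is a small Python sketch (not part of the patch) that recomputes the totals and the average input per job_2 mapper. It assumes that "G" in the table means GiB and that the number of mappers is driven mainly by input size; the names `runs` and `summarize` are only for illustration.

```python
# Figures copied from the table above.
# Assumption: "G" means GiB (1 GiB = 1024 MiB).
runs = {
    "AvroStorage": {
        "job1_s": 5 * 60 + 57, "job2_s": 16 * 60 + 50,
        "out_gib": 7.97, "mappers": 120,
    },
    "InterStorage": {
        "job1_s": 4 * 60 + 33, "job2_s": 17 * 60 + 17,
        "out_gib": 9.55, "mappers": 143,
    },
}


def summarize(run):
    """Return (total seconds, average MiB of job_1 output per job_2 mapper)."""
    total_s = run["job1_s"] + run["job2_s"]
    mib_per_mapper = run["out_gib"] * 1024 / run["mappers"]
    return total_s, mib_per_mapper


for name, run in runs.items():
    total_s, mib = summarize(run)
    print(f"{name}: total {total_s // 60}min {total_s % 60}sec, "
          f"~{mib:.1f} MiB per mapper")

avro, inter = runs["AvroStorage"], runs["InterStorage"]
# Relative size reduction of the job_1 output with AvroStorage.
size_saving = 1 - avro["out_gib"] / inter["out_gib"]
# Relative extra time spent on job_1 with AvroStorage.
job1_slowdown = avro["job1_s"] / inter["job1_s"] - 1
print(f"job_1 output is {size_saving:.1%} smaller, "
      f"job_1 takes {job1_slowdown:.1%} longer")
```

Under these assumptions both runs come out at roughly 68 MiB per mapper, which fits the explanation that job_2 needs fewer mappers only because its input is smaller. The job_1 output is about 16.5% smaller with AvroStorage, while job_1 itself takes about 30.8% longer.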

Overall, AvroStorage is not as good as expected.
One reason may be that I am not using Avro's API correctly (I hope the Avro guys can review my code); another may be that Avro's serialization performance is not that good.
BTW, I used Avro trunk.


> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.
