Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2011/03/02 01:05:37 UTC

[jira] Updated: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens

     [ https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1875:
----------------------------

    Attachment: mrtuple.patch

Here's a first pass at what MToRTuple might look like.  I've done some basic testing to confirm that it works, but nothing comprehensive.

In test runs where I serialized 100k tuples, wrote them to disk, and read them back, I got the following results:

DefaultTuple:
time to write to disk:       81.93 sec
size on disk:                98M
time to read from disk:      12.62 sec
size in memory (after read): 238M

MToRTuple:
time to write to disk:       10.49 sec
size on disk:                58M
time to read from disk:      1.10 sec
size in memory (after read): 57M

So roughly 1/4 the memory consumption (57M vs. 238M), about 60% of the size on disk, and a speedup of roughly 8x on disk writes and 11x on disk reads.



> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>         Attachments: mrtuple.patch
>
>
> Currently Pig reads records off of the reduce iterator and immediately deserializes them into Java objects.  This takes up much more memory than the serialized form, so Pig spills sooner than it would if it stored them serialized.  Also, if it does have to spill, it has to serialize them again and then deserialize them once more after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce iterator.
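
To make the idea in the description concrete, here is a minimal Java sketch of a tuple that holds on to its serialized bytes and deserializes only when a field is requested.  It is an illustration only, not the code in mrtuple.patch: the class name SerializedTupleSketch, its methods, and the toy wire format (a field count followed by UTF strings) are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; not MToRTuple and not part of Pig's API.
final class SerializedTupleSketch {
    private final byte[] raw;   // fields exactly as they arrived, still serialized
    private String[] fields;    // stays null until a field is first requested

    SerializedTupleSketch(byte[] raw) {
        this.raw = raw;
    }

    // Toy wire format (an assumption): field count, then each field as a UTF string.
    static byte[] serialize(String... values) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(values.length);
            for (String v : values) {
                out.writeUTF(v);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes on first access only; later calls reuse the cached fields.
    String get(int index) {
        if (fields == null) {
            fields = deserialize(raw);
        }
        return fields[index];
    }

    private static String[] deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String[] result = new String[in.readInt()];
            for (int i = 0; i < result.length; i++) {
                result[i] = in.readUTF();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Spilling copies the stored bytes as-is; nothing is reserialized.
    void spill(DataOutput out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);
    }

    // Reading a spilled tuple back is a byte copy; fields stay serialized.
    static SerializedTupleSketch readSpilled(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new SerializedTupleSketch(bytes);
    }

    boolean isDeserialized() {
        return fields != null;
    }

    int serializedSize() {
        return raw.length;
    }
}
```

In this sketch the cost of building Java objects is paid only by tuples whose fields are actually read, and a spill or a read from the spill file is a plain byte copy.  A real implementation would also have to handle Pig's data types and nested bags, which the sketch ignores.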

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira