Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (Updated) (JIRA)" <ji...@apache.org> on 2011/11/14 00:13:51 UTC
[jira] [Updated] (PIG-2359) Support more efficient Tuples when schemas are known

     [ https://issues.apache.org/jira/browse/PIG-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2359:
-----------------------------------

    Attachment: PIG-2359.1.patch

The attached patch is a first cut at adding this support.

Note that it changes the TupleFactory interface by adding a couple of new methods for creating optimized tuples.

Two flavors of optimized tuples are provided:

1) For single-field tuples, we provide a PrimitiveFieldTuple, which simply wraps a primitive value (or a string).

2) For multi-field tuples, we provide an implementation that holds the data in a single in-memory ByteBuffer and deserializes the requested field on read. This incurs a small read-time penalty, but I believe it's a good trade-off: (a) most of the time we only read once, and the allocation costs are much lower than for regular tuples; and (b) the memory overhead is several times lower than for regular tuples, so we'll save on GC.
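To make the second flavor concrete, here is a rough sketch of the idea, not the code in the patch. It assumes a fixed-width schema of ints, longs, and doubles and ignores nulls and strings; the class name PackedTupleSketch and the FieldType enum are made up for illustration and are not part of Pig.

```java
import java.nio.ByteBuffer;

/**
 * Illustrative sketch only: a tuple that keeps all of its fields in one
 * ByteBuffer and decodes a field when it is read. Names are hypothetical.
 */
final class PackedTupleSketch {

    /** Hypothetical fixed-width field types (not Pig's DataType constants). */
    enum FieldType {
        INT(Integer.BYTES), LONG(Long.BYTES), DOUBLE(Double.BYTES);

        final int width;

        FieldType(int width) {
            this.width = width;
        }
    }

    private final FieldType[] schema;
    private final int[] offsets;   // byte offset of each field in the buffer
    private final ByteBuffer buf;  // single backing store for all fields

    PackedTupleSketch(FieldType... schema) {
        this.schema = schema.clone();
        this.offsets = new int[schema.length];
        int size = 0;
        for (int i = 0; i < schema.length; i++) {
            offsets[i] = size;
            size += schema[i].width;
        }
        this.buf = ByteBuffer.allocate(size);
    }

    /** Writes the value at its precomputed offset; nulls are not supported here. */
    void set(int i, Object value) {
        int off = offsets[i];
        switch (schema[i]) {
            case INT:    buf.putInt(off, (Integer) value);      break;
            case LONG:   buf.putLong(off, (Long) value);        break;
            case DOUBLE: buf.putDouble(off, (Double) value);    break;
        }
    }

    /** Decodes (and boxes) only the requested field; this is the read-time cost. */
    Object get(int i) {
        int off = offsets[i];
        switch (schema[i]) {
            case INT:    return buf.getInt(off);
            case LONG:   return buf.getLong(off);
            case DOUBLE: return buf.getDouble(off);
            default:     throw new IllegalStateException("unknown field type");
        }
    }

    int size() {
        return schema.length;
    }

    public static void main(String[] args) {
        PackedTupleSketch t = new PackedTupleSketch(
                FieldType.INT, FieldType.LONG, FieldType.DOUBLE);
        t.set(0, 42);
        t.set(1, 7000000000L);
        t.set(2, 2.5);
        System.out.println(t.get(0) + " " + t.get(1) + " " + t.get(2));
    }
}
```

The point of the layout is that a tuple costs one buffer plus a small offset table, rather than one boxed Object per field; the price is a decode and a box on every get().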

Microbenchmark results can be found in the javadoc for PrimitiveTuple.

Note that so far I haven't changed any behavior in existing Pig code, other than changing one interface. The next step would be to start using these Tuples when possible.

One complication is that since we don't push much metadata around with tuples, we can only deserialize them into standard tuples, so all savings are lost once we hit an MR boundary. Changing this would require a pretty significant refactor; I'd love to hear ideas from folks who worked on BinInterSedes on how to do this.

So far, I've played with using these in some UDFs that generate large bags of tuples, and the difference in both speed and memory use is fairly dramatic.
                
> Support more efficient Tuples when schemas are known
> ----------------------------------------------------
>
>                 Key: PIG-2359
>                 URL: https://issues.apache.org/jira/browse/PIG-2359
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2359.1.patch
>
>
> Pig Tuples have significant overhead due to the fact that all the fields are Objects.
> When a Tuple only contains primitive fields (ints, longs, etc), it's possible to avoid this overhead, which would result in significant memory savings.
