Posted to dev@pig.apache.org by "Hong Tang (JIRA)" <ji...@apache.org> on 2009/05/02 02:01:30 UTC

[jira] Issue Comment Edited: (PIG-793) Improving memory efficiency of Tuple implementation

    [ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188 ] 

Hong Tang edited comment on PIG-793 at 5/1/09 4:59 PM:
-------------------------------------------------------

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help when we are just moving tuples from one container to another, since no field would need to be deserialized in that case.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null if not deserialized
  DataByteArray lazyBytes; // e.g. serialized bytes of tuple in avro format.
}
{code} 
# Improving DataByteArray. It could be changed to an interface (needing get(), offset(), and length()), with a DataByteArrayFactory used to create instances in two ways: 
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset (inclusive) to offset+length (exclusive).
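
A minimal sketch of what the interface and factory in the second idea might look like. This only illustrates the proposal and is not the existing org.apache.pig.data.DataByteArray class; the nested Slice class and the checkRange helper are hypothetical names introduced for the example.

```java
import java.util.Arrays;

// Sketch of the proposed interface; NOT the existing Pig DataByteArray class.
interface DataByteArray {
    byte[] get();   // backing array; may be larger than the valid range
    int offset();   // index of the first valid byte in get()
    int length();   // number of valid bytes starting at offset()
}

final class DataByteArrayFactory {
    private DataByteArrayFactory() {}

    // Copies the requested range, so the caller may keep using its buffer.
    static DataByteArray createPrivate(byte[] buf, int offset, int length) {
        checkRange(buf, offset, length);
        return new Slice(Arrays.copyOfRange(buf, offset, offset + length), 0, length);
    }

    // Wraps the caller's buffer without copying. By contract the caller
    // no longer touches bytes in [offset, offset + length).
    static DataByteArray createShared(byte[] buf, int offset, int length) {
        checkRange(buf, offset, length);
        return new Slice(buf, offset, length);
    }

    // Hypothetical helper: rejects ranges that fall outside the buffer.
    private static void checkRange(byte[] buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buf.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
    }

    // Hypothetical holder class for the three values.
    private static final class Slice implements DataByteArray {
        private final byte[] buf;
        private final int offset;
        private final int length;

        Slice(byte[] buf, int offset, int length) {
            this.buf = buf;
            this.offset = offset;
            this.length = length;
        }

        public byte[] get() { return buf; }
        public int offset() { return offset; }
        public int length() { return length; }
    }
}
```

With createShared() the returned object aliases the caller's array, so the "caller will no longer access" contract is enforced by convention only; nothing in the sketch prevents a violation.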

There could be three different implementations of the interface:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).
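
As a rough illustration of the small/large split, the sketch below picks an implementation based on whether offset and length fit in a short. The class names, the wrap() method, and the Short.MAX_VALUE threshold are assumptions made for the example; the comment above does not say where the cut-off should be.

```java
// Sketch of the proposed interface; not the existing Pig class.
interface DataByteArray {
    byte[] get();
    int offset();
    int length();
}

// Hypothetical variant for slices whose offset and length both fit in a short.
final class SmallDataByteArray implements DataByteArray {
    private final byte[] buf;
    private final short offset;
    private final short length;

    SmallDataByteArray(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = (short) offset;  // caller guarantees 0 <= offset <= Short.MAX_VALUE
        this.length = (short) length;  // caller guarantees 0 <= length <= Short.MAX_VALUE
    }

    public byte[] get() { return buf; }
    public int offset() { return offset; }
    public int length() { return length; }
}

// Hypothetical variant for everything else.
final class LargeDataByteArray implements DataByteArray {
    private final byte[] buf;
    private final int offset;
    private final int length;

    LargeDataByteArray(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    public byte[] get() { return buf; }
    public int offset() { return offset; }
    public int length() { return length; }
}

final class DataByteArrays {
    private DataByteArrays() {}

    // Chooses the narrowest representation that can hold offset and length.
    static DataByteArray wrap(byte[] buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buf.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        boolean fitsInShort = offset <= Short.MAX_VALUE && length <= Short.MAX_VALUE;
        return fitsInShort
                ? new SmallDataByteArray(buf, offset, length)
                : new LargeDataByteArray(buf, offset, length);
    }
}
```

Whether the short/short variant actually saves memory depends on the JVM's object layout, since alignment padding may absorb the difference; it would need to be measured before committing to it.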

Note that the change to DataByteArray would break the current semantics where the offset is always 0, and length is always the length of the buffer.
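
Going back to the first idea, here is a small self-contained sketch of the lazy behaviour. It does not implement Pig's Tuple interface: LazyIntTuple, its toy wire format (an int count followed by int fields), and all method names are made up for illustration, with a plain byte[] standing in for the DataByteArray.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lazily deserialized tuple of int fields.
class LazyIntTuple {
    private List<Object> fields; // null if not deserialized
    private byte[] lazyBytes;    // serialized form; null once a set() made it stale

    LazyIntTuple(byte[] serialized) {
        this.lazyBytes = serialized;
    }

    // Toy format (an assumption): field count, then each field as an int.
    static byte[] encode(List<Object> values) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(values.size());
            for (Object v : values) {
                out.writeInt((Integer) v);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes the fields on first access only.
    private void materialize() {
        if (fields != null) {
            return;
        }
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(lazyBytes));
            int n = in.readInt();
            List<Object> decoded = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                decoded.add(in.readInt());
            }
            fields = decoded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isMaterialized() {
        return fields != null;
    }

    Object get(int i) {
        materialize();
        return fields.get(i);
    }

    void set(int i, Object value) {
        materialize();
        fields.set(i, value);
        lazyBytes = null; // the serialized form no longer matches the fields
    }

    // Moving the tuple to another container: reuse the bytes when still valid.
    byte[] toBytes() {
        if (lazyBytes == null) {
            lazyBytes = encode(fields);
        }
        return lazyBytes;
    }
}
```

A tuple that is only copied between containers goes through toBytes() without ever allocating its field objects; the deserialization cost is paid only by tuples that are actually read or modified.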


  
> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than ArrayList.
> There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.