Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2009/01/27 01:03:02 UTC

[jira] Created: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner
----------------------------------------------------------------------------------------------------------------

                 Key: PIG-636
                 URL: https://issues.apache.org/jira/browse/PIG-636
             Project: Pig
          Issue Type: Improvement
    Affects Versions: types_branch
            Reporter: Pradeep Kamath
            Assignee: Pradeep Kamath
             Fix For: types_branch


Currently, whenever the Combiner is used in Pig, the POPreCombinerLocalRearrange operator in the map puts the single "value" tuple corresponding to a key into a DataBag and passes it to the foreach that is being combined. This generates as many bags as there are input records. Each of these bags holds a single tuple, so they are small and should never need to be spilled to disk. However, since the bags are created through the BagFactory mechanism, each bag creation is registered with the SpillableMemoryManager, which stores a weak reference to the bag in a linked list. This linked list grows very large over time, causing unnecessary garbage collection runs. This can be avoided by having a simple, lightweight implementation of the DataBag interface that stores the single tuple in a bag. These SingleTupleBags should also be created without registering with the SpillableMemoryManager. Likewise, the bags created in POCombinerPackage are supposed to fit in memory and not spill. Here too, a NonSpillableDataBag implementation of the DataBag interface that does not register with the SpillableMemoryManager would help.
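
To make the cost concrete, here is a minimal sketch of the registration pattern described above. It is not Pig's actual code: SketchMemoryManager, register(), trackedCount() and bagsTracked() are hypothetical names, and plain Java lists stand in for bags. Only the general shape, a linked list of weak references that gains one entry per bag created, is taken from the description.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for a memory manager that tracks every bag it is told about.
class SketchMemoryManager {
    private final LinkedList<WeakReference<Object>> spillables = new LinkedList<>();

    // Stores a weak reference; the list entry itself stays until someone prunes it.
    void register(Object bag) {
        spillables.add(new WeakReference<>(bag));
    }

    int trackedCount() {
        return spillables.size();
    }
}

class RegistrationCostSketch {
    // Creates one single-element "bag" per record, registering each bag only
    // when viaFactory is true, and returns the length of the manager's list.
    static int bagsTracked(int records, boolean viaFactory) {
        SketchMemoryManager manager = new SketchMemoryManager();
        for (int i = 0; i < records; i++) {
            List<Integer> bag = new ArrayList<>(1);
            bag.add(i);
            if (viaFactory) {
                manager.register(bag);
            }
        }
        return manager.trackedCount();
    }

    public static void main(String[] args) {
        System.out.println(bagsTracked(100_000, true));   // one list entry per record
        System.out.println(bagsTracked(100_000, false));  // nothing tracked
    }
}
```

In this sketch the list is never pruned, so its length equals the number of bags ever registered, even though every bag is garbage almost immediately. Skipping registration for bags that can never spill removes that bookkeeping entirely.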


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-636:
-------------------------------

    Attachment: PIG-636-v2.patch

Attached new version of patch with the following two changes as per review comments:
1) SingleTupleBag now has only one constructor, which takes the tuple the bag is meant to contain. This way a SingleTupleBag can only be created together with its member Tuple.
2) size() now returns 1.
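
A minimal sketch of the shape these two changes imply, under stated assumptions: SketchTuple and SingleTupleBagSketch are hypothetical, simplified stand-ins and do not implement Pig's real Tuple or DataBag interfaces. The point is only the single constructor, the fixed size of 1, and the absence of any registration call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified tuple; Pig's real Tuple interface is richer.
final class SketchTuple {
    final List<Object> fields;

    SketchTuple(Object... fields) {
        this.fields = Arrays.asList(fields);
    }
}

// Hypothetical single-tuple bag: no factory, no registration, no spilling.
final class SingleTupleBagSketch implements Iterable<SketchTuple> {
    private final SketchTuple item;

    // The only constructor: a bag cannot exist without its one tuple.
    SingleTupleBagSketch(SketchTuple item) {
        this.item = item;
    }

    // Always exactly one tuple.
    long size() {
        return 1;
    }

    // Iterates over the single member tuple.
    @Override
    public Iterator<SketchTuple> iterator() {
        return Collections.singletonList(item).iterator();
    }
}
```

Because the tuple is supplied at construction and never replaced, the bag needs no backing collection and no add() path.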



[jira] Updated: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-636:
-------------------------------

    Attachment: PIG-636.patch



[jira] Commented: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667778#action_12667778 ] 

Olga Natkovich commented on PIG-636:
------------------------------------

+1 on the patch. The only minor thing I saw that needs to change: size() of a single-tuple bag should return 1 rather than 0.



[jira] Updated: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-636:
-------------------------------

    Status: Patch Available  (was: Open)

Attached patch with implementations of SingleTupleBag and NonSpillableDataBag. SingleTupleBag is a simple implementation of the DataBag interface which has just a single Tuple member representing the contents of the bag. Its iterator() simply returns this member Tuple object. NonSpillableDataBag is more generic and can hold any number of tuples. It essentially has all the functionality of a DefaultDataBag without the ability to spill to disk. Both classes are to be used by calling their constructors directly, without going through the BagFactory, and are meant only for use in internal operators to transfer data in bags.
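
As a rough illustration of the NonSpillableDataBag idea, the following sketch shows a bag that holds any number of tuples purely in memory. NonSpillableBagSketch and its methods are hypothetical and generic over the element type; it does not implement Pig's DataBag interface and is not the code in the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory bag: holds any number of tuples, is constructed
// directly (no factory), never registers with a memory manager, never spills.
final class NonSpillableBagSketch<T> implements Iterable<T> {
    private final List<T> contents = new ArrayList<>();

    // Appends a tuple to the in-memory list.
    void add(T tuple) {
        contents.add(tuple);
    }

    // Number of tuples currently held.
    long size() {
        return contents.size();
    }

    // Iterates over the tuples in insertion order.
    @Override
    public Iterator<T> iterator() {
        return contents.iterator();
    }
}
```

The trade-off is explicit: a bag like this is only safe where the contents are known to fit in memory, which is why the patch restricts both classes to internal operators.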

POPreCombinerLocalRearrange has been changed to use SingleTupleBag, and POCombinerPackage has been changed to use NonSpillableDataBag. The CombinerOptimizer has been changed to include a new visitor which visits the POProjects in the map plan with result type BAG. Such POProjects are annotated with a flag to use SingleTupleBags. This is necessary because changing only POPreCombinerLocalRearrange to use SingleTupleBag would still leave the problem reported in the description for algebraic aggregates which work on projections of bags, like the following:
{code}
a = load...
b = group a by all;
c = foreach b generate SUM(a.$0), COUNT(a.$1);
{code}

POProject has been changed to check the above flag and use a SingleTupleBag in getNext(DataBag).



[jira] Updated: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-636:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed.
