Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2010/08/12 23:10:17 UTC

[jira] Created: (PIG-1544) proactive-spill bags should share the memory alloted for it

proactive-spill bags should share the memory alloted for it
-----------------------------------------------------------

                 Key: PIG-1544
                 URL: https://issues.apache.org/jira/browse/PIG-1544
             Project: Pig
          Issue Type: Bug
            Reporter: Thejas M Nair


Proactive-spill bags were initially designed for use in (co)group (InternalCacheBag). They knew the total number of proactive bags present and shared the memory limit specified by the property pig.cachedbag.memusage.
The two proactive bag implementations added later, InternalDistinctBag and InternalSortedBag, are not aware of the actual number of bags in use; their users always assume total-numbags = 3.

This needs to be fixed so that all proactive-spill bags share the memory limit.
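
To make the problem concrete, here is a minimal sketch of the arithmetic, assuming the per-bag limit is an even split of a heap fraction across bags. The class name, method name, and numbers are hypothetical and are not Pig's actual code; only the idea of a configured fraction (pig.cachedbag.memusage) and the hard-coded bag count of 3 come from the description above.

```java
public final class BagMemoryShare {

    /**
     * Bytes a single bag may hold in memory before spilling, assuming the
     * budget (heap * fraction) is split evenly across numBags bags.
     * This even-split rule is an assumption made for illustration.
     */
    static long perBagLimit(long maxHeapBytes, double memUsageFraction, int numBags) {
        if (numBags < 1) {
            throw new IllegalArgumentException("numBags must be >= 1");
        }
        if (memUsageFraction <= 0.0 || memUsageFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        return (long) (maxHeapBytes * memUsageFraction / numBags);
    }

    public static void main(String[] args) {
        long heap = 1_200_000_000L; // hypothetical heap size
        double fraction = 0.25;     // hypothetical pig.cachedbag.memusage value
        int assumedBags = 3;        // what the later bag types always assume
        int actualBags = 6;         // hypothetical number of bags really alive

        long limitUsed = perBagLimit(heap, fraction, assumedBags);
        long limitNeeded = perBagLimit(heap, fraction, actualBags);

        System.out.println("limit each bag uses:   " + limitUsed);
        System.out.println("limit each bag needs:  " + limitNeeded);
        System.out.println("combined worst case:   " + (limitUsed * actualBags));
        System.out.println("intended total budget: " + (long) (heap * fraction));
    }
}
```

In this sketch, six live bags that each size themselves for three can together hold twice the intended budget before any of them spills.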

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899162#action_12899162 ] 

Olga Natkovich commented on PIG-1544:
-------------------------------------

One way to do this is to use InternalCacheBags only for the bags that we are aware of up front. Then we can have a visitor on the plan that counts the number of bags needed and divides the memory accordingly.
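
A rough sketch of the counting idea, assuming a much simplified plan model: the Node class, the createsSpillBag flag, and the example plan are all hypothetical stand-ins and do not correspond to Pig's physical plan or visitor classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public final class BagCountSketch {

    /** Hypothetical plan node; not Pig's PhysicalOperator. */
    static final class Node {
        final String name;
        final boolean createsSpillBag;
        final List<Node> inputs;

        Node(String name, boolean createsSpillBag, Node... inputs) {
            this.name = name;
            this.createsSpillBag = createsSpillBag;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Counts the operators in the plan that create a proactive-spill bag. */
    static int countBagCreators(Node root) {
        // Identity-based set so an operator shared by two branches counts once.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        return visit(root, seen);
    }

    private static int visit(Node node, Set<Node> seen) {
        if (!seen.add(node)) {
            return 0; // already visited
        }
        int count = node.createsSpillBag ? 1 : 0;
        for (Node input : node.inputs) {
            count += visit(input, seen);
        }
        return count;
    }

    /** Divides a memory budget evenly among the bag-creating operators. */
    static long perBagLimit(long budgetBytes, Node root) {
        return budgetBytes / Math.max(1, countBagCreators(root));
    }

    /** Hypothetical plan: load -> nested distinct -> group -> foreach. */
    static Node examplePlan() {
        Node load = new Node("load", false);
        Node distinct = new Node("foreach-distinct", true, load);
        Node group = new Node("group", true, distinct);
        return new Node("foreach", false, group);
    }

    public static void main(String[] args) {
        Node plan = examplePlan();
        System.out.println("bag creators:  " + countBagCreators(plan));
        System.out.println("per-bag limit: " + perBagLimit(300_000_000L, plan));
    }
}
```

This only counts the places in the plan where bags are created; it says nothing about how many bags a single operator keeps alive at run time.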



[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899526#action_12899526 ] 

Thejas M Nair commented on PIG-1544:
------------------------------------

bq. We should not be using these bags for the cases like UDF for exactly the reason you are mentioning 
The case I had in mind was not one where a UDF creates proactive-spill bags, but one where the UDF takes bags as input, they happen to be of a proactive-spilling type, and the UDF retains bags from previous rows.

Anyway, I have come up with a more realistic(?) use case in which it is difficult to determine the number of proactive-spill bags that will be present at run time:

{code}
L = load 'f1' as ( c1 : int, b1 : bag{ } );
F1 = foreach L { d = distinct b1; generate c1, d; }  -- InternalDistinctBag will be created here
G = group F1 by c1 using 'merge';  -- this group-by could [1] accumulate several of these InternalDistinctBag objects
F2 = foreach G generate ...
{code}

[1] - This does not currently happen, because the query plan has a PORelationToExpressionProject after the result from PODistinct, which copies the bag. But it looks like we can optimize and get rid of that bag in this case.





[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899529#action_12899529 ] 

Olga Natkovich commented on PIG-1544:
-------------------------------------

So we should not use them in this case either. We should only use internal bags for things we know up front.



[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905628#action_12905628 ] 

Olga Natkovich commented on PIG-1544:
-------------------------------------

I am going to take my previous comment back and say that we should make this work for UDFs as well. The main reason is that we have no other way to make sure that UDFs do not run out of memory. One approach that Alan proposed is to have bags ask for memory when they are created, with a central broker in charge of the memory pool. The details of this, or whether there is a better approach, still need to be thought through.
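
One possible shape for such a broker, purely as a sketch: the Broker and Lease classes, their methods, and the even-split policy are all assumptions made for illustration, since the details of the proposal have not been worked out.

```java
public final class MemoryBrokerSketch {

    /** Hypothetical central broker that owns the memory pool for all bags. */
    static final class Broker {
        private final long poolBytes;
        private int registeredBags;

        Broker(long poolBytes) {
            this.poolBytes = poolBytes;
        }

        /** Called by a bag when it is created. */
        synchronized Lease register() {
            registeredBags++;
            return new Lease(this);
        }

        /** Called when a bag is no longer in use. */
        synchronized void release() {
            registeredBags--;
        }

        /** Even split of the pool across the bags currently registered. */
        synchronized long currentLimit() {
            return poolBytes / Math.max(1, registeredBags);
        }
    }

    /** Handle a bag keeps so it can ask for its current spill threshold. */
    static final class Lease implements AutoCloseable {
        private final Broker broker;
        private boolean closed;

        Lease(Broker broker) {
            this.broker = broker;
        }

        /** The limit shrinks as more bags register and grows as they go away. */
        long limitBytes() {
            return broker.currentLimit();
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                broker.release();
            }
        }
    }

    public static void main(String[] args) {
        Broker broker = new Broker(300_000_000L); // hypothetical pool size
        Lease first = broker.register();
        System.out.println("one bag:  " + first.limitBytes());
        try (Lease second = broker.register()) {
            System.out.println("two bags: " + first.limitBytes());
        }
        System.out.println("one bag:  " + first.limitBytes());
        first.close();
    }
}
```

Because the limit is looked up each time rather than fixed at creation, a bag created inside a UDF would be counted the same way as one created by the plan. How a bag would signal that it is no longer in use is left open here.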



[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900334#action_12900334 ] 

Thejas M Nair commented on PIG-1544:
------------------------------------

While computing the number of bags, we should remember to consider the multi-query case as well.



[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900440#action_12900440 ] 

Thejas M Nair commented on PIG-1544:
------------------------------------

bq. While computing the number of bags, we should remember to consider the multi-query case as well.
In the multi-query case, the sub-plans for each query are executed one at a time for a given tuple with large bags, so the number of large bags that cannot be garbage collected would be similar to that of a single query.

Another thing to keep in mind is that multiple bags working on a common input (as with distinct/order-by in a nested foreach) would share some or most of their memory with the input bag, because Pig does not create copies of the column objects.
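
The sharing effect can be illustrated with plain Java collections. This is only an analogy under stated assumptions: tuples are modelled as List<Object> and bags as lists of tuples, and none of this is Pig's DataBag or Tuple API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public final class SharedTupleSketch {

    /**
     * Builds a "distinct bag" over the input without copying any tuple:
     * the result holds references to the same tuple objects as the input.
     */
    static List<List<Object>> distinct(List<List<Object>> input) {
        // LinkedHashSet keeps the first instance seen for each equal tuple.
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        List<Object> t1 = Arrays.<Object>asList(1, "a");
        List<Object> t2 = Arrays.<Object>asList(1, "a"); // equal to t1
        List<Object> t3 = Arrays.<Object>asList(2, "b");
        List<List<Object>> inputBag = Arrays.asList(t1, t2, t3);

        List<List<Object>> distinctBag = distinct(inputBag);

        // Identity comparison: the distinct bag reuses the input's objects.
        System.out.println("same object as t1: " + (distinctBag.get(0) == t1));
        System.out.println("same object as t3: " + (distinctBag.get(1) == t3));
    }
}
```

The extra memory held by the second bag is then mostly its own bookkeeping, so summing per-bag sizes would overestimate the total footprint.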




[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899227#action_12899227 ] 

Olga Natkovich commented on PIG-1544:
-------------------------------------

We should not be using these bags for the cases like UDF for exactly the reason you are mentioning



[jira] Updated: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1544:
--------------------------------

         Assignee: Thejas M Nair
    Fix Version/s: 0.9.0



[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899221#action_12899221 ] 

Thejas M Nair commented on PIG-1544:
------------------------------------

Note that it will not be possible in all cases to determine, at query plan generation time, the number of bags that will be present at any one time during query execution. For example, a UDF could collect several bags. But that use case is likely to be rare, so I don't think it needs to be considered for the memory size limit estimate. It should be sufficient to determine the number of places where bags are created in the query plan.



