Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2010/06/11 23:24:13 UTC

[jira] Created: (PIG-1447) Tune memory usage of InternalCachedBag

Tune memory usage of InternalCachedBag
--------------------------------------

                 Key: PIG-1447
                 URL: https://issues.apache.org/jira/browse/PIG-1447
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.7.0
            Reporter: Daniel Dai
             Fix For: 0.8.0


We need to find a better value for "pig.cachedbag.memusage".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1447:
-------------------------------

    Attachment: PIG-1447.1.patch

> Tune memory usage of InternalCachedBag
> --------------------------------------
>
>                 Key: PIG-1447
>                 URL: https://issues.apache.org/jira/browse/PIG-1447
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch
>
>
> We need to find a better value for "pig.cachedbag.memusage".



[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1447:
-------------------------------

    Attachment: L15_modified2.pig



[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899581#action_12899581 ] 

Olga Natkovich commented on PIG-1447:
-------------------------------------

Did you see any perf improvement?



[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1447:
-------------------------------

    Status: Patch Available  (was: Open)

Patch for increasing the default value to 20%.
No new test cases, as this only changes the default memory limit.
All core tests pass. Result of test-patch:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.




[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899576#action_12899576 ] 

Thejas M Nair commented on PIG-1447:
------------------------------------

I ran L15_modified after applying the patch in PIG-1524, which reports the number of bags that spilled and the total number of records that spilled:
|| query || spills with 0.1f || spills with 0.2f ||
| L15_modified | 1.2 million bags containing a total of 5 million records, in the range of 500 MB | 413k bags containing 3 million records, in the range of 300 MB |




[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900939#action_12900939 ] 

Thejas M Nair commented on PIG-1447:
------------------------------------

Some more reasons why a higher value would still be safe:
1. A lot of the memory attributed to the InternalDistinct/InternalSorted bags used within a nested foreach is shared with the InternalCachedBag in the input tuple, because Pig does not create a copy of the column objects.
2. In a nested foreach, only one inner plan at a time holds references to the Internal* bags. The Internal* bags are eventually converted to DefaultDataBag by RelationToExpressionProject in these plans. In the most common cases (say, multiple count-distincts or order-bys on bags in a nested foreach), that means only one Internal* bag created within the nested foreach is referenced at a time. I compared the memory footprint with different numbers of distinct operations in a nested foreach and found them to be in the same range.
I am planning to set the default at 20% for now. If we find memory limits being hit as a result during the beta testing period, we can reduce the default.
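
Point 1 can be pictured with a small analogy. The snippet below is Python rather than Pig's Java, and every name in it is hypothetical; it only shows that a second container built over existing field objects holds references to them, so the field payload is not duplicated in memory.

```python
# Analogy only, not Pig code: two "bags" that reference the same field objects.

# Build an "input bag" whose column values are created at run time.
input_bag = [("user-%d" % (i % 3), "url-%d" % i) for i in range(6)]

# A "distinct" bag over column 0 that stores references to the first
# occurrence of each value instead of copying the value.
first_seen = {}
for row in input_bag:
    first_seen.setdefault(row[0], row[0])
distinct_bag = list(first_seen.values())

# Every element of the distinct bag is the very same object as some
# field in the input bag, so its memory is shared, not added on top.
input_ids = {id(row[0]) for row in input_bag}
shared = all(id(value) in input_ids for value in distinct_bag)
print("distinct values:", len(distinct_bag), "all shared:", shared)
```

If a size estimate counts such a shared object once per container, the sum of the per-bag estimates overstates the real footprint, which is the effect the first point describes.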




[jira] Assigned: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1447:
-----------------------------------

    Assignee: Thejas M Nair



[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901576#action_12901576 ] 

Olga Natkovich commented on PIG-1447:
-------------------------------------

This is probably the smallest patch I have reviewed recently :). +1



[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900332#action_12900332 ] 

Thejas M Nair commented on PIG-1447:
------------------------------------

bq. Did you see any perf improvement? 
No, the query is the same and the performance is the same; it is just that the number of records reported earlier was not correct. In fact, there was also a mistake in the calculation, which I have fixed in the updated patch for PIG-1524.

I made further modifications to L15_modified.pig to use larger columns: L15_modified2.pig (attached). With this query, the number of records dumped is 17.5 million with 0.1f and 20 million with 0.2f for pig.cachedbag.memusage. The records are also much larger in size. I see around a 10% improvement with 0.2f.

Considering the issue in PIG-1544, and that multi-query optimized queries can have a large number of bags, I think it is safer to leave the value at 10% for now. We can add documentation on this property so that users can adjust it if they see a lot of records being proactively spilled.

We should revisit this once PIG-1544 is fixed.
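
For readers unfamiliar with proactive spilling, here is a toy sketch of the idea in Python. It is not Pig's InternalCachedBag: the class name, the fixed per-tuple size estimate, and the tiny heap size are all assumptions chosen to keep the example small. It keeps tuples in memory until a budget derived from a heap fraction is used up, then writes the rest to a temporary file.

```python
import pickle
import tempfile


class ToyCachedBag:
    """Toy proactive-spill bag (hypothetical, not Pig's implementation)."""

    def __init__(self, heap_bytes, memusage, est_tuple_bytes):
        # Budget is a fraction of the heap; est_tuple_bytes is an assumed,
        # fixed per-tuple estimate (a real system would measure it).
        budget = int(round(heap_bytes * memusage))
        self.cache_limit = budget // est_tuple_bytes
        self.in_memory = []
        self.spill_file = None
        self.spilled = 0

    def add(self, tup):
        if len(self.in_memory) < self.cache_limit:
            self.in_memory.append(tup)
            return
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, 2)  # always append at the end of the file
        pickle.dump(tup, self.spill_file)
        self.spilled += 1

    def __iter__(self):
        # In-memory tuples first, then the spilled ones in insertion order.
        yield from self.in_memory
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)

    def close(self):
        if self.spill_file is not None:
            self.spill_file.close()


def count_spilled(memusage, n_tuples=25):
    """Fill a toy bag and report how many tuples went to disk."""
    bag = ToyCachedBag(heap_bytes=10_000, memusage=memusage, est_tuple_bytes=100)
    for i in range(n_tuples):
        bag.add((i, "value-%d" % i))
    contents = list(bag)
    bag.close()
    assert contents == [(i, "value-%d" % i) for i in range(n_tuples)]
    return bag.spilled


print(count_spilled(0.1), count_spilled(0.2))
```

In this toy model a larger fraction always means fewer spilled tuples for the same input; real queries also depend on how many bags are live at once and on how tuple sizes are estimated.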



[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1447:
-------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Patch committed to trunk.



[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1447:
-------------------------------

    Attachment: L15_modified.pig

The search for a better default value for pig.cachedbag.memusage was prompted by the changes in PIG-1443 and PIG-1492. Before the changes made as part of those jiras, Pig was underestimating the memory footprint of data.
For data of 'typical' sizes (chararray/bytearray with fewer than 20 chars), the new memory size estimates can be up to 2 times those of the old version (0.6.0), which does not have these changes.
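
A rough back-of-the-envelope model shows why doubled estimates matter. The Python sketch below is an illustration only, not Pig's code: it assumes a 1 GB task heap, a hypothetical old estimate of 100 bytes per tuple, and a memory budget equal to the heap size times pig.cachedbag.memusage; the helper name is made up.

```python
# Rough model (not Pig's implementation): how many tuples fit in memory
# before a bag has to spill, for a given heap fraction and size estimate.

HEAP_BYTES = 1024 ** 3            # assumed 1 GB max heap per task
OLD_ESTIMATE = 100                # hypothetical old per-tuple estimate (bytes)
NEW_ESTIMATE = 2 * OLD_ESTIMATE   # new estimate, "up to 2 times" the old one


def tuples_before_spill(memusage, tuple_bytes):
    """Tuples that fit in a budget of heap * memusage (hypothetical helper)."""
    budget = int(HEAP_BYTES * memusage)
    return budget // tuple_bytes


old_at_10 = tuples_before_spill(0.10, OLD_ESTIMATE)
new_at_10 = tuples_before_spill(0.10, NEW_ESTIMATE)
new_at_15 = tuples_before_spill(0.15, NEW_ESTIMATE)
new_at_20 = tuples_before_spill(0.20, NEW_ESTIMATE)

print(old_at_10, new_at_10, new_at_15, new_at_20)
```

Under these assumptions, keeping 0.1f with the doubled estimate halves the number of tuples held in memory, 0.2f restores the old count, and 0.15f lands in between.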

I ran Pig queries with the max heap size for tasks set to 1GB, and compared 0.1f and 0.2f as values for pig.cachedbag.memusage. I ran the pigmix v1 queries (L1-L12), a modified pigmix v1 that specifies types, and a modified L15 query that has several distincts in a nested foreach statement.
Only queries L5, L7 and L15 had proactive spills. The number of spills goes down with 0.2f as the value, but the total runtime is practically the same.

(See PIG-1524 for more on the spills currently reported.)

|| query || spills with 0.1f || spills with 0.2f || 
| L5 (original pigmix) | 496k | 0 |
| L7 (original pigmix) | 82k | 0 |
| L5 (with types) | 609k | 82k |
| L7 (with types) | 128k | 0 |
| L15_modified (attached to jira) |  501k | 326k |
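
The relative drop per query can be worked out directly from the table. The short Python snippet below just does that arithmetic on the numbers above; the helper name is hypothetical.

```python
# Spill counts copied from the table above (k = thousand):
# query -> (spills with 0.1f, spills with 0.2f)
spills = {
    "L5 (original pigmix)": (496_000, 0),
    "L7 (original pigmix)": (82_000, 0),
    "L5 (with types)": (609_000, 82_000),
    "L7 (with types)": (128_000, 0),
    "L15_modified": (501_000, 326_000),
}


def reduction_pct(before, after):
    """Percentage drop in spills when moving from 0.1f to 0.2f."""
    return 100.0 * (before - after) / before


reductions = {
    query: round(reduction_pct(before, after), 1)
    for query, (before, after) in spills.items()
}

for query, pct in reductions.items():
    print(f"{query}: {pct}% fewer spills with 0.2f")
```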


Some other factors to consider while determining a new value for this property:
- as a result of the issue described in PIG-1544, not all proactive-spill bags share the memory limit.
- the default value should be low enough that queries work fine in most cases. Expert users can tweak it to improve performance.
- the value of 0.1f has been used for a long time (with the old memory estimate formula), and seems to work for most cases.
- during the above tests, no other queries were running, so the disks were relatively free.

I propose that we increase the default value to 0.15f to accommodate the changes in memory size estimation, so that the spill behavior is closer to what it has been with 0.6 and 0.7.




[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900435#action_12900435 ] 

Thejas M Nair commented on PIG-1447:
------------------------------------

I was wrong about multi-query being a big cause for concern in raising this parameter value: the sub-plans for each query in a multi-query are executed one at a time for a given tuple with large bags. So the number of large bags that can't be garbage collected would be similar to that of a single query. A 15% default value seems to be reasonably safe.


