You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2009/11/23 21:43:39 UTC

[jira] Created: (PIG-1102) Collect number of spills per job

Collect number of spills per job
--------------------------------

                 Key: PIG-1102
                 URL: https://issues.apache.org/jira/browse/PIG-1102
             Project: Pig
          Issue Type: Improvement
            Reporter: Olga Natkovich
            Assignee: Sriranjan Manjunath
             Fix For: 0.7.0


Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.

Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.

Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792680#action_12792680 ] 

Sriranjan Manjunath commented on PIG-1102:
------------------------------------------

I ran the test again on my local machine, and it passes. The test failed because of "too many open file descriptors". Is this a hudson related issue?

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793369#action_12793369 ] 

Olga Natkovich commented on PIG-1102:
-------------------------------------

A few questions/comments on the patch:

(1) I think the count should default to 0, not -1.
(2) Does increment of count have to be combined with warn statement. Does this mean that users will see this many warnings? If so, should we combine this with spill message we already print?
(3) I thought we discussed having increment per buffer not per record and to approximate that based on the buffer size. I did not see the code that did this.
(4) I don't think you correctly separated bags that practively spill vs the bags that are spilled by memory manager. All the bags created by DefaultBagFactory get registerf with SpillableMemoryManager and belong to the second category.


> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment:     (was: PIG_1102.patch.1)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791810#action_12791810 ] 

Hadoop QA commented on PIG-1102:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428253/PIG_1102.patch
  against trunk revision 891499.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 406 release audit warnings (more than the trunk's current 403 warnings).

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/134/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/134/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/134/console

This message is automatically generated.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Open  (was: Patch Available)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Open  (was: Patch Available)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Patch Available  (was: Open)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793771#action_12793771 ] 

Sriranjan Manjunath commented on PIG-1102:
------------------------------------------

1. The default is -1 to distinguish it from the case where there is no spill. The value is set to -1, if counters could not be initialized which is an exception.
2. warn is a misnomer. I reused an existing function which updates the counter if initialized. If the counter is not initialized, it dumps a warning.
3,4. I have fixed these and submitted a new patch.


> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment: PIG_1102.patch

There are no test cases included in the patch since it was difficult to consistently spill in a unit test case. I have manually tested the change. The easiest way to test this to load a huge data bag (1gb or so) and watch the map reduce UI. The UI will show new counters - SPILLABLE_MEMORY_MANAGER_SPILL_COUNT or PROACTIVE_SPILL_COUNT depending on the type of POPackage used.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment: PIG_1102.patch.1

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Open  (was: Patch Available)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793868#action_12793868 ] 

Olga Natkovich commented on PIG-1102:
-------------------------------------

I could not figure out how the patch addressed (3). Could you, please, explain.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Patch Available  (was: Open)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793911#action_12793911 ] 

Hadoop QA commented on PIG-1102:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428760/PIG_1102.patch.1
  against trunk revision 893053.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 411 release audit warnings (more than the trunk's current 408 warnings).

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/152/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/152/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/152/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/152/console

This message is automatically generated.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792399#action_12792399 ] 

Hadoop QA commented on PIG-1102:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428356/PIG_1102.patch
  against trunk revision 892125.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 400 release audit warnings (more than the trunk's current 397 warnings).

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/139/console

This message is automatically generated.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1102:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

patch committed. Thanks, Sri!

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Patch Available  (was: Open)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784148#action_12784148 ] 

Sriranjan Manjunath commented on PIG-1102:
------------------------------------------

Hadoop currently does not provide us average CPU usage  / mem usage per job. It even does not provide the number of spills per job. I have created a jira requesting the same: https://issues.apache.org/jira/browse/MAPREDUCE-1257

The only information we can currently gather is the number of spill records.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment: PIG_1102.patch

Release audit warnings are related to html files.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792612#action_12792612 ] 

Olga Natkovich commented on PIG-1102:
-------------------------------------

I will be reviewing this patch

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment: PIG_1102.patch.1

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Open  (was: Patch Available)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Attachment:     (was: PIG_1102.patch)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Patch Available  (was: Open)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (PIG-1102) Collect number of spills per job

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai closed PIG-1102.
---------------------------


> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794217#action_12794217 ] 

Sriranjan Manjunath commented on PIG-1102:
------------------------------------------

(3) refers to the case where we try to guess the number of records that fit into memory and start spilling the other records. InternalCachedBag.java addresses this case:

+                if (cacheLimit!= 0 && mContents.size() % cacheLimit == 0) {
+                    /* Increment the spill count*/
+                    incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT);                    
+                }
             }

cacheLimit holds the number of records that can be held in memory whereas mContents is the tuple that holds all the records. Here, I do not increment the counter for every record. Instead I count every n'th record, n being the cacheLimit.

This however, does not increment the counter by the buffer size. Incrementing it by the buffer size will give us a value which approximately equal to the number of spilled records.

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1102) Collect number of spills per job

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1102:
-------------------------------------

    Status: Patch Available  (was: Open)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill do the disk is useful for understanding query performance and also to see how certain changes in Pig effect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.