Posted to mapreduce-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (Created) (JIRA)" <ji...@apache.org> on 2011/11/15 16:28:52 UTC

[jira] [Created] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly
---------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3402
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3402
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2
    Affects Versions: 0.23.0
            Reporter: Vinod Kumar Vavilapalli
             Fix For: 0.23.1


The world was rosier before October 19-25, [~karams] says.

The 100K 1-second sleep job used to take around 800 secs, or 13-14 mins. It now runs for 45 mins and still manages to complete only about 45K tasks.

One/more of the flurry of commits for 0.23.0 deserve(s) the blame.
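
To put the two runs side by side, here is a rough back-of-the-envelope sketch. It takes the earlier run as about 800 seconds (13-14 minutes) for all 100K maps and the regressed run as about 45K maps after 45 minutes; the class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Hadoop: compares task throughput
// before and after the regression using the figures from the report.
final class SleepJobThroughput {
    /** Completed tasks per second of wall-clock time. */
    static double tasksPerSecond(long tasks, long seconds) {
        return (double) tasks / seconds;
    }

    public static void main(String[] args) {
        // Before: 100K maps in roughly 800 seconds (13-14 minutes).
        double before = tasksPerSecond(100_000, 800);
        // After: only about 45K maps completed after 45 minutes.
        double after = tasksPerSecond(45_000, 45 * 60);

        System.out.printf("before:   %.1f tasks/s%n", before);
        System.out.printf("after:    %.1f tasks/s%n", after);
        System.out.printf("slowdown: %.1fx%n", before / after);
    }
}
```

On these numbers the AM went from about 125 tasks/s to under 17 tasks/s, a slowdown of roughly 7.5x.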

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3402:
-------------------------------------

    Fix Version/s:     (was: 0.23.1)
    


        

[jira] [Updated] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3402:
-------------------------------------

    Fix Version/s: 0.23.1
    


        

[jira] [Resolved] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli resolved MAPREDUCE-3402.
------------------------------------------------

    Resolution: Fixed

Fixed after MAPREDUCE-3511.
                


        

[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163206#comment-13163206 ] 

Siddharth Seth commented on MAPREDUCE-3402:
-------------------------------------------

Possibly different from Vinod's leads. With some changes to the environment, and maybe as a result of a few more commits, the job does complete.
A couple of observations:
- The first tens of thousands of maps finish pretty fast.
- GC kicks in midway through the job and can't reclaim much. Spends several cycles where nothing is reclaimed before managing to reclaim a small amount.
- Counters are taking up a good amount of heap.
- JobHistory writes cannot keep up.
- Bumping up the AM heapsize does help.

This doesn't explain why the performance was better pre-Oct 19, though. Opening and linking 2 JIRAs (non-blockers, since increasing the heap works well) for possible changes to counters and JobHistory.
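
One way to check observations like the GC ones above is to read the JVM's own counters. The following is a minimal sketch using only the standard `java.lang.management` API; the class `GcOverheadProbe` and its methods are hypothetical and not part of the AM. It only illustrates how "time spent in GC" and heap occupancy could be sampled from inside a JVM.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical probe, not part of Hadoop: reports how much of the JVM's
// uptime has gone to garbage collection and how full the heap is.
final class GcOverheadProbe {
    /** Total milliseconds spent in GC across all collectors so far. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Approximate fraction of JVM uptime spent in GC, clamped to [0, 1]. */
    static double gcFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptime <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) totalGcMillis() / uptime);
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() can be -1 when the maximum is undefined.
        System.out.printf("heap used: %d MB, max: %d MB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);
        System.out.printf("time in GC: %d ms (%.1f%% of uptime)%n",
                totalGcMillis(), 100 * gcFraction());
    }
}
```

A GC fraction that keeps climbing while used heap stays near the maximum would match the "several cycles where nothing is reclaimed" pattern. Bumping the AM heap means passing a larger `-Xmx` in the AM's JVM options; in MRv2 that is usually done through the `yarn.app.mapreduce.am.command-opts` property, though the exact property name for the build discussed here is an assumption.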
                


        

[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Sharad Agarwal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151029#comment-13151029 ] 

Sharad Agarwal commented on MAPREDUCE-3402:
-------------------------------------------

Just FYI, org.apache.hadoop.mapreduce.v2.app.MRAppBenchmark can be used to benchmark the AM, mainly for memory usage, job latencies and state machine transitions. It doesn't, however, capture the remoting/RPC issues, as it doesn't run on a real cluster.
                


        

[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13155819#comment-13155819 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3402:
----------------------------------------------------

I got quite a few leads. Multiple issues in play.

Still debugging with some raw patches.
                


        

[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151099#comment-13151099 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3402:
----------------------------------------------------

Independent invention! I was so into debugging I didn't check the JIRA posts. Yes, I am using the same benchmark, reproduced many oddities with 100K maps, and was extolling you along the way for the benchmark :)

Playing with heap-dumps and profilers on this benchmark now.
                


        

[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164900#comment-13164900 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3402:
----------------------------------------------------

[~karams] has been extremely helpful in running various tests to hunt this down. And we finally got some results after a couple of weeks of hard work.

Turns out that most of the issues are because we made a switch from 32-bit JVMs to 64-bit. Using compressed references dramatically increased the AM's speed, and the job finishes in around 30-35 mins. That is still a regression, but at least the job finishes after the compressed-oops setting and/or changing the JVM back to 32-bit.

Giving more heap to the 32-bit JVM, around 3GB, helps to finish the job in around 7-8 mins. But that isn't something we want to do for all jobs. Getting back to the original speed this way definitely means that the AM is wasting away time in GCs. Some of the observations Sid made above may hint at the root culprit.

Will file separate tickets to fix the inefficiencies.
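
For anyone repeating this comparison, it helps to confirm which mode a JVM is actually running in. The sketch below assumes a HotSpot JVM, since it relies on `com.sun.management.HotSpotDiagnosticMXBean`, which is HotSpot-specific. It prints the data model, whether compressed references (the `UseCompressedOops` flag) are on, and the maximum heap; the class name `CompressedOopsCheck` is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical check, not part of Hadoop. Assumes a HotSpot JVM.
final class CompressedOopsCheck {
    /**
     * Returns "true" or "false" as reported by the UseCompressedOops flag,
     * or "unavailable" if this JVM does not expose such a flag
     * (for example a 32-bit or non-HotSpot JVM).
     */
    static String compressedOops() {
        try {
            HotSpotDiagnosticMXBean hs =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (hs == null) {
                return "unavailable";
            }
            return hs.getVMOption("UseCompressedOops").getValue();
        } catch (RuntimeException e) {
            // getVMOption throws IllegalArgumentException for unknown flags.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("os.arch:             " + System.getProperty("os.arch"));
        // sun.arch.data.model is a HotSpot convention and may be null elsewhere.
        System.out.println("sun.arch.data.model: " + System.getProperty("sun.arch.data.model"));
        System.out.println("UseCompressedOops:   " + compressedOops());
        System.out.println("max heap (MB):       " + (Runtime.getRuntime().maxMemory() >> 20));
    }
}
```

Running it with the same options as the AM (for instance `-XX:+UseCompressedOops` or `-Xmx3g`, both standard HotSpot options) shows whether those settings took effect.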
                


        

[jira] [Updated] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3402:
-----------------------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: MAPREDUCE-3561
    


        

[jira] [Updated] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3402:
-------------------------------------

    Priority: Blocker  (was: Major)
    


        

[jira] [Assigned] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

Posted by "Vinod Kumar Vavilapalli (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli reassigned MAPREDUCE-3402:
--------------------------------------------------

    Assignee: Vinod Kumar Vavilapalli
    
