Posted to mapreduce-issues@hadoop.apache.org by "Jie Li (JIRA)" <ji...@apache.org> on 2012/06/25 09:20:43 UTC

[jira] [Created] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Jie Li created MAPREDUCE-4365:
---------------------------------

             Summary: Shipping Profiler Libraries by DistributedCache
                 Key: MAPREDUCE-4365
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4365
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
    Affects Versions: 1.0.3
            Reporter: Jie Li


Hadoop profiling is great for performance tuning and debugging, but currently we can only use Java built-in profilers such as HProf. Any other profiler must first be installed on all slave nodes, which is inconvenient for large clusters and sometimes impossible for production clusters.

Shipping profiler libraries through the DistributedCache would solve this problem. For example, in mapred.task.profile.params we could reference a profiler library from the DistributedCache using a special placeholder such as <foo.jar>, and Hadoop would look it up in the DistributedCache and replace <foo.jar> with the localized path before launching the child JVM.
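
The substitution step proposed above could look roughly like the following. This is only a sketch of the idea, not existing Hadoop code: the class name ProfilerPlaceholders, the resolve method, the map of localized paths, and the example parameter string are all hypothetical, and a real implementation would take the localized paths from the TaskTracker's DistributedCache bookkeeping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Hadoop.
public class ProfilerPlaceholders {
    // Matches placeholders of the form <name>, for example <foo.jar>.
    private static final Pattern PLACEHOLDER = Pattern.compile("<([^<>\\s]+)>");

    /**
     * Replaces each <name> placeholder in the profile parameters with the
     * localized path registered for that name. Placeholders with no
     * matching entry are left untouched.
     */
    public static String resolve(String profileParams, Map<String, String> localizedPaths) {
        Matcher m = PLACEHOLDER.matcher(profileParams);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String path = localizedPaths.get(m.group(1));
            String replacement = (path != null) ? path : m.group(0);
            // quoteReplacement keeps '$' and '\' in paths from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed mapping from cache entry name to its localized path.
        Map<String, String> localized = new LinkedHashMap<>();
        localized.put("foo.jar", "/tmp/mapred/local/distcache/foo.jar");
        System.out.println(resolve("-javaagent:<foo.jar>=file=%s", localized));
    }
}
```

Leaving unknown placeholders unchanged is a design choice for the sketch; failing the task launch with a clear error might be preferable in practice.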



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400806#comment-13400806 ] 

Robert Joseph Evans commented on MAPREDUCE-4365:
------------------------------------------------

Jie,

I am confused too.  Do you want to profile the task or the task tracker?  If you want to profile the task you can do a combination of what I said and what Arun is saying.

{noformat}
hadoop jar ... -Dmapred.map.child.java.opts="... -agentlib:yjpagent" -Dmapred.reduce.child.java.opts="... -agentlib:yjpagent" -Dmapred.child.env="... LD_LIBRARY_PATH=yourkit/bin/linux-x86-32" -archives '/path/to/yourkit.tgz#yourkit' ...
{noformat}

This is just thrown together from memory and from http://www.yourkit.com/docs/80/help/agent.jsp so some of the parameter options may be wrong, but it should point you down the correct path.
                

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Jie Li (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400766#comment-13400766 ] 

Jie Li commented on MAPREDUCE-4365:
-----------------------------------

Hi Robert,

I don't quite understand your approach, because we need to provide the
path of the profiler libraries to the TaskTracker instead of the
tasks. So if the libraries appear in the task's working directory, how
can the TaskTracker find them when launching the task? And currently the TT
doesn't look into the profile parameters to see if there is any
distributed cache entry.
                

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Jie Li (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401803#comment-13401803 ] 

Jie Li commented on MAPREDUCE-4365:
-----------------------------------

Thanks Arun and Robert. 

I meant profiling tasks. I'm actually using [Hadoop profiling|http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Profiling] by setting mapred.task.profile.{maps|reduces}, so Hadoop automatically sends back the profiling output files.

Your approach won't work because setting up the distributed cache symlinks is currently the task's responsibility, so the symlink does not exist yet when the TT launches the task. Note that TaskRunner#setupWorkDir is called in Child#main.

So one solution is to create the symlinks before launching tasks. Alternatively, for this particular problem, could we replace the distributed cache entry found in the profiling parameters with its localized path?
                

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403191#comment-13403191 ] 

Robert Joseph Evans commented on MAPREDUCE-4365:
------------------------------------------------

@Jie on 1.0 that may work, but I don't know if we are exploding the job.jar for 2.0.  I think we need to have a JIRA for creating the symlinks before launching at least in 2.0.
                

[jira] [Resolved] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Jie Li (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Li resolved MAPREDUCE-4365.
-------------------------------

          Resolution: Fixed
    Target Version/s:   (was: 1.1.0)

One way is to include the profiler library in the job jar and use a relative path such as "../../foo.library" to locate it.
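
To illustrate how such a relative path is interpreted, here is a small sketch that resolves it against a task working directory. The directory layout in the example is made up for illustration, and RelativeProfilerPath is a hypothetical name; where the job jar contents actually end up depends on the Hadoop version and configuration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: shows only how the relative path is resolved.
public class RelativeProfilerPath {
    /**
     * Resolves a relative library path against the task's working
     * directory and collapses any ".." segments.
     */
    public static Path resolveFromWorkDir(String workDir, String relative) {
        return Paths.get(workDir).resolve(relative).normalize();
    }

    public static void main(String[] args) {
        // Assumed layout: <job dir>/<attempt dir>/work is the task's cwd.
        String workDir = "/local/jobcache/job_0001/attempt_0001/work";
        System.out.println(resolveFromWorkDir(workDir, "../../foo.library"));
    }
}
```

With the assumed layout, two ".." segments climb from the working directory to the job directory, so the library would have to sit there for the path to resolve.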

Thanks Deveraj, Sid, Vinod and everyone!
                

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400488#comment-13400488 ] 

Robert Joseph Evans commented on MAPREDUCE-4365:
------------------------------------------------

Why can't you just specify your distributed cache entry with something like hdfs://path/to/profiler.lib#profiler-link.lib?  I know it is a little ugly, but it will add a symbolic link named profiler-link.lib in the current working directory of your task, pointing to wherever the file is in the distributed cache.

You can do this with a tgz or zip too.

hdfs://path/to/archive.tgz#profiler

Now if you want to access lib/profiler.so from inside of archive.tgz you would use a path of profiler/lib/profiler.so.
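
The mapping from cache URI to in-task path described above can be sketched as follows. CacheLink and its methods are hypothetical names used only for illustration; the sketch relies on java.net.URI to pull out the fragment and does not touch any Hadoop API.

```java
import java.net.URI;

// Hypothetical helper: not part of Hadoop.
public class CacheLink {
    /**
     * Returns the symlink name requested by a cache URI, i.e. the part
     * after '#', or null if the URI carries no fragment.
     */
    public static String linkName(String cacheUri) {
        String fragment = URI.create(cacheUri).getFragment();
        return (fragment == null || fragment.isEmpty()) ? null : fragment;
    }

    /**
     * Builds the path, relative to the task's working directory, of a
     * file inside an archive that was shipped with a '#link' fragment.
     */
    public static String pathInTask(String cacheUri, String pathInsideArchive) {
        String link = linkName(cacheUri);
        if (link == null) {
            throw new IllegalArgumentException("no #link fragment in " + cacheUri);
        }
        return link + "/" + pathInsideArchive;
    }

    public static void main(String[] args) {
        System.out.println(linkName("hdfs://path/to/profiler.lib#profiler-link.lib"));
        System.out.println(pathInTask("hdfs://path/to/archive.tgz#profiler", "lib/profiler.so"));
    }
}
```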
                

[jira] [Commented] (MAPREDUCE-4365) Shipping Profiler Libraries by DistributedCache

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400780#comment-13400780 ] 

Arun C Murthy commented on MAPREDUCE-4365:
------------------------------------------

Jie - You can just add the profiler params to mapred.(map,reduce).child.java.opts and the TT will add it to the tasks' jvm launch cmd.
                