Posted to mapreduce-user@hadoop.apache.org by Stan Rosenberg <st...@gmail.com> on 2013/01/17 20:32:11 UTC

Re: task jvm bootstrapping via distributed cache

Hi,

I am back with my original problem.  I am trying to bootstrap the child
JVM via -javaagent.  I am doing what Harsh and Arun suggested, which
also agrees with the documentation.
In theory this should work, but it doesn't.  Any ideas before I start
digging into the code? Thanks.

Here is the command I am using to test:

hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u3.jar wordcount \
  -files "core-tools-0.0.1-SNAPSHOT-common-assembly.jar#foo.jar" \
  -Dmapred.map.child.java.opts="-javaagent:./foo.jar=classes=.*" \
  test1 output

I can see the following (relevant) properties set in job.xml:

mapred.cache.files=/user/srosenberg/.staging/job_201211061805_50132/files/core-tools-0.0.1-SNAPSHOT-common-assembly.jar#foo.jar
mapred.create.symlink=yes
mapred.map.child.java.opts=-javaagent:./foo.jar=classes=.*

The map tasks fail with the following stdout and stderr output, respectively:

Error occurred during initialization of VM
agent library failed to init: instrument

Error opening zip file or JAR manifest missing : ./foo.jar

It seems that the jar is not symlinked into the current working
directory of the child JVM, or perhaps the symlinking happens after
the child JVM starts?
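
The ordering question can be illustrated locally, without Hadoop. The sketch below is only a simulation: the temporary directories stand in for the TaskTracker's local cache and the task's working directory, and the helper name agent_jar_visible is made up for this example. It checks whether the relative path ./foo.jar resolves from the working directory before and after the symlink exists, which is the same condition the JVM faces when it reads -javaagent at startup.

```shell
#!/bin/sh
# Local simulation only: no Hadoop involved. All names here are hypothetical.

# Succeeds if a relative agent path resolves to a readable file from the given cwd.
agent_jar_visible() {
    workdir=$1
    jar=$2
    ( cd "$workdir" && [ -r "$jar" ] )
}

cache_dir=$(mktemp -d)            # stands in for the local cache directory
work_dir=$(mktemp -d)             # stands in for the task's working directory
: > "$cache_dir/core-tools.jar"   # empty placeholder for the localized jar

# State at JVM launch if the symlink has not been created yet.
if agent_jar_visible "$work_dir" ./foo.jar; then before=visible; else before=missing; fi

# State once the symlink has been created.
ln -s "$cache_dir/core-tools.jar" "$work_dir/foo.jar"
if agent_jar_visible "$work_dir" ./foo.jar; then after=visible; else after=missing; fi

echo "before symlink: $before"
echo "after symlink: $after"

rm -rf "$cache_dir" "$work_dir"
```

If the symlink is created only after the JVM has started, the launch sees the "missing" state, which would be consistent with the "Error opening zip file" message above.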




On Fri, Aug 3, 2012 at 1:31 PM, Harsh J <ha...@cloudera.com> wrote:
> Stan,
>
> What Arun says would surely work.
>
> For instance, read this command:
>
> hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0.jar pi
> -files
> "share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.0.0.jar#foo.jar"
> -Dmapred.child.java.opts="-javaagent:./foo.jar" 1 1
>
> What this would do is take the jar you passed via -files (client-common) and
> symlink it into the JVM's working directory (the task's working directory)
> _before_ the JVM is begun, as "foo.jar". So if I additionally pass JVM opts
> that refer to this foo.jar under ./, it would work as you expect,
> as the JVM is begun from that directory (its CWD).
>
> Do let us know if this solves it and makes sense.
>
>
> On Fri, Aug 3, 2012 at 10:02 PM, Stan Rosenberg <st...@gmail.com>
> wrote:
>>
>> Arun,
>>
>> I don't believe the symlink is of help.  The symlink is created in the
>> task's current working directory (cwd), but I don't know what cwd is
>> when I launch with 'hadoop jar ...'.
>>
>> Thanks,
>>
>> stan
>>
>> On Fri, Aug 3, 2012 at 2:39 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> > Stan,
>> >
>> >  You can ask TT to create a symlink to your jar shipped via DistCache:
>> >
>> >
>> > http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html#DistributedCache
>> >
>> >  That should give you what you want.
>> >
>> > hth,
>> > Arun
>> >
>> > On Jul 30, 2012, at 3:23 PM, Stan Rosenberg wrote:
>> >
>> > Hi,
>> >
>> > I am seeking a way to leverage hadoop's distributed cache in order to
>> > ship jars that are required to bootstrap a task's jvm, i.e., before a
>> > map/reduce task is launched.
>> > As a concrete example, let's say that I need to launch with
>> > '-javaagent:/path/profiler.jar'.  In theory, the task tracker is
>> > responsible for downloading cached files onto its local filesystem.
>> > However, the absolute path to a given cached file is not known a
>> > priori, yet we need the path in order to configure '-javaagent'.
>> >
>> > Is this currently possible with the distributed cache? If not, is the
>> > use case appealing enough to open a jira ticket?
>> >
>> > Thanks,
>> >
>> > stan
>> >
>> >
>> > --
>> > Arun C. Murthy
>> > Hortonworks Inc.
>> > http://hortonworks.com/
>> >
>> >
>
>
>
>
> --
> Harsh J

Re: task jvm bootstrapping via distributed cache

Posted by Stan Rosenberg <st...@gmail.com>.
Hi,

As I suspected, cache files are symlinked after the child JVM is
started:  TaskRunner.setupWorkDir is called from
org.apache.hadoop.mapred.Child.main.
This is unfortunate, as it makes it impossible to leverage the
distributed cache for deploying JVM agents.  I could submit a jira
if there is any interest in getting this to work.
Otherwise, I'll think of some other hacks and use a distributed scp as
a last resort.
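
A rough sketch of what the scp fallback could look like, assuming the agent jar is copied to the same absolute path on every node so that -javaagent no longer depends on the task's working directory. The destination path /opt/agents/foo.jar, the host names, and both function names are hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: prints commands instead of executing them.
# AGENT_DEST and the host names are hypothetical placeholders.

AGENT_DEST=/opt/agents/foo.jar

# Print one scp command per host name read from stdin.
print_copy_commands() {
    local_jar=$1
    while IFS= read -r host; do
        [ -n "$host" ] || continue    # skip blank lines
        printf 'scp %s %s:%s\n' "$local_jar" "$host" "$AGENT_DEST"
    done
}

# Build the agent option with an absolute path, independent of the task's cwd.
agent_opts() {
    printf '%s\n' "-javaagent:${AGENT_DEST}=$1"
}

printf 'node1\nnode2\n' |
    print_copy_commands core-tools-0.0.1-SNAPSHOT-common-assembly.jar
echo "-Dmapred.map.child.java.opts=$(agent_opts 'classes=.*')"
```

The printed -D option could then replace the ./foo.jar variant in the test command, with -files dropped since the jar would already be on each node.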

Thanks,

stan
