Posted to mapreduce-user@hadoop.apache.org by Yang <te...@gmail.com> on 2014/10/27 07:19:45 UTC

run arbitrary job (non-MR) on YARN ?

I happened to run into this interesting scenario:

I had some Mahout seq2sparse jobs; originally I ran them in parallel in
distributed mode, but because the input files are so small, running them
locally is actually much faster, so I switched them to local mode.

But I run 10 of these jobs in parallel, and when 10 Mahout jobs run
together on the same machine, every one of them becomes very slow.

Is there existing code that takes a given shell script, and possibly some
archive files (which could contain a jar file, or C++-generated
executables), and runs it on YARN? I understand that I could use the YARN
API to code such a thing, but it would be nice if I could just take
something off the shelf and run it from the shell.

Thanks
Yang

Re: run arbitrary job (non-MR) on YARN ?

Posted by Yang <te...@gmail.com>.
thanks!

Re: run arbitrary job (non-MR) on YARN ?

Posted by Kevin <ke...@gmail.com>.
You can accomplish this by using the DistributedShell application that
comes with YARN.

If you copy all your archives to HDFS, then inside your shell script you
can pull those archives into your YARN container and execute whatever you
want, provided all the other system dependencies exist in the container
(correct Java version, Python, C++ libraries, etc.).
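
For instance, the staging step might look something like this (the HDFS
path and archive name are made-up placeholders; use your own):

# Stage the job bundle on HDFS so any container on the cluster can fetch it.
# /user/yang/bundles and mahout-job.tar.gz are hypothetical names.
hadoop fs -mkdir -p /user/yang/bundles
hadoop fs -put mahout-job.tar.gz /user/yang/bundles/

Inside the container script you would then fetch and unpack the bundle
before running anything from it:

# Pull the bundle into the container's working directory and unpack it.
hadoop fs -get /user/yang/bundles/mahout-job.tar.gz .
tar xzf mahout-job.tar.gz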

For example,

In myscript.sh I wrote the following:

#!/usr/bin/env bash
echo "This is my script running!"
echo "Present working directory:"
pwd
echo "Current directory listing: (nothing exciting yet)"
ls
echo "Copying file from HDFS to container"
hadoop fs -get /path/to/some/data/on/hdfs .
echo "Current directory listing: (file should not be here)"
ls
echo "Cat ExecScript.sh (this is the script created by the DistributedShell
application)"
cat ExecScript.sh

Run the DistributedShell application with the hadoop (or yarn) command:

hadoop org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.3.jar \
  -num_containers 1 \
  -shell_script myscript.sh
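
If your script needs arguments or more memory than the default, the client
also accepts flags such as -shell_args and -container_memory; the option
set varies a bit across releases, so check the client's -help output on
your version. For example:

# -container_memory is in MB; -shell_args is passed through to myscript.sh.
hadoop org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.3.jar \
  -num_containers 1 \
  -container_memory 1024 \
  -shell_script myscript.sh \
  -shell_args "first-arg second-arg"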

If you have the YARN log aggregation property set, then you can pipe the
container's logs to your client console using the yarn command:

yarn logs -applicationId application_1414160538995_0035

(replace the application id with yours)
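
The property in question is yarn.log-aggregation-enable; a minimal
yarn-site.xml entry would look like this (the file's location varies by
distribution, and the NodeManagers need a restart for it to take effect):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>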

Here is a quick reference that should help get you going:
http://books.google.com/books?id=heoXAwAAQBAJ&pg=PA227&lpg=PA227&dq=hadoop+yarn+distributed+shell+application&source=bl&ots=psGuJYlY1Y&sig=khp3b3hgzsZLZWFfz7GOe2yhgyY&hl=en&sa=X&ei=0U5RVKzDLeTK8gGgoYGoDQ&ved=0CFcQ6AEwCA#v=onepage&q&f=false

Hopefully this helps,
Kevin
