Posted to hdfs-user@hadoop.apache.org by bo yang <bo...@gmail.com> on 2014/04/28 21:57:20 UTC

Running YARN in-process Application Master?

Hi All,

I just joined this group, and I am not sure whether this question has been
discussed before.

Is it possible to run an Application Master within the same process as the
Node Manager? If not, is there any plan to support it in the future?

I am asking because we might want to use YARN as a job dispatching
system. We have our own heavyweight service, which is not practical to
launch as a new process for each new job. One thing we could do is use
YARN to launch a dummy Application Master process that dispatches the
job to our service. But that dummy process is somewhat wasteful, so it
would be great if YARN supported an in-process Application Master.

Thanks,
Bo

Re: Issue with partitioning of data using hadoop streaming

Posted by Aleksandr Elbakyan <ra...@yahoo.com>.


Any suggestions?

---------

Hello,

I am having an issue with partitioning data between mappers and reducers when the key is numeric. When I switch the key to a one-character string it works fine, but I have more than 26 keys, so I am looking for an alternative.


My data look like:

10 \t comment10 \t data
20 \t comment20 \t data
30 \t comment30 \t data
40 \t comment40 \t data

up to 250

The data is around 50 million lines.

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
    -D mapred.task.timeout=3600000 \
    -D mapred.map.tasks=25 \
    -D stream.non.zero.exit.is.failure=true \
    -D mapred.reduce.tasks=25 \
    -D mapred.output.compress=true \
    -D mapred.text.key.partitioner.options=-k1,1n \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -input "input" \
    -output "output" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=1 \
    -jobconf map.output.key.field.separator=\t \
    -jobconf num.key.fields.for.partition=1 \
    -mapper " cat " \
    -reducer " cat "




The other issue I have is with stream.map.output.field.separator: when I set it to a tab, it adds a space to my data when keys are greater than or equal to 100.


Any suggestions on how to fix this?

Issue with partitioning of data using hadoop streaming

Posted by Aleksandr Elbakyan <ra...@yahoo.com>.

Hello,

I am having an issue with partitioning data between mappers and reducers when the key is numeric. When I switch the key to a one-character string it works fine, but I have more than 26 keys, so I am looking for an alternative.


My data look like:

10 \t comment10 \t data
20 \t comment20 \t data
30 \t comment30 \t data
40 \t comment40 \t data

up to 250

The data is around 50 million lines.

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
    -D mapred.task.timeout=3600000 \
    -D mapred.map.tasks=25 \
    -D stream.non.zero.exit.is.failure=true \
    -D mapred.reduce.tasks=25 \
    -D mapred.output.compress=true \
    -D mapred.text.key.partitioner.options=-k1,1n \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -input "input" \
    -output "output" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=1 \
    -jobconf map.output.key.field.separator=\t \
    -jobconf num.key.fields.for.partition=1 \
    -mapper " cat " \
    -reducer " cat "




The other issue I have is with stream.map.output.field.separator: when I set it to a tab, it adds a space to my data when keys are greater than or equal to 100.


Any suggestions on how to fix this?
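
One thing that may be worth checking in the command above is shell quoting. An unquoted \t on the command line is rewritten by the shell before hadoop ever sees it. The sketch below only shows what the shell passes along; it does not say how Hadoop interprets each form, and the variable names are made up for illustration.

```shell
# Unquoted: the shell removes the backslash, so the program receives "...=t".
unquoted=$(printf '%s' map.output.key.field.separator=\t)

# Single-quoted: the backslash and the "t" are passed through as two characters.
quoted=$(printf '%s' 'map.output.key.field.separator=\t')

# A real tab character, built portably with printf and passed inside double quotes.
tab=$(printf '\t')
real_tab="map.output.key.field.separator=${tab}"

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
printf 'real tab: [%s]\n' "$real_tab"
```

If the separator really is meant to be a tab, passing the value in quotes (or building it as in the last form) at least guarantees that the shell does not turn it into the letter "t".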

Re: Running YARN in-process Application Master?

Posted by bo yang <bo...@gmail.com>.

This is great info for me. Thanks Oleg! I will take a look. Hope it can
also fit in our production environment.

Best Regards,
Bo


On Tue, Apr 29, 2014 at 3:38 AM, Oleg Zhurakousky <
oleg.zhurakousky@gmail.com> wrote:

> Yes, there is. You can provide your own implementation of
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor and
> configure it via the 'yarn.nodemanager.container-executor.class' property.
> There you can bypass Shell and create your own way of invoking processes.
> Obviously, it only makes sense for testing with a mini-cluster during
> development, and I think that's what you meant.
> I've been playing with it lately, so feel free to take a look:
> https://github.com/olegz/yaya/wiki/CoreFeatures#in-jvm-container-executor.
> At least you'll get some ideas out of it.
>
> Cheers
> Oleg
>
>
> On Mon, Apr 28, 2014 at 3:57 PM, bo yang <bo...@gmail.com> wrote:
>
>> Hi All,
>>
>> I just joined this group, and I am not sure whether this question has been
>> discussed before.
>>
>> Is it possible to run an Application Master within the same process as the
>> Node Manager? If not, is there any plan to support it in the future?
>>
>> I am asking because we might want to use YARN as a job dispatching
>> system. We have our own heavyweight service, which is not practical to
>> launch as a new process for each new job. One thing we could do is use
>> YARN to launch a dummy Application Master process that dispatches the
>> job to our service. But that dummy process is somewhat wasteful, so it
>> would be great if YARN supported an in-process Application Master.
>>
>> Thanks,
>> Bo
>>
>>
>

Re: Running YARN in-process Application Master?

Posted by Oleg Zhurakousky <ol...@gmail.com>.

Yes, there is. You can provide your own implementation of
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor and
configure it via the 'yarn.nodemanager.container-executor.class' property.
There you can bypass Shell and create your own way of invoking processes.
Obviously, it only makes sense for testing with a mini-cluster during
development, and I think that's what you meant.
I've been playing with it lately, so feel free to take a look:
https://github.com/olegz/yaya/wiki/CoreFeatures#in-jvm-container-executor.
At least you'll get some ideas out of it.

Cheers
Oleg

On Mon, Apr 28, 2014 at 3:57 PM, bo yang <bo...@gmail.com> wrote:

> Hi All,
>
> I just joined this group, and I am not sure whether this question has been
> discussed before.
>
> Is it possible to run an Application Master within the same process as the
> Node Manager? If not, is there any plan to support it in the future?
>
> I am asking because we might want to use YARN as a job dispatching
> system. We have our own heavyweight service, which is not practical to
> launch as a new process for each new job. One thing we could do is use
> YARN to launch a dummy Application Master process that dispatches the
> job to our service. But that dummy process is somewhat wasteful, so it
> would be great if YARN supported an in-process Application Master.
>
> Thanks,
> Bo
>
>
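
For anyone trying the suggestion above, the configuration side might look roughly like the following. This is only a sketch: the property name is the one quoted in the reply, but the class name com.example.InJvmContainerExecutor and the output file name are placeholders invented here, not anything from the thread or from the linked project.

```shell
# Hypothetical executor class; replace with your own implementation's name.
EXECUTOR_CLASS="com.example.InJvmContainerExecutor"

# Write a property fragment that could be merged into yarn-site.xml.
# The file name is arbitrary and chosen only for this example.
cat > yarn-site-fragment.xml <<EOF
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>${EXECUTOR_CLASS}</value>
</property>
EOF

cat yarn-site-fragment.xml
```

The class would also need to be on the Node Manager's classpath, and as the reply notes, this kind of setup is aimed at mini-cluster testing rather than production.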
