Posted to user@hadoop.apache.org by Zoltán Tóth-Czifra <zo...@softonic.com> on 2012/11/27 13:03:41 UTC

Complex MapReduce applications with the streaming API

Hi everyone,

Thanks in advance for the support. My problem is the following:

I'm trying to develop a fairly complex MapReduce application using the streaming API (for demonstration purposes, so unfortunately the "use Java" answer doesn't work :-( ). I can run a single MapReduce phase from the command line with no problem. The problem arises when I want to add more MapReduce phases that consume each other's output, and perhaps even recurse (feed a phase's output back into the same phase), conditioned on a counter.

The solution in Java MapReduce is trivial (i.e. creating multiple Job instances and monitoring their counters), but with the streaming API it is not. What is the correct way to manage my application from its native code (Python, PHP, Perl...)? Calling shell commands from a "controller" script? And how should I obtain counters?
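To make the intent concrete, here is a rough Python sketch of the controller loop I have in mind; the streaming jar path, the mapper/reducer names, and the injected hooks are purely illustrative assumptions, not a working setup:

```python
import subprocess

def run_streaming_phase(input_path, output_path):
    """Submit one streaming phase. The jar location and script names
    are assumptions; adjust them for your installation."""
    subprocess.check_call([
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", input_path,
        "-output", output_path,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
    ])

def iterate_phase(run_phase, read_counter, start_input, max_iters=10):
    """Feed a phase's output back into itself until a counter reaches
    zero. run_phase and read_counter are injected callables so the loop
    logic can be exercised without a cluster."""
    current = start_input
    for i in range(max_iters):
        out = "%s_iter%d" % (start_input, i)
        run_phase(current, out)     # e.g. run_streaming_phase
        if read_counter() == 0:     # e.g. via 'hadoop job -counter'
            return out              # converged: final output directory
        current = out
    raise RuntimeError("no convergence after %d iterations" % max_iters)
```

In real use, run_phase would be run_streaming_phase and read_counter would shell out to the job client; both are stand-ins here.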

Using Oozie seems like overkill for this application; besides, it doesn't support "loops", so the recursion can't really be implemented.

Thanks a lot!
Zoltan

RE: Complex MapReduce applications with the streaming API

Posted by Zoltán Tóth-Czifra <zo...@softonic.com>.
Hi,

Thanks, the self-referencing sub-workflow is a good idea; it never occurred to me.
However, I'm still hoping for something more lightweight, without Oozie or other external tools.

My best idea now is simply to wrap the exec call in my script that submits the job (hadoop jar hadoop-streaming.jar ...), extract the job ID from its output, and then wrap another exec call (hadoop job -counter ...) that can give me info about the counters. Is this the best option?
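The two bits of text-scraping that approach needs can stay small. A sketch in Python; the log-line format ("Running job: job_...") matches what my JobClient prints, but treat it as an assumption for other Hadoop versions:

```python
import re

def extract_job_id(submit_log):
    """Find the job id in the captured output of
    'hadoop jar hadoop-streaming.jar ...'. Assumes a
    'Running job: job_...' line, as printed by the job client."""
    m = re.search(r"Running job:\s*(job_\S+)", submit_log)
    if m is None:
        raise ValueError("no job id found in submit output")
    return m.group(1)

def parse_counter_value(counter_output):
    """'hadoop job -counter <job-id> <group> <name>' prints the value
    by itself; take the last non-empty line and parse it as an int."""
    lines = [l for l in counter_output.splitlines() if l.strip()]
    return int(lines[-1])
```

The submit log would come from capturing the streaming command's stdout/stderr with subprocess, and the counter output from a second subprocess call; both calls are assumed, not shown.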

Thanks!

________________________________
From: Alejandro Abdelnur [tucu@cloudera.com]
Sent: Tuesday, November 27, 2012 6:10 PM
To: common-user@hadoop.apache.org
Subject: Re: Complex MapReduce applications with the streaming API

> Using Oozie seems like overkill for this application; besides, it doesn't support "loops",
> so the recursion can't really be implemented.

Correct, Oozie does not support loops; this is a restriction by design (early prototypes supported them). The idea was to avoid never-ending workflows. To that end, Coordinator Jobs address the recurrent running of workflow jobs.

Still, if you want to do recursion in Oozie, you certainly can: have a workflow invoke itself as a sub-workflow. Just make sure you define your exit condition properly.

If you have additional questions, please move this thread to the user@oozie.apache.org alias.


Thx


On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra <zo...@softonic.com> wrote:
Hi everyone,

Thanks in advance for the support. My problem is the following:

I'm trying to develop a fairly complex MapReduce application using the streaming API (for demonstration purposes, so unfortunately the "use Java" answer doesn't work :-( ). I can run a single MapReduce phase from the command line with no problem. The problem arises when I want to add more MapReduce phases that consume each other's output, and perhaps even recurse (feed a phase's output back into the same phase), conditioned on a counter.

The solution in Java MapReduce is trivial (i.e. creating multiple Job instances and monitoring their counters), but with the streaming API it is not. What is the correct way to manage my application from its native code (Python, PHP, Perl...)? Calling shell commands from a "controller" script? And how should I obtain counters?

Using Oozie seems like overkill for this application; besides, it doesn't support "loops", so the recursion can't really be implemented.

Thanks a lot!
Zoltan



--
Alejandro


Re: Complex MapReduce applications with the streaming API

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
> Using Oozie seems like overkill for this application; besides, it doesn't
> support "loops", so the recursion can't really be implemented.

Correct, Oozie does not support loops; this is a restriction by design
(early prototypes supported them). The idea was to avoid never-ending
workflows. To that end, Coordinator Jobs address the recurrent running
of workflow jobs.

Still, if you want to do recursion in Oozie, you certainly can: have a
workflow invoke itself as a sub-workflow. Just make sure you define your
exit condition properly.
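Schematically, such a self-referencing workflow could look like the sketch
below. The counter group and name ("ITERATION"/"REMAINING") and the exit
test are hypothetical placeholders, not something from the question:

```xml
<workflow-app name="iterate" xmlns="uri:oozie:workflow:0.2">
    <start to="phase"/>
    <action name="phase">
        <map-reduce>
            <!-- streaming phase configuration elided -->
        </map-reduce>
        <ok to="check"/>
        <error to="fail"/>
    </action>
    <decision name="check">
        <switch>
            <!-- recurse while the phase's counter says work remains -->
            <case to="recurse">
                ${hadoop:counters('phase')['ITERATION']['REMAINING'] gt 0}
            </case>
            <default to="end"/>
        </switch>
    </decision>
    <action name="recurse">
        <sub-workflow>
            <!-- the workflow points at its own definition -->
            <app-path>${wf:appPath()}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail"><message>Iteration failed</message></kill>
    <end name="end"/>
</workflow-app>
```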

If you have additional questions, please move this thread to the
user@oozie.apache.org alias.


Thx


On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra <
zoltan.tothczifra@softonic.com> wrote:

>  Hi everyone,
>
>  Thanks in advance for the support. My problem is the following:
>
>  I'm trying to develop a fairly complex MapReduce application using the
> streaming API (for demonstration purposes, so unfortunately the "use Java"
> answer doesn't work :-( ). I can run a single MapReduce phase from the
> command line with no problem. The problem arises when I want to add more
> MapReduce phases that consume each other's output, and perhaps even recurse
> (feed a phase's output back into the same phase), conditioned on a counter.
>
>  The solution in Java MapReduce is trivial (i.e. creating multiple Job
> instances and monitoring their counters), but with the streaming API it is
> not. What is the correct way to manage my application from its native code
> (Python, PHP, Perl...)? Calling shell commands from a "controller" script?
> And how should I obtain counters?
>
>  Using Oozie seems like overkill for this application; besides, it doesn't
> support "loops", so the recursion can't really be implemented.
>
>  Thanks a lot!
> Zoltan
>



-- 
Alejandro
