Posted to common-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/01/12 17:09:11 UTC

Scheduling non-MR processes

I am trying to understand how one can make a "side process" cooperate with the Hadoop MapReduce task scheduler.  Suppose that I have an application that is not directly integrated with MapReduce (i.e., it is not a MapReduce job at all; there are no mappers or reducers).  This application could access HDFS as an external client, but its throughput would be limited.  I want to run this application in parallel on HDFS nodes to realize the benefits of parallel computation and data locality, and I want to cooperate with Hadoop's resource management, but I don't want the *data* to get pushed through MapReduce, because the nature of the application doesn't lend itself nicely to MR integration.

Perhaps if I explain why I think this is not suitable for regular MR jobs it may help.  Suppose that I have stored into HDFS a very large file for which there is no Java library.  JNI could be an option, but wrapping the complex function of legacy application code into JNI may be more work than it is worth.  The application performs some very complex processing, and this is something that we don't necessarily want to redesign to fit the MR paradigm.  Obviously the data file is "splittable" or this approach wouldn't work at all.  So perhaps it is possible to hook into MR at the Splitter level, and use that to create a series of mapper tasks where the mappers don't actually read the data directly, but hand off the corresponding data block to the legacy application for processing?

Sorry if this is somewhat loosely defined; we are still searching for the optimal integration strategy.  I hope you can see what I am trying to do and can offer some suggestions.
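[Editor's note: the hand-off John sketches above — using the split machinery only to assign byte ranges, while a legacy executable does the actual reading — can be illustrated without Hadoop. The file, block size, and "legacy-app" command below are all stand-ins, not real Hadoop API; a real InputFormat would compute splits from HDFS block locations.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SplitHandoffSketch {
    public static void main(String[] args) throws IOException {
        // Small local stand-in for the "very large file" in HDFS.
        Path data = Files.createTempFile("legacy-input", ".dat");
        Files.write(data, new byte[250]);

        long blockSize = 100; // stand-in for the HDFS block size
        long len = Files.size(data);

        // Mimic InputFormat.getSplits(): one (offset, length) pair per block.
        // The mapper would not read these bytes itself; it would exec the
        // legacy application with the range as arguments.
        for (long off = 0; off < len; off += blockSize) {
            long splitLen = Math.min(blockSize, len - off);
            System.out.println("split offset=" + off + " length=" + splitLen
                    + " -> legacy-app " + data + " " + off + " " + splitLen);
        }
        Files.delete(data);
    }
}
```

This prints one hand-off line per block-sized range (here 100, 100, and 50 bytes); in a real deployment the scheduler would also place each range's task on a node holding that block.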


Re: Scheduling non-MR processes

Posted by Harsh J <ha...@cloudera.com>.
Perhaps JNI is your best bet at this point. I am not aware of a native
C++ interface for YARN available yet, although in 2.x releases or 2.x
based distributions the protocol communication is done using protocol
buffers, which may perhaps be a primitive step to having native
interfaces (many would like to have something easy for C/C++, so if
there's anything you end up finding/developing, do share!).

On Sat, Jan 12, 2013 at 10:14 PM, John Lilley <jo...@redpoint.net> wrote:
> Harsh,
>
> Thanks for the insight!  I didn't realize that YARN was more than a more-scalable MR scheduler.  So if we program our application to schedule its tasks directly with YARN we should be able to do what I am describing?  Is there any non-native-Java interop for YARN or should we focus on JNI for that?
>
> John
>
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Saturday, January 12, 2013 9:41 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: Scheduling non-MR processes
>
> Hi,
>
> Inline.
>
> On Sat, Jan 12, 2013 at 9:39 PM, John Lilley <jo...@redpoint.net> wrote:
>> I am trying to understand how one can make a "side process" cooperate
>> with the Hadoop MapReduce task scheduler.  Suppose that I have an
>> application that is not directly integrated with MapReduce (i.e., it
>> is not a MapReduce job at all; there are no mappers or reducers).
>> This application could access HDFS as an external client, but it would
>> be limited in its throughput.  I want to run this application in
>> parallel on HDFS nodes to realize the benefits of parallel computation
>> and data locality.  But I want to cooperate in resource management
>> with Hadoop.  But I don't want the
>> *data* to get pushed through MapReduce, because the nature of the
>> application doesn't lend itself nicely to MR integration.
>
> Apache Hadoop has moved past plain MR onto YARN. YARN allows MR (called MR2) and also allows other forms of generic, distributed apps to be developed for any other purposes.
>
>> Perhaps if I explain why I think this is not suitable for regular MR
>> jobs it may help.  Suppose that I have stored into HDFS a very large
>> file for which there is no Java library.  JNI could be an option, but
>> wrapping the complex function of legacy application code into JNI may
>> be more work than it is worth.  The application performs some very
>> complex processing, and this is something that we don't necessarily want to redesign to fit the MR paradigm.
>> Obviously the data file is "splittable" or this approach wouldn't work
>> at all.  So perhaps it is possible to hook into MR at the Splitter
>> level, and use that to create a series of mapper tasks where the
>> mappers don't actually read the data directly, but hand off the
>> corresponding data block to the legacy application for processing?
>
> Yes if you're stuck on a platform that just has MR and you want to somehow leverage a map-only distribution to do this, you should tweak your job to (a) use empty splits and (b) run infinitely. For (a), take a look at the Sleep Job example [1] that utilizes empty splits - no data, but you can control number of mappers, etc. and have mapper logic do work. For (b), study the SleepJob's mapper to see how it periodically reports progress or status changes (can be done via a daemon thread too) such that the framework does not think it has died or gone unresponsive.
>
> But ideally, you'd want to leverage YARN for this. Libraries such as Kitten [2] help along in this task.
>
> [1] - https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/SleepJob.java
> [2] - https://github.com/cloudera/kitten/
>
> --
> Harsh J



-- 
Harsh J
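[Editor's note: the map-only pattern Harsh describes — empty splits plus periodic progress reports so the framework does not kill a long-running task — hinges on a daemon heartbeat thread. The sketch below models just that thread in plain Java; in a real mapper the heartbeat would call `context.progress()`, which is omitted here along with all Hadoop types.]

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatSketch {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger beats = new AtomicInteger();
        CountDownLatch threeBeats = new CountDownLatch(3);

        // Daemon thread standing in for the progress reporter: in a real
        // map task each beat would be a context.progress() call so the
        // framework does not mark the long-running task as unresponsive.
        Thread heartbeat = new Thread(() -> {
            while (true) {
                beats.incrementAndGet();
                threeBeats.countDown();
                try { Thread.sleep(50); } catch (InterruptedException e) { return; }
            }
        });
        heartbeat.setDaemon(true);
        heartbeat.start();

        // Stand-in for the legacy application's long-running work: block
        // until at least three heartbeats have fired.
        threeBeats.await();
        System.out.println("heartbeats reported: " + (beats.get() >= 3));
    }
}
```

Because the thread is a daemon, it dies with the task; the worker thread never has to interleave progress calls into its own logic.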

Re: Scheduling non-MR processes

Posted by Arun C Murthy <ac...@hortonworks.com>.
YARN implies 2 pieces:
# An application-specific 'master' to co-ordinate your application (mainly to get resources, i.e. containers, for its application from the ResourceManager and use them)
# Application-specific code which runs in the allocated Containers.

Given your use case, I'd recommend that you explore writing the AppMaster in Java (i.e. just the co-ordination piece) and implement the actual application in C, C++ etc.

In fact, we already have something called a DistributedShell which has the Java AppMaster but runs any shell script as the 'Container':
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/
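[Editor's note: stripped of the YARN client API, Arun's DistributedShell model reduces to a Java coordinator that execs a non-Java command in each container. The container-side launch is essentially a `ProcessBuilder` exec, sketched below; the echoed command stands in for the legacy application, and the real NodeManager adds localization, environment setup, and resource enforcement around this step.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ContainerLaunchSketch {
    public static void main(String[] args) throws Exception {
        // In a real YARN app, the AppMaster asks the ResourceManager for
        // containers and hands each one a launch command; the NodeManager
        // then execs it, much like this (simplified, no YARN API).
        ProcessBuilder pb = new ProcessBuilder("sh", "-c", "echo legacy-app-done");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.out.println("exit=" + p.waitFor());
    }
}
```

The appeal of this split is that only the co-ordination piece needs to be Java; the launched command can be any C or C++ binary already on the node.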

Some useful links to learn more about YARN:
Series of posts on YARN: http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
Writing your own application in YARN: http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

hth,
Arun

On Jan 12, 2013, at 8:44 AM, John Lilley wrote:

> Harsh,
> 
> Thanks for the insight!  I didn't realize that YARN was more than a more-scalable MR scheduler.  So if we program our application to schedule its tasks directly with YARN we should be able to do what I am describing?  Is there any non-native-Java interop for YARN or should we focus on JNI for that?
> 
> John
> 
> 
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com] 
> Sent: Saturday, January 12, 2013 9:41 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: Scheduling non-MR processes
> 
> Hi,
> 
> Inline.
> 
> On Sat, Jan 12, 2013 at 9:39 PM, John Lilley <jo...@redpoint.net> wrote:
>> I am trying to understand how one can make a "side process" cooperate 
>> with the Hadoop MapReduce task scheduler.  Suppose that I have an 
>> application that is not directly integrated with MapReduce (i.e., it 
>> is not a MapReduce job at all; there are no mappers or reducers).  
>> This application could access HDFS as an external client, but it would 
>> be limited in its throughput.  I want to run this application in 
>> parallel on HDFS nodes to realize the benefits of parallel computation 
>> and data locality.  But I want to cooperate in resource management 
>> with Hadoop.  But I don't want the
>> *data* to get pushed through MapReduce, because the nature of the 
>> application doesn't lend itself nicely to MR integration.
> 
> Apache Hadoop has moved past plain MR onto YARN. YARN allows MR (called MR2) and also allows other forms of generic, distributed apps to be developed for any other purposes.
> 
>> Perhaps if I explain why I think this is not suitable for regular MR 
>> jobs it may help.  Suppose that I have stored into HDFS a very large 
>> file for which there is no Java library.  JNI could be an option, but 
>> wrapping the complex function of legacy application code into JNI may 
>> be more work than it is worth.  The application performs some very 
>> complex processing, and this is something that we don't necessarily want to redesign to fit the MR paradigm.
>> Obviously the data file is "splittable" or this approach wouldn't work 
>> at all.  So perhaps it is possible to hook into MR at the Splitter 
>> level, and use that to create a series of mapper tasks where the 
>> mappers don't actually read the data directly, but hand off the 
>> corresponding data block to the legacy application for processing?
> 
> Yes if you're stuck on a platform that just has MR and you want to somehow leverage a map-only distribution to do this, you should tweak your job to (a) use empty splits and (b) run infinitely. For (a), take a look at the Sleep Job example [1] that utilizes empty splits - no data, but you can control number of mappers, etc. and have mapper logic do work. For (b), study the SleepJob's mapper to see how it periodically reports progress or status changes (can be done via a daemon thread too) such that the framework does not think it has died or gone unresponsive.
> 
> But ideally, you'd want to leverage YARN for this. Libraries such as Kitten [2] help along in this task.
> 
> [1] - https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/SleepJob.java
> [2] - https://github.com/cloudera/kitten/
> 
> --
> Harsh J

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



> 
> But ideally, you'd want to leverage YARN for this. Libraries such as Kitten [2] help along in this task.
> 
> [1] - https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/SleepJob.java
> [2] - https://github.com/cloudera/kitten/
> 
> --
> Harsh J

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



RE: Scheduling non-MR processes

Posted by John Lilley <jo...@redpoint.net>.
Harsh,

Thanks for the insight!  I didn't realize that YARN was more than a more-scalable MR scheduler.  So if we program our application to schedule its tasks directly with YARN we should be able to do what I am describing?  Is there any non-native-Java interop for YARN or should we focus on JNI for that?

John


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Saturday, January 12, 2013 9:41 AM
To: <us...@hadoop.apache.org>
Subject: Re: Scheduling non-MR processes

Hi,

Inline.

On Sat, Jan 12, 2013 at 9:39 PM, John Lilley <jo...@redpoint.net> wrote:
> I am trying to understand how one can make a "side process" cooperate 
> with the Hadoop MapReduce task scheduler.  Suppose that I have an 
> application that is not directly integrated with MapReduce (i.e., it 
> is not a MapReduce job at all; there are no mappers or reducers).  
> This application could access HDFS as an external client, but it would 
> be limited in its throughput.  I want to run this application in 
> parallel on HDFS nodes to realize the benefits of parallel computation 
> and data locality.  But I want to cooperate in resource management 
> with Hadoop.  But I don't want the
> *data* to get pushed through MapReduce, because the nature of the 
> application doesn't lend itself nicely to MR integration.

Apache Hadoop has moved past plain MR onto YARN. YARN allows MR (called MR2) and also allows other forms of generic, distributed apps to be developed for any other purposes.

> Perhaps if I explain why I think this is not suitable for regular MR 
> jobs it may help.  Suppose that I have stored into HDFS a very large 
> file for which there is no Java library.  JNI could be an option, but 
> wrapping the complex function of legacy application code into JNI may 
> be more work than it is worth.  The application performs some very 
> complex processing, and this is something that we don't necessarily want to redesign to fit the MR paradigm.
> Obviously the data file is "splittable" or this approach wouldn't work 
> at all.  So perhaps it is possible to hook into MR at the Splitter 
> level, and use that to create a series of mapper tasks where the 
> mappers don't actually read the data directly, but hand off the 
> corresponding data block to the legacy application for processing?

Yes if you're stuck on a platform that just has MR and you want to somehow leverage a map-only distribution to do this, you should tweak your job to (a) use empty splits and (b) run infinitely. For (a), take a look at the Sleep Job example [1] that utilizes empty splits - no data, but you can control number of mappers, etc. and have mapper logic do work. For (b), study the SleepJob's mapper to see how it periodically reports progress or status changes (can be done via a daemon thread too) such that the framework does not think it has died or gone unresponsive.

But ideally, you'd want to leverage YARN for this. Libraries such as Kitten [2] help along in this task.

[1] - https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/SleepJob.java
[2] - https://github.com/cloudera/kitten/

--
Harsh J
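On the interop question above: one non-JNI option is to launch the legacy binary as a child process and stream each split's bytes to it over stdin. A small sketch of that hand-off, using `cat` as a stand-in for the legacy binary (assumes a POSIX environment; in a real task you would stream the split's bytes from HDFS rather than a literal):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch: hand a data block to a legacy executable over a pipe
// instead of wrapping it with JNI. "cat" stands in for the binary.
public class PipeHandoffSketch {
    public static void main(String[] args)
            throws IOException, InterruptedException {
        byte[] block = "records from one HDFS split\n"
                .getBytes(StandardCharsets.UTF_8);

        ProcessBuilder pb = new ProcessBuilder("cat");
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT); // child stdout -> ours
        Process legacy = pb.start();

        try (OutputStream toLegacy = legacy.getOutputStream()) {
            toLegacy.write(block); // real code: copy the split's bytes here
        }
        int exit = legacy.waitFor();
        System.out.println("exit=" + exit);
    }
}
```

This is essentially the Hadoop Streaming pattern, and it avoids linking legacy code into the JVM at all.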

Re: Scheduling non-MR processes

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Inline.

On Sat, Jan 12, 2013 at 9:39 PM, John Lilley <jo...@redpoint.net> wrote:
> I am trying to understand how one can make a “side process” cooperate with
> the Hadoop MapReduce task scheduler.  Suppose that I have an application
> that is not directly integrated with MapReduce (i.e., it is not a MapReduce
> job at all; there are no mappers or reducers).  This application could
> access HDFS as an external client, but it would be limited in its
> throughput.  I want to run this application in parallel on HDFS nodes to
> realize the benefits of parallel computation and data locality.  But I want
> to cooperate in resource management with Hadoop.  But I don’t want the
> *data* to get pushed through MapReduce, because the nature of the
> application doesn’t lend itself nicely to MR integration.

Apache Hadoop has moved past plain MR to YARN. YARN runs MR
(now called MR2) and also allows other kinds of generic, distributed
apps to be developed for other purposes.

> Perhaps if I explain why I think this is not suitable for regular MR jobs it
> may help.  Suppose that I have stored into HDFS a very large file for which
> there is no Java library.  JNI could be an option, but wrapping the complex
> function of legacy application code into JNI may be more work than it is
> worth.  The application performs some very complex processing, and this is
> something that we don’t necessarily want to redesign to fit the MR paradigm.
> Obviously the data file is “splittable” or this approach wouldn’t work at
> all.  So perhaps it is possible to hook into MR at the Splitter level, and
> use that to create a series of mapper tasks where the mappers don’t actually
> read the data directly, but hand off the corresponding data block to the
> legacy application for processing?

Yes, if you're stuck on a platform that has only MR and you want to
leverage a map-only distribution for this, tweak your job to (a) use
empty splits and (b) run indefinitely. For (a), take a look at the
SleepJob example [1], which uses empty splits - no input data, but you
can control the number of mappers, etc., and have the mapper logic do
the work. For (b), study SleepJob's mapper to see how it periodically
reports progress or status changes (this can also be done from a
daemon thread) so that the framework does not conclude the task has
died or gone unresponsive.
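The daemon-thread heartbeat for (b) can be sketched in plain Java like this (outside any Hadoop classes; `reportProgress` is a hypothetical stand-in for the mapper's `context.progress()` call):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatSketch {
    static final AtomicInteger beats = new AtomicInteger();

    // Stand-in for Hadoop's context.progress() / status-update call.
    static void reportProgress() {
        beats.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        // A daemon thread pings the framework so the long-running,
        // non-reporting legacy work is not declared dead.
        ScheduledExecutorService heartbeat =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "progress-heartbeat");
                t.setDaemon(true);
                return t;
            });
        heartbeat.scheduleAtFixedRate(
            HeartbeatSketch::reportProgress, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(500); // simulated long-running legacy processing

        heartbeat.shutdownNow();
        System.out.println("heartbeats=" + (beats.get() > 0));
    }
}
```

In a real mapper the `Thread.sleep` would be the hand-off to the legacy application, and the scheduled task would call `context.progress()`.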

But ideally, you'd want to leverage YARN for this. Libraries such as
Kitten [2] can help with this task.

[1] - https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/SleepJob.java
[2] - https://github.com/cloudera/kitten/

--
Harsh J
