Posted to hdfs-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2014/06/25 22:48:24 UTC

persistent services in Hadoop

We are an ISV that currently ships a data-quality/integration suite running as a native YARN application.  We are finding several use cases that would benefit from being able to manage a per-node persistent service.  MapReduce has its "shuffle auxiliary service", but it isn't straightforward to add auxiliary services because they cannot be loaded from HDFS, so we'd have to manage the distribution of JARs across nodes (please tell me if I'm wrong here...).  Given that, is there a preferred method for managing persistent services on a Hadoop cluster?  We could have an AM that creates a set of YARN tasks, waits until YARN gives it a task on each node, and restarts any failed tasks, but that doesn't really fit the AM/container structure very well.  I've also read about Slider, which looks interesting.  Other ideas?
--john
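
The "one task per node, restart on failure" idea in the message above can be sketched as a small piece of bookkeeping. This is only a minimal model under stated assumptions: it does not call YARN at all, and the names PerNodeServiceTracker, on_allocated, on_completed and pending_nodes are hypothetical, not part of any Hadoop API. In a real AM the allocation and completion events would come from the YARN client library, and the node names would come from the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class PerNodeServiceTracker:
    """Hypothetical bookkeeping for keeping one service container per node."""

    nodes: set                                    # nodes that should run the service
    running: dict = field(default_factory=dict)   # node -> container id

    def pending_nodes(self):
        """Nodes that currently have no service container (sorted for stable output)."""
        return sorted(self.nodes - set(self.running))

    def on_allocated(self, node, container_id):
        """Record an allocation; False means the container is surplus and should be released."""
        if node not in self.nodes or node in self.running:
            return False
        self.running[node] = container_id
        return True

    def on_completed(self, container_id):
        """Forget a finished container; return the node that now needs a new one, if any."""
        for node, cid in self.running.items():
            if cid == container_id:
                del self.running[node]
                return node  # return immediately, so the dict is not iterated after deletion
        return None


# Usage: node2 is still waiting after node1 receives its container.
tracker = PerNodeServiceTracker(nodes={"node1", "node2"})
tracker.on_allocated("node1", "container_01")
print(tracker.pending_nodes())  # ['node2']
```

The AM would re-request a container for every node returned by pending_nodes() and release any allocation for which on_allocated() returns False. The awkwardness mentioned in the message shows up here too: the AM has to run for as long as the services do, purely to do this bookkeeping.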

RE: persistent services in Hadoop

Posted by John Lilley <jo...@redpoint.net>.
Thanks Arun!
I do think we are on the bleeding edge of YARN, because everyone else in our application space generates MapReduce (Pig, Hive), or they have overlaid their legacy server-grid on Hadoop.
I will explore both resources you mentioned to see where the development community is headed.
Cheers,
john


From: Arun Murthy [mailto:acm@hortonworks.com]
Sent: Wednesday, June 25, 2014 11:50 PM
To: user@hadoop.apache.org
Subject: Re: persistent services in Hadoop

John,

 We are excited to see ISVs like you get value from YARN, and appreciate the patience you've already shown in the past to work through the teething issues of YARN & hadoop-2.x.

 With regard to long-running services, the most straightforward option is to go through Apache Slider (http://slider.incubator.apache.org/). Slider has already made good progress in supporting long-running services such as Apache HBase, Apache Accumulo & Apache Storm. I'm sure the Slider community would welcome your use cases and suggestions, particularly as they are gearing up to support various applications on top of it, and would love your feedback.

 Furthermore, there is work going on in YARN itself to better support your use case: https://issues.apache.org/jira/browse/YARN-896.
 Again, your feedback there is very, very welcome.

 Also, you might be interested in https://issues.apache.org/jira/browse/YARN-1530 which provides a generic framework for collecting application metrics for YARN applications.

 Hope that helps.

thanks,
Arun

On Wed, Jun 25, 2014 at 1:48 PM, John Lilley <jo...@redpoint.net> wrote:
We are an ISV that currently ships a data-quality/integration suite running as a native YARN application.  We are finding several use cases that would benefit from being able to manage a per-node persistent service.  MapReduce has its “shuffle auxiliary service”, but it isn’t straightforward to add auxiliary services because they cannot be loaded from HDFS, so we’d have to manage the distribution of JARs across nodes (please tell me if I’m wrong here…).  Given that, is there a preferred method for managing persistent services on a Hadoop cluster?  We could have an AM that creates a set of YARN tasks and just waits until YARN gives a task on each node, and restart any failed tasks, but it doesn’t really fit the AM/container structure very well.  I’ve also read about Slider, which looks interesting.  Other ideas?
--john



--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.

Re: persistent services in Hadoop

Posted by Arun Murthy <ac...@hortonworks.com>.
John,

 We are excited to see ISVs like you get value from YARN, and appreciate
the patience you've already shown in the past to work through the teething
issues of YARN & hadoop-2.x.

 With regard to long-running services, the most straightforward option is
to go through Apache Slider (http://slider.incubator.apache.org/). Slider
has already made good progress in supporting long-running services such as
Apache HBase, Apache Accumulo & Apache Storm. I'm sure the Slider community
would welcome your use cases and suggestions, particularly as they are
gearing up to support various applications on top of it, and would love
your feedback.

 Furthermore, there is work going on in YARN itself to better support your
use case: https://issues.apache.org/jira/browse/YARN-896.
 Again, your feedback there is very, very welcome.

 Also, you might be interested in
https://issues.apache.org/jira/browse/YARN-1530 which provides a generic
framework for collecting application metrics for YARN applications.

 Hope that helps.

thanks,
Arun


On Wed, Jun 25, 2014 at 1:48 PM, John Lilley <jo...@redpoint.net>
wrote:

>  We are an ISV that currently ships a data-quality/integration suite
> running as a native YARN application.  We are finding several use cases
> that would benefit from being able to manage a per-node persistent
> service.  MapReduce has its “shuffle auxiliary service”, but it isn’t
> straightforward to add auxiliary services because they cannot be loaded
> from HDFS, so we’d have to manage the distribution of JARs across nodes
> (please tell me if I’m wrong here…).  Given that, is there a preferred
> method for managing persistent services on a Hadoop cluster?  We could have
> an AM that creates a set of YARN tasks and just waits until YARN gives a
> task on each node, and restart any failed tasks, but it doesn’t really fit
> the AM/container structure very well.  I’ve also read about Slider, which
> looks interesting.  Other ideas?
>
> --john
>



--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

