
Posted to dev@beam.apache.org by Amit Sela <am...@gmail.com> on 2016/07/28 10:24:39 UTC

[DISCUSS] cluster infrastructure - resource manager - for on going tests

Following a discussion I had with Kenneth and Dan here
<https://github.com/apache/incubator-beam/pull/711>, I want to raise the
issue of which resource manager we should use for ongoing tests that will
run on actual clusters (on top of local/in-mem tests).
If we plan to test all runners on all their supported resource managers,
great! But I guess this won't be the case, at least not at the beginning.

Spark can run its own (Standalone Mode) resource manager, use YARN or use
Mesos. According to the latest survey by Databricks
<http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf>,
Standalone is in the lead (48%), with YARN trailing it (40%), while Mesos
looks like the least favourite.
For Spark, I'd vote for Standalone as it is the most popular use case + it
avoids the additional complexity of maintaining YARN on this cluster.
Having said that, AFAIK Flink is a "first-class" YARN citizen (right?) and
I don't know what available resource managers can be used by other runners,
so I think runner authors should give their input here.

*Summary:*
*Spark* - StandaloneMode or YARN (in that order).
*Flink* - ?
*Others* - ?

Thanks,
Amit
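
For readers weighing the three Spark options above, here is a minimal sketch of how the choice of resource manager might translate into the master URL a test job is submitted with. The helper name, the host name and the idea of a per-cluster test configuration are assumptions for illustration and are not something proposed in this thread; the URL schemes and default ports (7077 for Standalone, 5050 for Mesos) follow the Spark documentation.

```python
# Hypothetical helper (not part of Beam or Spark): pick the Spark master URL
# for a test cluster based on the resource manager under test.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def spark_master_url(resource_manager, host="localhost", port=None):
    """Return the value to pass as the Spark master for the given manager."""
    rm = resource_manager.lower()
    if rm == "yarn":
        # With YARN the cluster location is not in the URL; Spark reads it
        # from the Hadoop configuration (HADOOP_CONF_DIR / YARN_CONF_DIR).
        return "yarn"
    if rm == "standalone":
        return "spark://{}:{}".format(host, port or DEFAULT_PORTS[rm])
    if rm == "mesos":
        return "mesos://{}:{}".format(host, port or DEFAULT_PORTS[rm])
    raise ValueError("unsupported resource manager: " + resource_manager)


if __name__ == "__main__":
    # "test-master" is a made-up host name for the example.
    for rm in ("standalone", "yarn", "mesos"):
        print(rm, "->", spark_master_url(rm, host="test-master"))
```

Under these assumptions, testing a runner on a second resource manager is mostly a matter of changing one configuration value, while the real cost is operating the extra cluster service, which is the complexity the message above wants to avoid.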

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Manu,

It's what I'm doing on my boxes for Spark and Flink as well.
I think it's a very convenient approach.

Regards
JB

On 07/29/2016 12:26 AM, Manu Zhang wrote:
> In Gearpump, we use docker to launch standalone Gearpump (with master and
> worker on different containers), Kafka cluster (source) and HDFS cluster
> (checkpoint store) to test message delivery guarantee, i.e. at least once,
> exactly-once.
>
> Thanks,
> Manu

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Manu Zhang <ow...@gmail.com>.

In Gearpump, we use Docker to launch standalone Gearpump (with master and
worker in different containers), a Kafka cluster (source) and an HDFS cluster
(checkpoint store) to test message delivery guarantees, i.e. at-least-once
and exactly-once.

Thanks,
Manu

On Fri, Jul 29, 2016 at 5:10 AM Ismaël Mejía <ie...@gmail.com> wrote:

> Hi again,
>
> Understood, if later on we (beam) decide to put this in place, I can help a
> bit, since this is a subject I like, and it is clear for me that this idea
> can have immediate benefits (better integration tests and of course better
> IOs/runners).
>
> Ismael

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Ismaël Mejía <ie...@gmail.com>.

Hi again,

Understood. If later on we (Beam) decide to put this in place, I can help a
bit, since this is a subject I like, and it is clear to me that this idea
can have immediate benefits (better integration tests and of course better
IOs/runners).

Ismael

On Thu, Jul 28, 2016 at 10:38 PM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> Hi Ismaël,
>
> I was just talking in general about what any project would want to do. I
> don't have any specific plans.
>
> Kenn

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Kenneth Knowles <kl...@google.com.INVALID>.

Hi Ismaël,

I was just talking in general about what any project would want to do. I
don't have any specific plans.

Kenn

On Thu, Jul 28, 2016 at 12:53 PM, Ismaël Mejía <ie...@gmail.com> wrote:

> Kenneth, this is great news (I am talking about the additional services). I
> was just discussing with JB the other day how nice it would be to
> have this kind of test, with the right infrastructure, since we are
> working on new IOs, e.g. to test particular behaviors with Kafka or
> other systems, how the IOs react to failure, etc.
>
> It is nice to know that this can be supported. Any concrete plans for how
> to make this work? Do you intend to deploy such systems via
> containers or just have them in some test cluster?
>
> As Aljoscha mentions, Kafka and YARN each need quite a few 'extra'
> dependencies at deploy time.
>
> Thanks again for this idea,
> Ismael.

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Ismaël Mejía <ie...@gmail.com>.

Kenneth, this is great news (I am talking about the additional services). I
was just discussing with JB the other day how nice it would be to
have this kind of test, with the right infrastructure, since we are
working on new IOs, e.g. to test particular behaviors with Kafka or
other systems, how the IOs react to failure, etc.

It is nice to know that this can be supported. Any concrete plans for how
to make this work? Do you intend to deploy such systems via
containers or just have them in some test cluster?

As Aljoscha mentions, Kafka and YARN each need quite a few 'extra'
dependencies at deploy time.

Thanks again for this idea,
Ismael.



On Thu, Jul 28, 2016 at 6:48 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> For Flink, Yarn is fine and I guess it's the common denominator for all
> runners (except DataflowRunner, of course).
>
> @Kenn IMHO the common deployment is Kafka (running standalone, because it
> only works that way), which also requires Zookeeper (if I'm not mistaken)
> and YARN, which all runners should be able to run on.

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Aljoscha Krettek <al...@apache.org>.

For Flink, YARN is fine and I guess it's the common denominator for all
runners (except DataflowRunner, of course).

@Kenn IMHO the common deployment is Kafka (running standalone, because it
only works that way), which also requires ZooKeeper (if I'm not mistaken),
plus YARN, which all runners should be able to run on.

On Thu, 28 Jul 2016 at 18:36 Kenneth Knowles <kl...@google.com.invalid> wrote:

> Presumably we'll eventually also run additional services alongside (like
> Kafka) to have true integration tests for I/O connectors. What is the
> common deployment in this case?

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Kenneth Knowles <kl...@google.com.INVALID>.

Presumably we'll eventually also run additional services alongside (like
Kafka) to have true integration tests for I/O connectors. What is the
common deployment in this case?

On Jul 28, 2016 06:35, "Amit Sela" <am...@gmail.com> wrote:

> So what would be the preferred resource manager to test Flink on?
>
> On Thu, Jul 28, 2016, 16:34 Aljoscha Krettek <al...@apache.org> wrote:
>
> > Flink also has a standalone mode.
> >
> > On Thu, 28 Jul 2016 at 13:42 Ismaël Mejía <ie...@gmail.com> wrote:
> >
> > > Good subject. YARN is the de-facto standard, at least from the point of
> > > view of the Big Data distributions (Cloudera, Hortonworks, etc.) and
> > > Cloud offerings (e.g. AWS EMR, Azure HDInsight and Google Dataproc), and
> > > given that it is supported by both Spark and Flink I think it is valuable
> > > to test the support for YARN. The question is, should the tests be run on
> > > 'Standalone' OR 'YARN', or maybe we can have tests for 'Standalone AND
> > > YARN'?
> > >
> > > Ismael.
> > >
> > >
> > >
> > >
> > > On Thu, Jul 28, 2016 at 12:24 PM, Amit Sela <am...@gmail.com>
> > wrote:
> > >
> > > > Following a discussion I had with Kenneth and Dan here
> > > > <https://github.com/apache/incubator-beam/pull/711>. I want to raise
> > the
> > > > issue of which resource manager we should use for on going tests that
> > > will
> > > > run on actual clusters (on top of local/in-mem tests).
> > > > If we plan to test all runners on all their supported resource
> > managers,
> > > > great! But I guess this won't be the case, at least not at the
> > beginning.
> > > >
> > > > Spark can run it's own (Standalone Mode) resource manager, use YARN
> or
> > > use
> > > > Mesos. According to the latest survey
> > > > <
> > > >
> > >
> >
> http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
> > > > >
> > > > by
> > > > Databricks Standalone is in the lead (48%), with YARN tailing it
> > > > (40%) while Mesos looks like the least favourite.
> > > > For Spark, I'd vote for Standalone as it is the most popular use
> case +
> > > it
> > > > avoids the additional complexity of maintaining YARN on this cluster.
> > > > Having said that, AFAIK Flink is a "first-class" YARN citizen (right
> ?)
> > > and
> > > > I don't know what available resource managers can be used by other
> > > runners,
> > > > so I think runner authors should give their input here.
> > > >
> > > > *Summary:*
> > > > *Spark* - StandaloneMode or YARN (in that order).
> > > > *Flink * - ?
> > > > *Others* - ?
> > > >
> > > > Thanks,
> > > > Amit
> > > >
> > >
> >
>

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Amit Sela <am...@gmail.com>.

So what would be the preferred resource manager to test Flink on?

On Thu, Jul 28, 2016, 16:34 Aljoscha Krettek <al...@apache.org> wrote:

> Flink also has a standalone mode.

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Aljoscha Krettek <al...@apache.org>.

Flink also has a standalone mode.

On Thu, 28 Jul 2016 at 13:42 Ismaël Mejía <ie...@gmail.com> wrote:

> Good subject. YARN is the de facto standard, at least from the point of
> view of the Big Data distributions (Cloudera, Hortonworks, etc.) and cloud
> offerings (e.g. AWS EMR, Azure HDInsight and Google Dataproc), and given
> that it is supported by both Spark and Flink, I think it is valuable to
> test the support for YARN. The question is, should the tests be run on
> 'Standalone OR YARN', or maybe we can have tests for 'Standalone AND YARN'?
>
> Ismael.

Re: [DISCUSS] cluster infrastructure - resource manager - for on going tests

Posted by Ismaël Mejía <ie...@gmail.com>.

Good subject. YARN is the de facto standard, at least from the point of
view of the Big Data distributions (Cloudera, Hortonworks, etc.) and cloud
offerings (e.g. AWS EMR, Azure HDInsight and Google Dataproc), and given
that it is supported by both Spark and Flink, I think it is valuable to
test the support for YARN. The question is, should the tests be run on
'Standalone OR YARN', or maybe we can have tests for 'Standalone AND YARN'?

Ismael.




On Thu, Jul 28, 2016 at 12:24 PM, Amit Sela <am...@gmail.com> wrote:

> Following a discussion I had with Kenneth and Dan here
> <https://github.com/apache/incubator-beam/pull/711>, I want to raise the
> issue of which resource manager we should use for ongoing tests that will
> run on actual clusters (on top of local/in-mem tests).
> If we plan to test all runners on all their supported resource managers,
> great! But I guess this won't be the case, at least not at the beginning.
>
> Spark can run its own (Standalone Mode) resource manager, use YARN or use
> Mesos. According to the latest survey
> <http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf>
> by Databricks, Standalone is in the lead (48%), with YARN tailing it
> (40%), while Mesos looks like the least favourite.
> For Spark, I'd vote for Standalone as it is the most popular use case + it
> avoids the additional complexity of maintaining YARN on this cluster.
> Having said that, AFAIK Flink is a "first-class" YARN citizen (right?) and
> I don't know what available resource managers can be used by other runners,
> so I think runner authors should give their input here.
>
> *Summary:*
> *Spark* - StandaloneMode or YARN (in that order).
> *Flink* - ?
> *Others* - ?
>
> Thanks,
> Amit
>