Posted to user@spark.apache.org by Sean Owen <so...@cloudera.com> on 2014/12/31 19:21:40 UTC
Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
-dev, +user
A decent guess: does your 'save' function entail collecting data back
to the driver? And are you running this from a machine that's not in
your Spark cluster? Then in client mode you're shipping data back to a
less-nearby machine than you would in cluster mode. That could explain
the bottleneck.
On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <es...@gmail.com> wrote:
> Hi,
>
> I have a very, very simple streaming job. When I deploy this on the exact
> same cluster, with the exact same parameters, I see a big (40%) performance
> difference between "client" and "cluster" deployment mode. This seems a bit
> surprising. Is this expected?
>
> The streaming job is:
>
> val msgStream = kafkaStream
>   .map { case (k, v) => v }
>   .map(DatatypeConverter.printBase64Binary)
>   .foreachRDD(save)
>   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>
> I tried several times, but the job deployed with "client" mode can only
> write at 60% of the throughput of the job deployed with "cluster" mode, and
> this happens consistently. I'm logging at INFO level, but my application code
> doesn't log anything, so these are only Spark logs. The logs I see in "client"
> mode don't seem like a crazy amount.
>
> The setup is:
> spark-ec2 [...] \
>   --copy-aws-credentials \
>   --instance-type=m3.2xlarge \
>   -s 2 launch test_cluster
>
> And all the deployment was done from the master machine.
>
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
Posted by Enno Shioji <es...@gmail.com>.
Hi Tathagata,
It's a standalone cluster. The submit commands are:
== CLIENT
spark-submit --class com.fake.Test \
  --deploy-mode client --master spark://fake.com:7077 \
  fake.jar <arguments>
== CLUSTER
spark-submit --class com.fake.Test \
  --deploy-mode cluster --master spark://fake.com:7077 \
  s3n://fake.jar <arguments>
And they are both occupying all available slots (8 cores * 2 machines = 16 slots).
On Thu, Jan 1, 2015 at 12:21 AM, Tathagata Das <ta...@gmail.com>
wrote:
> What are your spark-submit commands in both cases? Is it Spark Standalone or
> YARN (both support client and cluster modes)? Accordingly, what is the number
> of executors/cores requested?
>
> TD
Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
Posted by Tathagata Das <ta...@gmail.com>.
What are your spark-submit commands in both cases? Is it Spark Standalone or
YARN (both support client and cluster modes)? Accordingly, what is the number
of executors/cores requested?
TD
On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji <es...@gmail.com> wrote:
> Also the job was deployed from the master machine in the cluster.
Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
Posted by Enno Shioji <es...@gmail.com>.
Also the job was deployed from the master machine in the cluster.
On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <es...@gmail.com> wrote:
> Oh sorry, that was an edit mistake. The code is essentially:
>
> val msgStream = kafkaStream
>   .map { case (k, v) => v }
>   .map(DatatypeConverter.printBase64Binary)
>   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>
> I.e. there is essentially no original code (I was calling saveAsTextFile
> in a "save" function, but that was just a remnant from previous debugging).
>
Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?
Posted by Enno Shioji <es...@gmail.com>.
Oh sorry, that was an edit mistake. The code is essentially:
val msgStream = kafkaStream
  .map { case (k, v) => v }
  .map(DatatypeConverter.printBase64Binary)
  .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
I.e. there is essentially no original code (I was calling saveAsTextFile in
a "save" function, but that was just a remnant from previous debugging).
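[Editor's note: the per-record transform in the pipeline above can be sketched standalone, outside Spark, as plain Scala. The record tuple and object name here are made up for illustration, and java.util.Base64 is used as the portable equivalent of DatatypeConverter.printBase64Binary, which was removed from the JDK after Java 8.]

```scala
// Standalone sketch (not from the thread) of the per-record transform:
// drop the Kafka key, base64-encode the value bytes.
import java.util.Base64

object TransformSketch {
  def main(args: Array[String]): Unit = {
    // A stand-in for one Kafka (key, value) record.
    val record: (String, Array[Byte]) = ("some-key", "hello".getBytes("UTF-8"))

    // .map { case (k, v) => v } keeps only the value.
    val value = record match { case (_, v) => v }

    // .map(printBase64Binary) turns the bytes into a base64 string;
    // java.util.Base64 produces the same encoding.
    val encoded = Base64.getEncoder.encodeToString(value)
    println(encoded)  // prints "aGVsbG8="
  }
}
```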
On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote:
> -dev, +user
>
> A decent guess: does your 'save' function entail collecting data back
> to the driver? And are you running this from a machine that's not in
> your Spark cluster? Then in client mode you're shipping data back to a
> less-nearby machine than you would in cluster mode. That could explain
> the bottleneck.
>