Posted to user@spark.apache.org by Alessandro Baretta <al...@gmail.com> on 2014/12/18 07:08:29 UTC

Spark Shell slowness on Google Cloud

All,

I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split into a
half dozen directories. My code constructs an object--let me call it the
Dataset object--that defines a distinct RDD for each directory. The
constructor of the object only defines the RDDs; it does not actually
evaluate them, so I would expect it to return very quickly. Indeed, the
logging code in the constructor prints a line signaling the completion of
the code almost immediately after invocation, but the Spark shell does not
show the prompt right away. Instead, it spends a few minutes seemingly
frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9

14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759

14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228

14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076

14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013

14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?

Thanks.


Alex
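
For reference, here is a minimal sketch of what a Dataset object like the
one described above might look like. The class, field, and path names are
hypothetical; only the pattern of defining one lazy textFile RDD per
directory is taken from the report:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical reconstruction: one RDD per GCS directory.
// sc.textFile only defines the RDD; no listing or read happens here.
class Dataset(sc: SparkContext, dirs: Seq[String]) {
  val rdds: Map[String, RDD[String]] =
    dirs.map(dir => dir -> sc.textFile("gs://my-bucket/" + dir + "/*")).toMap
  println("Dataset constructed") // prints immediately, as observed
}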

Re: Spark Shell slowness on Google Cloud

Posted by Denny Lee <de...@gmail.com>.
To connect to GCS, Spark relies on the Hadoop and GCS connector jars. I'm
wondering if it's those connection points that are ultimately slowing
things down between Spark and GCS.

The reason I was asking if you could run bdutil is that it would
basically be Hadoop connecting to GCS. If it's just as slow, that would
point to the root cause: namely, that it's the "Hadoop" connection that
is slowing things down rather than something in Spark per se.
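For example, something along these lines (assuming the GCS connector is
configured on the cluster, as it would be on a bdutil deployment, and
reusing the placeholder bucket path from your gsutil test) would time the
same listing through the Hadoop layer alone:

time hadoop fs -ls "gs://my-bucket/20141205/csv/*/*/*" | wc -l

If that takes minutes while gsutil takes seconds, it would point at the
Hadoop/GCS connector path rather than at Spark itself.
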
On Wed, Dec 17, 2014 at 23:25 Alessandro Baretta <al...@gmail.com>
wrote:

> Well, what do you suggest I run to test this? But more importantly, what
> information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <de...@gmail.com> wrote:
>>
>> Oh, it makes sense that gsutil scans through this quickly, but I was
>> wondering if running a Hadoop job / bdutil would result in just as fast
>> scans?
>>
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
>> alexbaretta@gmail.com> wrote:
>>
>>> Denny,
>>>
>>> No, gsutil scans through the listing of the bucket quickly. See the
>>> following.
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>>
>>> 6860
>>>
>>> real    0m6.971s
>>> user    0m1.052s
>>> sys     0m0.096s
>>>
>>> Alex
>>>
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com>
>>> wrote:
>>>>
>>>> I'm curious if you're seeing the same thing when using bdutil against
>>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>>> alexbaretta@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> I'm using the Spark shell to interact with a small test deployment of
>>>>> Spark, built from the current master branch. I'm processing a dataset
>>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>>> half dozen directories. My code constructs an object--let me call it the
>>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>>> constructor of the object only defines the RDDs; it does not actually
>>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>>> logging code in the constructor prints a line signaling the completion of
>>>>> the code almost immediately after invocation, but the Spark shell does not
>>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>>> frozen, eventually producing the following output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 9
>>>>>
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 759
>>>>>
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 228
>>>>>
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 3076
>>>>>
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 1013
>>>>>
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>

Re: Spark Shell slowness on Google Cloud

Posted by Alessandro Baretta <al...@gmail.com>.
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes the following:
what computation happens when calling the union method on two RDDs?
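
My working hypothesis (unverified) is that union itself is lazy--it just
builds a UnionRDD--and that the cost comes when the union's partitions are
computed, because that asks each underlying Hadoop RDD for its input
splits via FileInputFormat.listStatus: exactly the call that prints the
"Total input paths to process : N" lines from my first message, and on
GCS each listing is a round of remote metadata requests. A minimal shell
sketch of the pattern (placeholder paths; assuming the standard Scala API):

// Defining the per-directory RDDs and their union is lazy.
val dirs = Seq("a", "b", "c") // placeholder directory names
val parts = dirs.map(d => sc.textFile("gs://my-bucket/20141205/csv/" + d + "/*"))
val all = parts.reduce(_ union _) // just builds a UnionRDD

// Anything that needs the partition list, for example:
val n = all.partitions.length
// forces FileInputFormat.listStatus on every input directory, one
// directory at a time -- the log lines quoted in my first message.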

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta <al...@gmail.com>
wrote:
>
> Well, what do you suggest I run to test this? But more importantly, what
> information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <de...@gmail.com> wrote:
>>
>> Oh, it makes sense that gsutil scans through this quickly, but I was
>> wondering if running a Hadoop job / bdutil would result in just as fast
>> scans?
>>
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
>> alexbaretta@gmail.com> wrote:
>>
>>> Denny,
>>>
>>> No, gsutil scans through the listing of the bucket quickly. See the
>>> following.
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>>
>>> 6860
>>>
>>> real    0m6.971s
>>> user    0m1.052s
>>> sys     0m0.096s
>>>
>>> Alex
>>>
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com>
>>> wrote:
>>>>
>>>> I'm curious if you're seeing the same thing when using bdutil against
>>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>>> alexbaretta@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> I'm using the Spark shell to interact with a small test deployment of
>>>>> Spark, built from the current master branch. I'm processing a dataset
>>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>>> half dozen directories. My code constructs an object--let me call it the
>>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>>> constructor of the object only defines the RDDs; it does not actually
>>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>>> logging code in the constructor prints a line signaling the completion of
>>>>> the code almost immediately after invocation, but the Spark shell does not
>>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>>> frozen, eventually producing the following output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 9
>>>>>
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 759
>>>>>
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 228
>>>>>
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 3076
>>>>>
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 1013
>>>>>
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>

Re: Spark Shell slowness on Google Cloud

Posted by Alessandro Baretta <al...@gmail.com>.
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?

On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <de...@gmail.com> wrote:
>
> Oh, it makes sense that gsutil scans through this quickly, but I was
> wondering if running a Hadoop job / bdutil would result in just as fast
> scans?
>
>
> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
> alexbaretta@gmail.com> wrote:
>
>> Denny,
>>
>> No, gsutil scans through the listing of the bucket quickly. See the
>> following.
>>
>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>
>> 6860
>>
>> real    0m6.971s
>> user    0m1.052s
>> sys     0m0.096s
>>
>> Alex
>>
>>
>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com>
>> wrote:
>>>
>>> I'm curious if you're seeing the same thing when using bdutil against
>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>
>>>
>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>> alexbaretta@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> I'm using the Spark shell to interact with a small test deployment of
>>>> Spark, built from the current master branch. I'm processing a dataset
>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>> half dozen directories. My code constructs an object--let me call it the
>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>> constructor of the object only defines the RDDs; it does not actually
>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>> logging code in the constructor prints a line signaling the completion of
>>>> the code almost immediately after invocation, but the Spark shell does not
>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>> frozen, eventually producing the following output:
>>>>
>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 9
>>>>
>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 759
>>>>
>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 228
>>>>
>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 3076
>>>>
>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 1013
>>>>
>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 156
>>>>
>>>> This stage is inexplicably slow. What could be happening?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Alex
>>>>
>>>

Re: Spark Shell slowness on Google Cloud

Posted by Denny Lee <de...@gmail.com>.
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in just as fast
scans?

On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <al...@gmail.com>
wrote:

> Denny,
>
> No, gsutil scans through the listing of the bucket quickly. See the
> following.
>
> alex@hadoop-m:~/split$ time bash -c "gsutil ls
> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>
> 6860
>
> real    0m6.971s
> user    0m1.052s
> sys     0m0.096s
>
> Alex
>
>
> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com> wrote:
>>
>> I'm curious if you're seeing the same thing when using bdutil against
>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>> Spark -> Hadoop -> GCS Connector -> GCS.
>>
>>
>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>> alexbaretta@gmail.com> wrote:
>>
>>> All,
>>>
>>> I'm using the Spark shell to interact with a small test deployment of
>>> Spark, built from the current master branch. I'm processing a dataset
>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>> half dozen directories. My code constructs an object--let me call it the
>>> Dataset object--that defines a distinct RDD for each directory. The
>>> constructor of the object only defines the RDDs; it does not actually
>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>> logging code in the constructor prints a line signaling the completion of
>>> the code almost immediately after invocation, but the Spark shell does not
>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>> frozen, eventually producing the following output:
>>>
>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>> process : 9
>>>
>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>> process : 759
>>>
>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>> process : 228
>>>
>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>> process : 3076
>>>
>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>> process : 1013
>>>
>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>> process : 156
>>>
>>> This stage is inexplicably slow. What could be happening?
>>>
>>> Thanks.
>>>
>>>
>>> Alex
>>>
>>

Re: Spark Shell slowness on Google Cloud

Posted by Alessandro Baretta <al...@gmail.com>.
Denny,

No, gsutil scans through the listing of the bucket quickly. See the
following.

alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com> wrote:
>
> I'm curious if you're seeing the same thing when using bdutil against
> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
> Spark -> Hadoop -> GCS Connector -> GCS.
>
>
> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
> alexbaretta@gmail.com> wrote:
>
>> All,
>>
>> I'm using the Spark shell to interact with a small test deployment of
>> Spark, built from the current master branch. I'm processing a dataset
>> comprising a few thousand objects on Google Cloud Storage, split into a
>> half dozen directories. My code constructs an object--let me call it the
>> Dataset object--that defines a distinct RDD for each directory. The
>> constructor of the object only defines the RDDs; it does not actually
>> evaluate them, so I would expect it to return very quickly. Indeed, the
>> logging code in the constructor prints a line signaling the completion of
>> the code almost immediately after invocation, but the Spark shell does not
>> show the prompt right away. Instead, it spends a few minutes seemingly
>> frozen, eventually producing the following output:
>>
>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>> process : 9
>>
>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>> process : 759
>>
>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>> process : 228
>>
>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>> process : 3076
>>
>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>> process : 1013
>>
>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>> process : 156
>>
>> This stage is inexplicably slow. What could be happening?
>>
>> Thanks.
>>
>>
>> Alex
>>
>

Re: Spark Shell slowness on Google Cloud

Posted by Denny Lee <de...@gmail.com>.
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue concerning the transfer rate of Spark
-> Hadoop -> GCS Connector -> GCS.

On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <al...@gmail.com>
wrote:

> All,
>
> I'm using the Spark shell to interact with a small test deployment of
> Spark, built from the current master branch. I'm processing a dataset
> comprising a few thousand objects on Google Cloud Storage, split into a
> half dozen directories. My code constructs an object--let me call it the
> Dataset object--that defines a distinct RDD for each directory. The
> constructor of the object only defines the RDDs; it does not actually
> evaluate them, so I would expect it to return very quickly. Indeed, the
> logging code in the constructor prints a line signaling the completion of
> the code almost immediately after invocation, but the Spark shell does not
> show the prompt right away. Instead, it spends a few minutes seemingly
> frozen, eventually producing the following output:
>
> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
> process : 9
>
> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
> process : 759
>
> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
> process : 228
>
> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
> process : 3076
>
> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
> process : 1013
>
> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
> process : 156
>
> This stage is inexplicably slow. What could be happening?
>
> Thanks.
>
>
> Alex
>