Posted to dev@spark.apache.org by Alessandro Baretta <al...@gmail.com> on 2014/12/18 08:38:16 UTC

Re: Spark Shell slowness on Google Cloud

Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes: what
computation happens when calling the union method on two RDDs?
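
For concreteness, the pattern is roughly this -- a minimal sketch with
illustrative paths, run from the spark-shell:

// illustrative GCS paths, not the real bucket layout
val dirs = Seq(
  "gs://my-bucket/20141205/csv/a/*/*",
  "gs://my-bucket/20141205/csv/b/*/*")
// define one RDD per directory, then union them pairwise
val combined = dirs.map(sc.textFile(_)).reduce(_ union _)

union is documented as a lazy transformation, so I would not expect the
definition of the combined RDD to touch storage at all.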

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta <al...@gmail.com>
wrote:
>
> Well, what do you suggest I run to test this? But more importantly, what
> information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee <de...@gmail.com> wrote:
>>
>> Oh, it makes sense that gsutil scans through this quickly, but I was
>> wondering if running a Hadoop job / bdutil would result in just as fast
>> scans?
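>>
>> For comparison, something along these lines would time the same listing
>> through the Hadoop GCS connector instead of gsutil (same path as in your
>> gsutil timing below; substitute the real layout):
>>
>> # times a recursive glob listing via the Hadoop GCS connector
>> time bash -c "hadoop fs -ls gs://my-bucket/20141205/csv/*/*/* | wc -l"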
>>
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <alexbaretta@gmail.com> wrote:
>>
>>> Denny,
>>>
>>> No, gsutil scans through the listing of the bucket quickly. See the
>>> following.
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>>
>>> 6860
>>>
>>> real    0m6.971s
>>> user    0m1.052s
>>> sys     0m0.096s
>>>
>>> Alex
>>>
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <de...@gmail.com>
>>> wrote:
>>>>
>>>> I'm curious if you're seeing the same thing when using bdutil against
>>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <alexbaretta@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> I'm using the Spark shell to interact with a small test deployment of
>>>>> Spark, built from the current master branch. I'm processing a dataset
>>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>>> half dozen directories. My code constructs an object--let me call it the
>>>>> Dataset object--that defines a distinct RDD for each directory; a rough
>>>>> sketch follows the log output below. The constructor only defines the
>>>>> RDDs; it does not actually evaluate them, so I would expect it to return
>>>>> very quickly. Indeed, the logging code in the constructor prints a line
>>>>> signaling its completion almost immediately after invocation, but the
>>>>> Spark shell does not show the prompt right away. Instead, it sits
>>>>> seemingly frozen for a few minutes, eventually producing the following
>>>>> output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening?
>>>>>
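>>>>> For reference, the construction is roughly of this shape -- a minimal
>>>>> sketch, with illustrative class, field, and path names rather than the
>>>>> real ones:
>>>>>
>>>>> import org.apache.spark.rdd.RDD
>>>>>
>>>>> class Dataset(dirs: Seq[String]) {
>>>>>   // One RDD per directory; textFile only defines the RDD here,
>>>>>   // it should not list or read any files yet.
>>>>>   val rdds: Seq[RDD[String]] = dirs.map(d => sc.textFile(d))
>>>>>   println("Dataset constructed") // this prints almost immediately
>>>>> }
>>>>>
>>>>> val ds = new Dataset(Seq(
>>>>>   "gs://my-bucket/20141205/csv/a", "gs://my-bucket/20141205/csv/b"))
>>>>>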
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>