Posted to user@spark.apache.org by Diana Carroll <dc...@cloudera.com> on 2014/04/07 17:42:00 UTC

non-lazy execution of sortByKey?

Until today, I was under the impression that *all* Spark transformations
were "lazy"...that is, they wouldn't actually execute until an *action*
such as count or take was performed.

However, today I'm using the "sortByKey" transformation, which appears
to execute immediately, rather than as a result of an action.  Am I
misunderstanding something, is this a bug, or is this a deliberate
difference between sortByKey and other transformations?

Here's my test. I'm parsing a bunch of weblog files and I want to know
which users made the most requests.  So my code pulls out the third field of
each line (the user ID, words(2)), adds up the total number of hits for each
user ID, swaps user ID and hit count, and sorts by hit count.

var userreqs =
sc.textFile("file:/home/training/training_materials/sparkdev/data/weblogs/*").
   map(_.split(" ")).
   map(words => (words(2),1)).
   reduceByKey(_ + _).
   map(pair => (pair._2,pair._1)).
   sortByKey(false)

I thought nothing would actually happen here until I did userreqs.take(10)
but actually it did execute without the take(). It took about a minute for
it to complete and if I look at the web UI I see completed execution of 3
stages:  (Why is sortByKey two stages?)

[image: Inline image 2]

Something else about this strikes me as odd, too.  If I follow this command
by userreqs.take(10), I think it executes the whole thing all over again,
but doesn't show all the stages: stage 3 is missing in the UI:
[image: Inline image 3]


Plus it seems to automatically be caching my results?  Because when I
execute "take(10)" repeatedly, subsequent executions are very fast, and
trigger only a single stage:

[image: Inline image 4]

And I confirmed it is caching because I tried deleting the underlying files
and the take() still worked.

Anyone have any insight?

Diana

Re: non-lazy execution of sortByKey?

Posted by Matei Zaharia <ma...@gmail.com>.

Yeah, the reason it happens is that sortByKey tries to sample the data to figure out the right range partitions for it. But we could do this later, as the suggestion in that JIRA issue (SPARK-1021) says.

Matei
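
To illustrate the point above, here is a minimal sketch in plain Scala (no Spark dependency) of why computing range-partition bounds from a sample forces a pass over the data, and how deferring that step would restore laziness. This is not Spark's implementation: LazyData, rangeBounds, EagerSorter and DeferredSorter are hypothetical names invented for this example, and the sample size and seed are arbitrary assumptions.

```scala
import scala.util.Random

object RangeBoundsSketch {
  // Hypothetical stand-in for a lazily evaluated dataset:
  // nothing is produced until compute() is called.
  final class LazyData[A](source: () => Iterator[A]) {
    var passes = 0 // how many times the data has been traversed
    def compute(): Iterator[A] = { passes += 1; source() }
  }

  // Picks (numPartitions - 1) split points from a random sample of the keys.
  // Taking the sample requires traversing the data (assumed non-empty).
  def rangeBounds(data: LazyData[Int], numPartitions: Int,
                  sampleSize: Int, seed: Long): Vector[Int] = {
    val rng = new Random(seed)
    val all = data.compute().toVector
    val sample = Vector.fill(sampleSize)(all(rng.nextInt(all.length))).sorted
    (1 until numPartitions).map(i => sample(i * sample.length / numPartitions)).toVector
  }

  // Eager variant: bounds are computed as soon as the sorter is constructed,
  // so merely defining the sort triggers work.
  final class EagerSorter(data: LazyData[Int], numPartitions: Int) {
    val bounds: Vector[Int] = rangeBounds(data, numPartitions, 100, 42L)
  }

  // Deferred variant: bounds are computed only when first needed.
  final class DeferredSorter(data: LazyData[Int], numPartitions: Int) {
    lazy val bounds: Vector[Int] = rangeBounds(data, numPartitions, 100, 42L)
  }

  def main(args: Array[String]): Unit = {
    val d1 = new LazyData(() => Iterator.range(0, 1000))
    val eager = new EagerSorter(d1, 4)
    println(s"eager: passes after construction = ${d1.passes}")

    val d2 = new LazyData(() => Iterator.range(0, 1000))
    val deferred = new DeferredSorter(d2, 4)
    println(s"deferred: passes after construction = ${d2.passes}")
    deferred.bounds // first use triggers the sampling pass
    println(s"deferred: passes after first use = ${d2.passes}")
  }
}
```

In this sketch the eager variant has already traversed the data once by the time the constructor returns, which mirrors the behavior described in the thread, while the deferred variant does no work until the bounds are first requested.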

Re: non-lazy execution of sortByKey?

Posted by Diana Carroll <dc...@cloudera.com>.

Aha!  Well I'm not crazy then, thanks.



Re: non-lazy execution of sortByKey?

Posted by Mark Hamstra <ma...@clearstorydata.com>.

https://issues.apache.org/jira/browse/SPARK-1021?jql=text%20~%20%22sortByKey%22

