You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Jakub Dubovsky <sp...@gmail.com> on 2016/09/01 11:53:48 UTC

Why there is no top method in dataset api

Hey all,

in RDD api there is very usefull method called top. It finds top n records
in according to certain ordering without sorting all records. Very usefull!

There is no top method nor similar functionality in Dataset api. Has
anybody any clue why? Is there any specific reason for this?

Any thoughts?

thanks

Jakub D.

Re: Why there is no top method in dataset api

Posted by Jakub Dubovsky <sp...@gmail.com>.

Thanks Sean,

the important part of your answer for me is that orderBy + limit is doing
only "partial sort" because of optimizer. That's what I was missing. I will
give it a try...

J.D.

On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote:

> No, 
> I'm not advising you to use .rdd, just saying it is possible.
> Although I'd only use RDDs if you had a good reason to, given Datasets
> now, they are not gone or even deprecated.
>
> You do not need to order the whole data set to get the top eleme
> nt. That isn't what top does though. You might be interested to look at
> the source code. Nor is it what orderBy does if the optimizer is any good.
>
> Computing .rdd doesn't materialize an RDD. It involves some non-zero
> overhead in creating a plan, which should be minor compared to execution.
> So would any computation of "top N" on a Dataset, so I don't think this is
> relevant.
>
>
> orderBy + take is already the way to accomplish "Dataset.top". It works
> on Datasets, and therefore DataFrames too, for the reason you give. I'm not
> sure what you're asking there.
>
>
> On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky <sp...@gmail.com>
> wrote:
>
>> Thanks Sean,
>>
>> I was under impression that spark creators are trying to persuade user
>> community not to use RDD api directly. Spark summit I attended was full of
>> this. So I am a bit surprised that I hear use-rdd-api as an advice from
>> you. But if this is a way then I have a second question. For conversion
>> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
>> val it suggests there is some computation going on to create rdd as a copy.
>> The question is how much computationally expansive is this conversion? If
>> there is a significant overhead then it is clear why one would want to have
>> top method directly on Dataset class.
>>
>> Ordering whole dataset only to take first 10 or so top records is not
>> really an acceptable option for us. Comparison function can be expansive
>> and the size of dataset is (unsurprisingly) big.
>>
>> To be honest I do not really understand what do you mean by b). Since
>> DataFrame is now only an alias for Dataset[Row] what do you mean by
>> "DataFrame-like counterpart"?
>>
>> Thanks
>>
>> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> You can always call .rdd.top(n) of course. Although it's slightly
>>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
>>> easier way.
>>>
>>> I don't think if there's a strong reason other than it wasn't worth it
>>> to write this and many other utility wrappers that a) already exist on
>>> the underlying RDD API if you want them, and b) have a DataFrame-like
>>> counterpart already that doesn't really need wrapping in a different
>>> API.
>>>
>>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>>> <sp...@gmail.com> wrote:
>>> > Hey all,
>>> >
>>> > in RDD api there is very usefull method called top. It finds top n
>>> records
>>> > in according to certain ordering without sorting all records. Very
>>> usefull!
>>> >
>>> > There is no top method nor similar functionality in Dataset api. Has
>>> anybody
>>> > any clue why? Is there any specific reason for this?
>>> >
>>> > Any thoughts?
>>> >
>>> > thanks
>>> >
>>> > Jakub D.
>>>
>>
>>

Re: Why there is no top method in dataset api

Posted by Sean Owen <so...@cloudera.com>.

No, 
I'm not advising you to use .rdd, just saying it is possible.
Although I'd only use RDDs if you had a good reason to, given Datasets
now, they are not gone or even deprecated.

You do not need to order the whole data set to get the top eleme
nt. That isn't what top does though. You might be interested to look at
the source code. Nor is it what orderBy does if the optimizer is any good.

Computing .rdd doesn't materialize an RDD. It involves some non-zero
overhead in creating a plan, which should be minor compared to execution.
So would any computation of "top N" on a Dataset, so I don't think this is
relevant.


orderBy + take is already the way to accomplish "Dataset.top". It works on
Datasets, and therefore DataFrames too, for the reason you give. I'm not
sure what you're asking there.


On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky <sp...@gmail.com>
wrote:

> Thanks Sean,
>
> I was under impression that spark creators are trying to persuade user
> community not to use RDD api directly. Spark summit I attended was full of
> this. So I am a bit surprised that I hear use-rdd-api as an advice from
> you. But if this is a way then I have a second question. For conversion
> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
> val it suggests there is some computation going on to create rdd as a copy.
> The question is how much computationally expansive is this conversion? If
> there is a significant overhead then it is clear why one would want to have
> top method directly on Dataset class.
>
> Ordering whole dataset only to take first 10 or so top records is not
> really an acceptable option for us. Comparison function can be expansive
> and the size of dataset is (unsurprisingly) big.
>
> To be honest I do not really understand what do you mean by b). Since
> DataFrame is now only an alias for Dataset[Row] what do you mean by
> "DataFrame-like counterpart"?
>
> Thanks
>
> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> You can always call .rdd.top(n) of course. Although it's slightly
>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
>> easier way.
>>
>> I don't think if there's a strong reason other than it wasn't worth it
>> to write this and many other utility wrappers that a) already exist on
>> the underlying RDD API if you want them, and b) have a DataFrame-like
>> counterpart already that doesn't really need wrapping in a different
>> API.
>>
>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>> <sp...@gmail.com> wrote:
>> > Hey all,
>> >
>> > in RDD api there is very usefull method called top. It finds top n
>> records
>> > in according to certain ordering without sorting all records. Very
>> usefull!
>> >
>> > There is no top method nor similar functionality in Dataset api. Has
>> anybody
>> > any clue why? Is there any specific reason for this?
>> >
>> > Any thoughts?
>> >
>> > thanks
>> >
>> > Jakub D.
>>
>
>

Re: Why there is no top method in dataset api

Posted by Jakub Dubovsky <sp...@gmail.com>.

Thanks Sean,

I was under impression that spark creators are trying to persuade user
community not to use RDD api directly. Spark summit I attended was full of
this. So I am a bit surprised that I hear use-rdd-api as an advice from
you. But if this is a way then I have a second question. For conversion
from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
val it suggests there is some computation going on to create rdd as a copy.
The question is how much computationally expansive is this conversion? If
there is a significant overhead then it is clear why one would want to have
top method directly on Dataset class.

Ordering whole dataset only to take first 10 or so top records is not
really an acceptable option for us. Comparison function can be expansive
and the size of dataset is (unsurprisingly) big.

To be honest I do not really understand what do you mean by b). Since
DataFrame is now only an alias for Dataset[Row] what do you mean by
"DataFrame-like counterpart"?

Thanks

On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:

> You can always call .rdd.top(n) of course. Although it's slightly
> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
> easier way.
>
> I don't think if there's a strong reason other than it wasn't worth it
> to write this and many other utility wrappers that a) already exist on
> the underlying RDD API if you want them, and b) have a DataFrame-like
> counterpart already that doesn't really need wrapping in a different
> API.
>
> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
> <sp...@gmail.com> wrote:
> > Hey all,
> >
> > in RDD api there is very usefull method called top. It finds top n
> records
> > in according to certain ordering without sorting all records. Very
> usefull!
> >
> > There is no top method nor similar functionality in Dataset api. Has
> anybody
> > any clue why? Is there any specific reason for this?
> >
> > Any thoughts?
> >
> > thanks
> >
> > Jakub D.
>

Re: Why there is no top method in dataset api

Posted by Sean Owen <so...@cloudera.com>.

You can always call .rdd.top(n) of course. Although it's slightly
clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
easier way.

I don't think if there's a strong reason other than it wasn't worth it
to write this and many other utility wrappers that a) already exist on
the underlying RDD API if you want them, and b) have a DataFrame-like
counterpart already that doesn't really need wrapping in a different
API.

On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
<sp...@gmail.com> wrote:
> Hey all,
>
> in RDD api there is very usefull method called top. It finds top n records
> in according to certain ordering without sorting all records. Very usefull!
>
> There is no top method nor similar functionality in Dataset api. Has anybody
> any clue why? Is there any specific reason for this?
>
> Any thoughts?
>
> thanks
>
> Jakub D.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org