Posted to user@crunch.apache.org by Kidong Lee <my...@gmail.com> on 2015/06/23 14:52:02 UTC

secondary sort in crunch on spark.

Hi,

I have been using Spark to implement our recommendation algorithm. It was hard to
get a secondary sort by value in Spark, so I implemented that part of the algorithm
with the help of Hive; as far as I know, Spark does not yet support secondary sort
directly.

I have recently implemented the same recommendation algorithm in Crunch running on
Spark, using Crunch's secondary sort API (roughly along the lines of the sketch below).
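For context, a minimal sketch of the kind of usage I mean, with placeholder names
(userId, timestamp, itemId) rather than our real schema; the secondary sort hands
each key's values to the DoFn ordered by the first element of the value pair:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.SecondarySort;
import org.apache.crunch.types.writable.Writables;

public class SecondarySortSketch {
  // events: userId -> (timestamp, itemId); emits "userId<TAB>lastItemId" per user,
  // relying on the secondary sort to order each user's values by timestamp.
  public static PCollection<String> lastItemPerUser(
      PTable<Integer, Pair<Long, String>> events) {
    return SecondarySort.sortAndApply(
        events,
        new DoFn<Pair<Integer, Iterable<Pair<Long, String>>>, String>() {
          @Override
          public void process(Pair<Integer, Iterable<Pair<Long, String>>> input,
                              Emitter<String> emitter) {
            String lastItem = null;
            for (Pair<Long, String> event : input.second()) {
              lastItem = event.second(); // values arrive sorted by timestamp
            }
            emitter.emit(input.first() + "\t" + lastItem);
          }
        },
        Writables.strings());
  }
}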

I am wondering how secondary sort is actually implemented in Crunch when it runs on
Spark. Can anybody explain how the secondary sort implementation in crunch-spark works?

thanks,

- Kidong.

Re: secondary sort in crunch on spark.

Posted by Micah Whitacre <mk...@gmail.com>.
There has been some investigation into Crunch on Tez; see [1]. I don't believe anyone
is actively working on it at the moment, but we'd love patches if someone has the time
to take it on.

[1] - https://issues.apache.org/jira/browse/CRUNCH-441

Re: secondary sort in crunch on spark.

Posted by Kidong Lee <my...@gmail.com>.
Thanks for your reply; it is very helpful for understanding the implementation.

I have another question: is there any plan to support Tez as an engine that Crunch
can run on?

- Kidong.

Re: secondary sort in crunch on spark.

Posted by Josh Wills <jw...@cloudera.com>.
Hey Kidong,

The short answer is that we cheat. The class to look at for the
implementation details is:

https://github.com/apache/crunch/blob/master/crunch-spark/src/main/java/org/apache/crunch/impl/spark/collect/PGroupedTableImpl.java

...and you sort of have to walk through the three different tricks we use to make
MapReduce partitioners, sorting classes, and grouping classes (all of which the
secondary sort implementation relies on) work on Spark.
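To give a flavor of it (this is not Crunch's actual code, just the general shape of
the MapReduce-style secondary-sort trick in plain Spark Java, with placeholder key
types and class names): partition on the natural key only, then sort each partition
by the full composite key.

import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public final class SecondarySortOnSparkSketch {

  // Partition on the natural key only (here: the Integer item id), so every record
  // for the same item lands in the same partition regardless of the secondary key.
  static class NaturalKeyPartitioner extends Partitioner {
    private final int numPartitions;
    NaturalKeyPartitioner(int numPartitions) { this.numPartitions = numPartitions; }
    @Override public int numPartitions() { return numPartitions; }
    @Override public int getPartition(Object key) {
      @SuppressWarnings("unchecked")
      Tuple2<Integer, Double> composite = (Tuple2<Integer, Double>) key;
      return Math.floorMod(composite._1().hashCode(), numPartitions);
    }
  }

  // Compare on the full composite key (natural key first, secondary key second), so
  // within each partition the values for a given item arrive in secondary-key order.
  static class CompositeKeyComparator
      implements Comparator<Tuple2<Integer, Double>>, Serializable {
    @Override public int compare(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
      int c = a._1().compareTo(b._1());
      return c != 0 ? c : a._2().compareTo(b._2());
    }
  }

  // Emulate MapReduce-style secondary sort: repartition by the natural key while
  // sorting each partition by the composite key.
  static JavaPairRDD<Tuple2<Integer, Double>, Long> secondarySort(
      JavaPairRDD<Tuple2<Integer, Double>, Long> rdd, int partitions) {
    return rdd.repartitionAndSortWithinPartitions(
        new NaturalKeyPartitioner(partitions), new CompositeKeyComparator());
  }
}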

J

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: secondary sort in crunch on spark.

Posted by David Ortiz <dp...@gmail.com>.
Correct me if I'm wrong, but if you are using an Avro record or a Tuple data
structure, couldn't you get a secondary sort by just putting the fields in the order
you want to sort on and then using the regular sort API? For example, say I had
itemid, itemprice, and nosold, and I wanted to do something like

select itemid, itemprice, sum(nosold) from table group by itemid, itemprice order by itemid, itemprice asc;

I could implement that roughly as

PTable<Pair<Integer, Double>, Long> items = {...some code to load the data into this structure...};
items.groupByKey().combineValues(Aggregators.SUM_LONGS()).sort()

and get something similar, right?
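Concretely, here is a rough sketch of what I have in mind, using a MemPipeline table
purely for illustration; the PType wiring and the use of Sort.sort for the final
ordering are my assumptions about how you would put it together:

import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.lib.Sort;
import org.apache.crunch.types.writable.Writables;

public class CompositeKeySortSketch {
  public static void main(String[] args) {
    // (itemid, itemprice) -> nosold, built in memory just for this example
    PTable<Pair<Integer, Double>, Long> items = MemPipeline.typedTableOf(
        Writables.tableOf(Writables.pairs(Writables.ints(), Writables.doubles()),
                          Writables.longs()),
        Pair.of(1, 9.99), 3L,
        Pair.of(1, 9.99), 2L,
        Pair.of(2, 4.50), 7L);

    // group by (itemid, itemprice) and sum nosold
    PTable<Pair<Integer, Double>, Long> summed =
        items.groupByKey().combineValues(Aggregators.SUM_LONGS());

    // sort by the composite key, i.e. by itemid first and itemprice second
    PTable<Pair<Integer, Double>, Long> sorted = Sort.sort(summed);

    for (Pair<Pair<Integer, Double>, Long> row : sorted.materialize()) {
      System.out.println(row);
    }
  }
}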

