You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Yifan LI <ia...@gmail.com> on 2015/10/12 12:03:58 UTC

"dynamically" sort a large collection?

Hey,

I need to scan a large "key-value" collection as below:

1) sort it on an attribute of “value” 
2) scan it one by one, from element with largest value
2.1) if the current element matches a pre-defined condition, its value will be reduced and the element will be inserted back to collection. 
if not, this current element should be removed from collection.


In my previous program, the 1) step can be easily conducted in Spark(RDD operation), but I am not sure how to do 2.1) step, esp. the “put/inserted back” operation on a sorted RDD.
I have tried to make a new RDD at every-time an element was found to inserted, but it is very costly due to a re-sorting…


Is there anyone having some ideas?

Thanks so much!

******************
an example:

the sorted result of initial collection C(on bold value), sortedC:
(1, (71, “aaa"))
(2, (60, “bbb"))
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(5, (29, “eee"))
…

pre-condition: its_value%2 == 0
if pre-condition is matched, its value will be reduce on half.

Thus:

#1:
71 is not matched, so this element is removed.
(1, (71, “aaa”)) —> removed!
(2, (60, “bbb"))
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(5, (29, “eee"))
…

#2:
60 is matched! 60/2 = 30, the collection right now should be as:
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(2, (30, “bbb”)) <— inserted back here
(5, (29, “eee"))
…






Best,
Yifan LI






Re: "dynamically" sort a large collection?

Posted by Yifan LI <ia...@gmail.com>.
Shiwei, yes, you might be right. Thanks. :)

Best,
Yifan LI





> On 12 Oct 2015, at 12:55, 郭士伟 <gu...@gmail.com> wrote:
> 
> I think this is not a problem Spark can solve effectively, cause RDD in immutable. Every time you want to change an RDD, you create a new one, and sort again. Maybe hbase or some other DB system will be a more suitable solution. Or, if the data can fit into memory, use a simple heap will work.
> 
> 
> 2015-10-12 18:29 GMT+08:00 Yifan LI <iamyifanli@gmail.com <ma...@gmail.com>>:
> Hey Adrian,
> 
> Thanks for your fast reply. :)
> 
> Actually the “pre-condition” is not fixed in real application, e.g. it would change based on counting of previous unmatched elements.
> So I need to use iterator operator, rather than flatMap-like operators…
> 
> Besides, do you have any idea on how to avoid that “sort again”? it is too costly… :(
> 
> Anyway thank you again!
> 
> Best,
> Yifan LI
> 
> 
> 
> 
> 
>> On 12 Oct 2015, at 12:19, Adrian Tanase <atanase@adobe.com <ma...@adobe.com>> wrote:
>> 
>> I think you’re looking for the flatMap (or flatMapValues) operator – you can do something like
>> 
>> sortedRdd.flatMapValues( v =>
>> If (v % 2 == 0) {
>> Some(v / 2)
>> } else {
>> None
>> }
>> )
>> 
>> Then you need to sort again.
>> 
>> -adrian
>> 
>> From: Yifan LI
>> Date: Monday, October 12, 2015 at 1:03 PM
>> To: spark users
>> Subject: "dynamically" sort a large collection?
>> 
>> Hey,
>> 
>> I need to scan a large "key-value" collection as below:
>> 
>> 1) sort it on an attribute of “value” 
>> 2) scan it one by one, from element with largest value
>> 2.1) if the current element matches a pre-defined condition, its value will be reduced and the element will be inserted back to collection. 
>> if not, this current element should be removed from collection.
>> 
>> 
>> In my previous program, the 1) step can be easily conducted in Spark(RDD operation), but I am not sure how to do 2.1) step, esp. the “put/inserted back” operation on a sorted RDD.
>> I have tried to make a new RDD at every-time an element was found to inserted, but it is very costly due to a re-sorting…
>> 
>> 
>> Is there anyone having some ideas?
>> 
>> Thanks so much!
>> 
>> ******************
>> an example:
>> 
>> the sorted result of initial collection C(on bold value), sortedC:
>> (1, (71, “aaa"))
>> (2, (60, “bbb"))
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (5, (29, “eee"))
>> …
>> 
>> pre-condition: its_value%2 == 0
>> if pre-condition is matched, its value will be reduce on half.
>> 
>> Thus:
>> 
>> #1:
>> 71 is not matched, so this element is removed.
>> (1, (71, “aaa”)) —> removed!
>> (2, (60, “bbb"))
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (5, (29, “eee"))
>> …
>> 
>> #2:
>> 60 is matched! 60/2 = 30, the collection right now should be as:
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (2, (30, “bbb”)) <— inserted back here
>> (5, (29, “eee"))
>> …
>> 
>> 
>> 
>> 
>> 
>> 
>> Best,
>> Yifan LI
>> 
>> 
>> 
>> 
>> 
> 
> 


Re: "dynamically" sort a large collection?

Posted by Yifan LI <ia...@gmail.com>.
Hey Adrian,

Thanks for your fast reply. :)

Actually the “pre-condition” is not fixed in real application, e.g. it would change based on counting of previous unmatched elements.
So I need to use iterator operator, rather than flatMap-like operators…

Besides, do you have any idea on how to avoid that “sort again”? it is too costly… :(

Anyway thank you again!

Best,
Yifan LI





> On 12 Oct 2015, at 12:19, Adrian Tanase <at...@adobe.com> wrote:
> 
> I think you’re looking for the flatMap (or flatMapValues) operator – you can do something like
> 
> sortedRdd.flatMapValues( v =>
> If (v % 2 == 0) {
> Some(v / 2)
> } else {
> None
> }
> )
> 
> Then you need to sort again.
> 
> -adrian
> 
> From: Yifan LI
> Date: Monday, October 12, 2015 at 1:03 PM
> To: spark users
> Subject: "dynamically" sort a large collection?
> 
> Hey,
> 
> I need to scan a large "key-value" collection as below:
> 
> 1) sort it on an attribute of “value” 
> 2) scan it one by one, from element with largest value
> 2.1) if the current element matches a pre-defined condition, its value will be reduced and the element will be inserted back to collection. 
> if not, this current element should be removed from collection.
> 
> 
> In my previous program, the 1) step can be easily conducted in Spark(RDD operation), but I am not sure how to do 2.1) step, esp. the “put/inserted back” operation on a sorted RDD.
> I have tried to make a new RDD at every-time an element was found to inserted, but it is very costly due to a re-sorting…
> 
> 
> Is there anyone having some ideas?
> 
> Thanks so much!
> 
> ******************
> an example:
> 
> the sorted result of initial collection C(on bold value), sortedC:
> (1, (71, “aaa"))
> (2, (60, “bbb"))
> (3, (53.5, “ccc”))
> (4, (48, “ddd”))
> (5, (29, “eee"))
> …
> 
> pre-condition: its_value%2 == 0
> if pre-condition is matched, its value will be reduce on half.
> 
> Thus:
> 
> #1:
> 71 is not matched, so this element is removed.
> (1, (71, “aaa”)) —> removed!
> (2, (60, “bbb"))
> (3, (53.5, “ccc”))
> (4, (48, “ddd”))
> (5, (29, “eee"))
> …
> 
> #2:
> 60 is matched! 60/2 = 30, the collection right now should be as:
> (3, (53.5, “ccc”))
> (4, (48, “ddd”))
> (2, (30, “bbb”)) <— inserted back here
> (5, (29, “eee"))
> …
> 
> 
> 
> 
> 
> 
> Best,
> Yifan LI
> 
> 
> 
> 
> 


Re: "dynamically" sort a large collection?

Posted by Adrian Tanase <at...@adobe.com>.
I think you’re looking for the flatMap (or flatMapValues) operator – you can do something like

sortedRdd.flatMapValues( v =>
If (v % 2 == 0) {
Some(v / 2)
} else {
None
}
)

Then you need to sort again.

-adrian

From: Yifan LI
Date: Monday, October 12, 2015 at 1:03 PM
To: spark users
Subject: "dynamically" sort a large collection?

Hey,

I need to scan a large "key-value" collection as below:

1) sort it on an attribute of “value”
2) scan it one by one, from element with largest value
2.1) if the current element matches a pre-defined condition, its value will be reduced and the element will be inserted back to collection.
if not, this current element should be removed from collection.


In my previous program, the 1) step can be easily conducted in Spark(RDD operation), but I am not sure how to do 2.1) step, esp. the “put/inserted back” operation on a sorted RDD.
I have tried to make a new RDD at every-time an element was found to inserted, but it is very costly due to a re-sorting…


Is there anyone having some ideas?

Thanks so much!

******************
an example:

the sorted result of initial collection C(on bold value), sortedC:
(1, (71, “aaa"))
(2, (60, “bbb"))
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(5, (29, “eee"))
…

pre-condition: its_value%2 == 0
if pre-condition is matched, its value will be reduce on half.

Thus:

#1:
71 is not matched, so this element is removed.
(1, (71, “aaa”)) —> removed!
(2, (60, “bbb"))
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(5, (29, “eee"))
…

#2:
60 is matched! 60/2 = 30, the collection right now should be as:
(3, (53.5, “ccc”))
(4, (48, “ddd”))
(2, (30, “bbb”)) <— inserted back here
(5, (29, “eee"))
…






Best,
Yifan LI