Posted to user@spark.apache.org by Saba Sehrish <ss...@fnal.gov> on 2015/03/09 22:21:39 UTC

Top, takeOrdered, sortByKey


From: Saba Sehrish <ss...@fnal.gov>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <us...@spark.apache.org>
Subject: Using top, takeOrdered, sortByKey

I am using Spark for a template matching problem. We have 77 million events in the template library; we compare the energy of each input event with that of each template event and return a score. In the end we return the best 10000 matches, those with the lowest scores; a score of 0 is a perfect match.

I down-sampled the problem to use only 50k events. For matching a single input event against all 50k template events, the score calculation takes 150-200 ms on 25 cores (on a YARN cluster), but the subsequent top, takeOrdered, or even sortByKey call takes 25-50 s.
So far I have not been able to figure out why going from the full list of scores to a list of the best matches dominates the runtime by such a large factor. One thing I have noticed is that the number of elements returned does not matter: the time stays in the same 25-50 s range whether I return 10 or 10000 elements.

Any suggestions? Am I not using the API properly?

scores is a JavaPairRDD<Integer, Double>, numbestmatches is 10, 100, 10000, or any other count, and I do one of the following:

List<Tuple2<Integer, Double>> bestscores_list = scores.takeOrdered(numbestmatches, new TupleComparator());
Or
List<Tuple2<Integer, Double>> bestscores_list = scores.top(numbestmatches, new TupleComparator());
Or
List<Tuple2<Integer, Double>> bestscores_list = scores.sortByKey().take(numbestmatches);

(sortByKey() returns a JavaPairRDD, so take() is needed to get a List.)
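The TupleComparator itself is not shown in the thread; below is a hypothetical sketch of what it might look like, using java.util.Map.Entry in place of scala.Tuple2 so the snippet compiles on its own. In real Spark code the comparator must also implement java.io.Serializable, since it is shipped to the executors. Note that with an ascending-by-score comparator, takeOrdered returns the smallest (best) scores while top returns the largest, so takeOrdered is the call that matches "best 10000 matches with lowest score".

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TupleComparatorSketch {
    // Stand-in for the TupleComparator from the mail: orders (eventId, score)
    // pairs by ascending score, so the best (lowest-score) matches come first.
    // Map.Entry is used instead of scala.Tuple2 only to keep this sketch
    // self-contained; the real class must also implement java.io.Serializable.
    static final Comparator<Map.Entry<Integer, Double>> BY_SCORE_ASC =
            Comparator.comparingDouble(Map.Entry::getValue);

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> scores = new ArrayList<>();
        scores.add(new SimpleEntry<>(1, 0.7));
        scores.add(new SimpleEntry<>(2, 0.0)); // a perfect match
        scores.add(new SimpleEntry<>(3, 0.3));

        scores.sort(BY_SCORE_ASC);
        System.out.println(scores.get(0).getKey()); // event 2 has the best score
    }
}
```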

Re: Top, takeOrdered, sortByKey

Posted by Saba Sehrish <ss...@fnal.gov>.
Let me clarify - taking the best 10000 of 50000 elements using top or takeOrdered takes about 25-50s, which seems slow. I also tried sortByKey to sort the elements for a time estimate, and I get numbers in the same range.
I'm running this application on a cluster with 5 nodes, using 3 cores on each.

I'm interested in knowing what tuning can be done to get top/takeOrdered below the 25-50s I am currently seeing. I eventually want to scale up to 77 million elements, but right now I want to see how the performance of top or takeOrdered could be improved for the smaller sample I'm using.

On Mar 11, 2015, at 7:24 PM, Imran Rashid <ir...@cloudera.com> wrote:



Re: Top, takeOrdered, sortByKey

Posted by Imran Rashid <ir...@cloudera.com>.
I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

If so, this shouldn't come as a huge surprise. You can't find the top-scoring elements (no matter how small N is) unless you score all 77M of them. Very naively, you would expect scoring 77M events to take ~1500 times as long as scoring 50k events, right? The fact that it doesn't take that much longer is probably because of the overhead of just launching the jobs.
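The observation that the time barely changes between N = 10 and N = 10000 fits this explanation: selecting the N smallest of n scores with a bounded heap (the same general technique takeOrdered applies per partition) examines every element exactly once, so the cost is driven by n, not N. A minimal, Spark-free Java sketch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    // Return the n smallest scores using a bounded max-heap: every element is
    // examined exactly once, so the cost is O(total * log n) -- essentially
    // independent of n when n is small, which matches the observation that
    // returning 10 or 10000 elements takes the same time.
    static List<Double> smallestN(double[] scores, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);               // heap not full yet: always keep
            } else if (s < heap.peek()) {
                heap.poll();               // evict the current worst kept score
                heap.add(s);
            }
        }
        List<Double> out = new ArrayList<>(heap);
        Collections.sort(out);             // ascending: best (lowest) scores first
        return out;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.1, 0.0, 0.5, 0.3};
        System.out.println(smallestN(scores, 3)); // [0.0, 0.1, 0.3]
    }
}
```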



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish <ss...@fnal.gov> wrote: