Posted to user@spark.apache.org by Takeshi Yamamuro <li...@gmail.com> on 2017/02/02 06:18:04 UTC

Re: increasing cross join speed

Hi,

I'm not sure how to speed up this kind of query on vanilla Spark alone,
but you can write custom physical plans for top-k queries.
You can check the links below as a reference:
benchmark: https://github.com/apache/incubator-hivemall/pull/33
manual:
https://github.com/apache/incubator-hivemall/blob/master/docs/gitbook/spark/misc/topk_join.md
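
For reference, the closest you can get on vanilla Spark is to run the join and
then keep only the k best matches per left-hand row with a window function; the
full join is still materialized, which is exactly the cost the custom physical
plan in the links above avoids. A rough sketch, assuming the result DataFrame
and column names from the query quoted below, with k as a placeholder:

  // Sketch only, not the Hivemall top_k_join API: keep the k nearest
  // right-hand rows per left-hand key after the cross join has produced
  // a "dist" column.
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  val k = 3  // placeholder top-k value
  val byLeftKey = Window.partitionBy("orgClassName1").orderBy(col("dist").asc)
  val topK = result
    .withColumn("rank", row_number().over(byLeftKey))
    .filter(col("rank") <= k)
    .drop("rank")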

I hope this helps.
Thanks,

// maropu


On Wed, Feb 1, 2017 at 6:35 AM, Kürşat Kurt <ku...@kursatkurt.com> wrote:

> Hi;
>
>
>
> I have 2 DataFrames. I am trying to cross join them to compute vector
> distances, so that I can then choose the most similar vectors.
>
> The cross join is too slow. How can I increase the speed, or do you have
> any suggestions for this comparison?
>
> val result = myDict.join(mainDataset).map { x =>
>   val orgClassName1 = x.getAs[SparseVector](1)
>   val orgClassName2 = x.getAs[SparseVector](2)
>   val f1 = x.getAs[SparseVector](3)
>   val f2 = x.getAs[SparseVector](4)
>   val dist = Vectors.sqdist(f1, f2)
>   (orgClassName1, orgClassName2, dist)
> }.toDF("orgClassName1", "orgClassName2", "dist")
>
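
For completeness, the quoted pipeline can also be written directly against the
DataFrame API, using the explicit crossJoin added in Spark 2.1 and a small UDF
around Vectors.sqdist. This is only a sketch: the column names (orgClassName,
features) are placeholders, since the original schema is not shown.

  // Sketch under assumed schemas: both DataFrames are assumed to carry an
  // "orgClassName" label column and an ml.linalg vector column "features".
  import org.apache.spark.ml.linalg.{Vector, Vectors}
  import org.apache.spark.sql.functions._

  val sqdist = udf { (v1: Vector, v2: Vector) => Vectors.sqdist(v1, v2) }

  val left  = myDict.select(col("orgClassName").as("orgClassName1"),
                            col("features").as("f1"))
  val right = mainDataset.select(col("orgClassName").as("orgClassName2"),
                                 col("features").as("f2"))

  val result = left.crossJoin(right)  // explicit cross join, Spark 2.1+
    .withColumn("dist", sqdist(col("f1"), col("f2")))
    .select("orgClassName1", "orgClassName2", "dist")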



-- 
---
Takeshi Yamamuro