Posted to user@spark.apache.org by Priya Ch <le...@gmail.com> on 2016/05/25 07:05:12 UTC

Cartesian join on RDDs taking too much time

Hi All,

  I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?

I am using Spark version 1.6.0.

Regards,
Padma Ch

Re: Cartesian join on RDDs taking too much time

Posted by Max Sperlich <ma...@gmail.com>.
Cartesian joins tend to give a huge result size and are inherently slow.
If RDD B has N records, then your result size will be at least N * 30 MB,
since all the rows of A are replicated for each record in B.

Assuming RDD B has 10,000 records, your cartesian join will give an RDD
that takes at least 10,000 * 30 MB = 300 GB, presumably more than the RAM
on your system...
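
For illustration, a minimal Spark (Scala) sketch of the record-count
blow-up; the counts here are hypothetical, not from the thread:

    // A cartesian product yields |A| * |B| output records before any
    // filtering can run, regardless of how small each record is.
    val a = sc.parallelize(1 to 1000)
    val b = sc.parallelize(1 to 10000)
    a.cartesian(b).count()  // 1000 * 10000 = 10,000,000 pairs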

On Wed, May 25, 2016 at 3:05 AM, Priya Ch <le...@gmail.com>
wrote:

> Hi All,
>
>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
> A.cartesian(B) is taking too much time. Is there any bottleneck in the
> cartesian operation?
>
> I am using Spark version 1.6.0.
>
> Regards,
> Padma Ch
>

Re: Cartesian join on RDDs taking too much time

Posted by Sonal Goyal <so...@gmail.com>.
You can look at ways to group records from both RDDs together instead of
doing a cartesian product. Say, generate a pair RDD from each with the
first letter as the key, then partition and join; a sketch follows below.
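
A minimal Scala sketch of this blocking idea, for illustration only: rddA
and rddB are assumed to be RDD[String] with non-empty strings, and
similarity() is a crude hypothetical stand-in for whatever fuzzy scorer you
actually use (e.g. a Levenshtein ratio):

    // Hypothetical scorer: character-overlap ratio, just a placeholder.
    def similarity(a: String, b: String): Double =
      a.toSet.intersect(b.toSet).size.toDouble / math.max(a.length, b.length)

    val aByKey = rddA.keyBy(_.head)    // RDD[(Char, String)]
    val bByKey = rddB.keyBy(_.head)
    val matches = aByKey.join(bByKey)  // only pairs sharing a key are compared
      .values                          // RDD[(String, String)]
      .map { case (x, y) => (x, y, similarity(x, y)) }
      .filter { case (_, _, score) => score > 0.85 }

Note the trade-off: blocking on the first character prunes most of the
cross product, but it also silently drops any match whose strings start
with different characters.
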
On May 25, 2016 8:04 PM, "Priya Ch" <le...@gmail.com> wrote:

> Hi,
>   RDD A is 30 MB and RDD B is 8 MB in size. Upon matching, we would
> like to filter out the strings that have a greater than 85% match and
> generate a score for each, which is used in the subsequent calculations.
>
> I tried generating a pair RDD from both RDDs A and B with the same key for
> all the records. Now performing A.join(B) is also resulting in huge
> execution time.
>
> How do I go about with map and reduce here? To generate pairs from 2 RDDs,
> I don't think map can be used, because we cannot have an RDD inside
> another RDD.
>
> Would be glad if you can throw me some light on this.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> Solr or Elasticsearch provide much more functionality and are faster in
>> this context. The decision for or against them depends on your current and
>> future use cases. Your current use case is still very abstract, so in
>> order to get a proper recommendation you need to provide more details,
>> including the size of the dataset and what you do with the result of the
>> matching: do you just need the match count, or also the matching pairs,
>> etc.?
>>
>> Your concrete problem can also be solved in Spark (though it is not the
>> most efficient tool for this, it has other strengths) using map and
>> reduce steps. There are different ways to implement this: generate pairs
>> from the input datasets in the map step, or (maybe less advisable)
>> broadcast the smaller dataset to all nodes and do the matching against
>> the bigger dataset there.
>> This depends heavily on the data in your datasets and how they compare in
>> size, etc.
>>
>>
>>
>> On 25 May 2016, at 13:27, Priya Ch <le...@gmail.com> wrote:
>>
>> Why do I need to deploy Solr for text analytics? I have files placed in
>> HDFS. I just need to look for matches against each string in both files
>> and generate those records whose match is > 85%. We are trying fuzzy-match
>> logic.
>>
>> How can I use map/reduce operations across 2 RDDs?
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jo...@gmail.com>
>> wrote:
>>
>>>
>>> Alternatively, depending on the exact use case, you may employ Solr on
>>> Hadoop for text analytics.
>>>
>>> > On 25 May 2016, at 12:57, Priya Ch <le...@gmail.com>
>>> wrote:
>>> >
>>> > Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
>>> > strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
>>> > to check for matches in RDD B; e.g., for the string "hi" I have to check
>>> > against all strings in RDD B, which means I need to generate every
>>> > possible combination, right?
>>>
>>
>>
>

Re: Cartesian join on RDDs taking too much time

Posted by Priya Ch <le...@gmail.com>.
Hi,
  RDD A is 30 MB and RDD B is 8 MB in size. Upon matching, we would
like to filter out the strings that have a greater than 85% match and
generate a score for each, which is used in the subsequent calculations.

I tried generating a pair RDD from both RDDs A and B with the same key for
all the records (see the sketch below). Now performing A.join(B) is also
resulting in huge execution time.

How do I go about with map and reduce here? To generate pairs from 2 RDDs,
I don't think map can be used, because we cannot have an RDD inside another
RDD.
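
For reference, a hypothetical sketch of the single-key approach just
described, showing why it behaves no better than a cartesian product:

    // Giving every record the same constant key means every record of A
    // joins with every record of B: all |A| * |B| pairs are produced, and
    // they all collide on the one partition that owns that single key.
    val aOne = rddA.map(s => (0, s))
    val bOne = rddB.map(s => (0, s))
    val allPairs = aOne.join(bOne).values  // a cartesian product in disguise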

Would be glad if you can throw me some light on this.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jo...@gmail.com> wrote:

> Solr or Elasticsearch provide much more functionality and are faster in
> this context. The decision for or against them depends on your current and
> future use cases. Your current use case is still very abstract, so in
> order to get a proper recommendation you need to provide more details,
> including the size of the dataset and what you do with the result of the
> matching: do you just need the match count, or also the matching pairs,
> etc.?
>
> Your concrete problem can also be solved in Spark (though it is not the
> most efficient tool for this, it has other strengths) using map and
> reduce steps. There are different ways to implement this: generate pairs
> from the input datasets in the map step, or (maybe less advisable)
> broadcast the smaller dataset to all nodes and do the matching against
> the bigger dataset there.
> This depends heavily on the data in your datasets and how they compare in
> size, etc.
>
>
>
> On 25 May 2016, at 13:27, Priya Ch <le...@gmail.com> wrote:
>
> Why do I need to deploy Solr for text analytics? I have files placed in
> HDFS. I just need to look for matches against each string in both files
> and generate those records whose match is > 85%. We are trying fuzzy-match
> logic.
>
> How can I use map/reduce operations across 2 RDDs?
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>>
>> Alternatively, depending on the exact use case, you may employ Solr on
>> Hadoop for text analytics.
>>
>> > On 25 May 2016, at 12:57, Priya Ch <le...@gmail.com>
>> wrote:
>> >
>> > Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
>> > strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
>> > to check for matches in RDD B; e.g., for the string "hi" I have to check
>> > against all strings in RDD B, which means I need to generate every
>> > possible combination, right?
>>
>
>

Re: Cartesian join on RDDs taking too much time

Posted by Jörn Franke <jo...@gmail.com>.
Solr or Elasticsearch provide much more functionality and are faster in this context. The decision for or against them depends on your current and future use cases. Your current use case is still very abstract, so in order to get a proper recommendation you need to provide more details, including the size of the dataset and what you do with the result of the matching: do you just need the match count, or also the matching pairs, etc.?

Your concrete problem can also be solved in Spark (though it is not the most efficient tool for this, it has other strengths) using map and reduce steps. There are different ways to implement this: generate pairs from the input datasets in the map step, or (maybe less advisable) broadcast the smaller dataset to all nodes and do the matching against the bigger dataset there; a sketch of the broadcast variant follows below.
This depends heavily on the data in your datasets and how they compare in size, etc.
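
A minimal Scala sketch of the broadcast variant, for illustration: it
assumes RDD B is small enough to collect to the driver, and it reuses the
hypothetical similarity() scorer sketched earlier in the thread:

    // Ship the small dataset once to every executor; the big RDD is then
    // scanned locally, partition by partition, with no shuffle at all.
    val smallB = sc.broadcast(rddB.collect())
    val matches = rddA.flatMap { a =>
      smallB.value.iterator
        .map(b => (a, b, similarity(a, b)))
        .filter { case (_, _, score) => score > 0.85 }
    }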



> On 25 May 2016, at 13:27, Priya Ch <le...@gmail.com> wrote:
> 
> Why do I need to deploy Solr for text analytics? I have files placed in HDFS. I just need to look for matches against each string in both files and generate those records whose match is > 85%. We are trying fuzzy-match logic.
> 
> How can I use map/reduce operations across 2 RDDs?
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> Alternatively, depending on the exact use case, you may employ Solr on Hadoop for text analytics.
>> 
>> > On 25 May 2016, at 12:57, Priya Ch <le...@gmail.com> wrote:
>> >
>> > Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
>> > strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
>> > to check for matches in RDD B; e.g., for the string "hi" I have to check
>> > against all strings in RDD B, which means I need to generate every
>> > possible combination, right?
> 

Re: Cartesian join on RDDs taking too much time

Posted by Priya Ch <le...@gmail.com>.
Why do I need to deploy Solr for text analytics? I have files placed in
HDFS. I just need to look for matches against each string in both files and
generate those records whose match is > 85%. We are trying fuzzy-match
logic.

How can I use map/reduce operations across 2 RDDs?

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jo...@gmail.com> wrote:

>
> Alternatively, depending on the exact use case, you may employ Solr on
> Hadoop for text analytics.
>
> > On 25 May 2016, at 12:57, Priya Ch <le...@gmail.com> wrote:
> >
> > Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
> > strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
> > to check for matches in RDD B; e.g., for the string "hi" I have to check
> > against all strings in RDD B, which means I need to generate every
> > possible combination, right?
>

Re: Cartesian join on RDDs taking too much time

Posted by Jörn Franke <jo...@gmail.com>.
Alternatively, depending on the exact use case, you may employ Solr on Hadoop for text analytics.

> On 25 May 2016, at 12:57, Priya Ch <le...@gmail.com> wrote:
> 
> Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
> strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
> to check for matches in RDD B; e.g., for the string "hi" I have to check
> against all strings in RDD B, which means I need to generate every
> possible combination, right?



Re: Cartesian join on RDDs taking too much time

Posted by Priya Ch <le...@gmail.com>.
Let's say I have RDD A of strings {"hi","bye","ch"} and another RDD B of
strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
to check for matches in RDD B; e.g., for the string "hi" I have to check
against all strings in RDD B, which means I need to generate every possible
combination, right? Hence I am generating the cartesian product and then
using a map transformation on the cartesian RDD to check the matches found,
roughly as sketched below.
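
For reference, a hypothetical sketch of this cartesian-plus-map approach
(with the illustrative similarity() scorer from earlier in the thread):

    // The full cross product materializes and shuffles |A| * |B| pairs
    // before the filter can discard anything -- this is the slow part.
    val scored = rddA.cartesian(rddB)
      .map { case (a, b) => (a, b, similarity(a, b)) }
      .filter { case (_, _, score) => score > 0.85 }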

Is there any better way I could do this other than performing a cartesian?
Till now the application has taken 30 minutes, and on top of that I see
executor-lost issues.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:22 PM, Jörn Franke <jo...@gmail.com> wrote:

> What is the use case of this? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does your application take
> now?
>
> On 25 May 2016, at 12:42, Priya Ch <le...@gmail.com> wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
> this is taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
>> parquet, orc, ...?
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <le...@gmail.com>
>> wrote:
>>
>>> Hi, yes, I have joined using a DataFrame join. Now, to save this into
>>> HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and
>>> using saveAsTextFile to save it. However, this is also taking too much
>>> time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin.m.s@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
>>>> because the latter always needs shuffle operations, which have a lot of
>>>> overhead (reflection, serialization, ...).
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>>> broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
>>>>
>>>> // maropu
>>>>
>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> It is basically a Cartesian join, like in an RDBMS.
>>>>>
>>>>> Example:
>>>>>
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>>
>>>>> The result of this query matches every row in the FinancialCodes
>>>>> table with every row in the FinancialData table.  Each row consists
>>>>> of all columns from the FinancialCodes table followed by all columns from
>>>>> the FinancialData table.
>>>>>
>>>>>
>>>>> Not very useful
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
>>>>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>>>>>> cartesian operation?
>>>>>>
>>>>>> I am using Spark version 1.6.0.
>>>>>>
>>>>>> Regards,
>>>>>> Padma Ch
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>

Re: Cartesian join on RDDs taking too much time

Posted by Jörn Franke <jo...@gmail.com>.
What is the use case of this? A Cartesian product is by definition slow in any system. Why do you need this? How long does your application take now?

> On 25 May 2016, at 12:42, Priya Ch <le...@gmail.com> wrote:
> 
> I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time.
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as parquet, orc, ...?
>> 
>> // maropu
>> 
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <le...@gmail.com> wrote:
>>> Hi, yes, I have joined using a DataFrame join. Now, to save this into HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and using saveAsTextFile to save it. However, this is also taking too much time.
>>> 
>>> Thanks,
>>> Padma Ch
>>> 
>>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>>>> Hi, 
>>>> 
>>>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
>>>> because the latter always needs shuffle operations, which have a lot of overhead (reflection, serialization, ...).
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
>>>> 
>>>> // maropu
>>>> 
>>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>> It is basically a Cartesian join, like in an RDBMS.
>>>>> 
>>>>> Example:
>>>>> 
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>> 
>>>>> The result of this query matches every row in the FinancialCodes table with every row in the FinancialData table. Each row consists of all columns from the FinancialCodes table followed by all columns from the FinancialData table.
>>>>> 
>>>>> Not very useful 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size. A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation?
>>>>>> 
>>>>>> I am using Spark version 1.6.0.
>>>>>> 
>>>>>> Regards,
>>>>>> Padma Ch
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Takeshi Yamamuro
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 

Re: Cartesian join on RDDs taking too much time

Posted by Priya Ch <le...@gmail.com>.
I tried
dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
this is taking too much time.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <le...@gmail.com>
> wrote:
>
>> Hi, yes, I have joined using a DataFrame join. Now, to save this into
>> HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and
>> using saveAsTextFile to save it. However, this is also taking too much
>> time.
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <li...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
>>> because the latter always needs shuffle operations, which have a lot of
>>> overhead (reflection, serialization, ...).
>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>> broadcast strategy.
>>> This is a little more efficient than RDD.cartesian.
>>>
>>> // maropu
>>>
>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> It is basically a Cartesian join, like in an RDBMS.
>>>>
>>>> Example:
>>>>
>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>
>>>> The result of this query matches every row in the FinancialCodes table
>>>> with every row in the FinancialData table.  Each row consists of all
>>>> columns from the FinancialCodes table followed by all columns from the
>>>> FinancialData table.
>>>>
>>>>
>>>> Not very useful
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
>>>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>>>>> cartesian operation?
>>>>>
>>>>> I am using Spark version 1.6.0.
>>>>>
>>>>> Regards,
>>>>> Padma Ch
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: Cartesian join on RDDs taking too much time

Posted by Takeshi Yamamuro <li...@gmail.com>.
Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
Parquet, ORC, ...? A minimal sketch follows below.

// maropu
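
For illustration, a minimal Spark 1.6 sketch of this suggestion; the
DataFrame name and output path are hypothetical:

    // Write the joined DataFrame directly as Parquet instead of
    // converting it to an RDD of strings first.
    joined.write.parquet("/hdfs_path/joined.parquet")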

On Wed, May 25, 2016 at 7:10 PM, Priya Ch <le...@gmail.com>
wrote:

> Hi, yes, I have joined using a DataFrame join. Now, to save this into
> HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and
> using saveAsTextFile to save it. However, this is also taking too much
> time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
>> because the latter always needs shuffle operations, which have a lot of
>> overhead (reflection, serialization, ...).
>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>> broadcast strategy.
>> This is a little more efficient than RDD.cartesian.
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> It is basically a Cartesian join, like in an RDBMS.
>>>
>>> Example:
>>>
>>> SELECT * FROM FinancialCodes,  FinancialData
>>>
>>> The result of this query matches every row in the FinancialCodes table
>>> with every row in the FinancialData table.  Each row consists of all
>>> columns from the FinancialCodes table followed by all columns from the
>>> FinancialData table.
>>>
>>>
>>> Not very useful
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
>>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>>>> cartesian operation?
>>>>
>>>> I am using Spark version 1.6.0.
>>>>
>>>> Regards,
>>>> Padma Ch
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro

Re: Cartesian join on RDDs taking too much time

Posted by Priya Ch <le...@gmail.com>.
Hi, yes, I have joined using a DataFrame join. Now, to save this into HDFS,
I am converting the joined DataFrame to an RDD (dataframe.rdd) and using
saveAsTextFile to save it. However, this is also taking too much time; a
sketch of the pattern appears below.
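
For reference, a hypothetical sketch of that save path; the names are
illustrative:

    // Converting to RDD[Row] and writing text throws away the DataFrame's
    // optimized representation and re-serializes every row as a string.
    joined.rdd.map(_.mkString(",")).saveAsTextFile("/hdfs_path/joined_text")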

Thanks,
Padma Ch

On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
> because the latter always needs shuffle operations, which have a lot of
> overhead (reflection, serialization, ...).
> In your case, since the smaller table is 7 MB, DataFrame#join uses a
> broadcast strategy.
> This is a little more efficient than RDD.cartesian.
>
> // maropu
>
> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> It is basically a Cartesian join, like in an RDBMS.
>>
>> Example:
>>
>> SELECT * FROM FinancialCodes,  FinancialData
>>
>> The result of this query matches every row in the FinancialCodes table
>> with every row in the FinancialData table.  Each row consists of all
>> columns from the FinancialCodes table followed by all columns from the
>> FinancialData table.
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>>> cartesian operation?
>>>
>>> I am using Spark version 1.6.0.
>>>
>>> Regards,
>>> Padma Ch
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: Cartesian join on RDDs taking too much time

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

Seems you'd be better off using DataFrame#join instead of RDD.cartesian,
because the latter always needs shuffle operations, which have a lot of
overhead (reflection, serialization, ...).
In your case, since the smaller table is 7 MB, DataFrame#join uses a
broadcast strategy; a sketch follows below.
This is a little more efficient than RDD.cartesian.

// maropu
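
A minimal Spark 1.6 sketch of this, for illustration; the DataFrame names
and join key are hypothetical. With the default
spark.sql.autoBroadcastJoinThreshold of 10 MB, a table under that size is
broadcast automatically, and the broadcast() hint requests it explicitly:

    import org.apache.spark.sql.functions.broadcast

    // Hint the planner to broadcast the small side; each executor then
    // joins its partitions of the big side locally, without a shuffle.
    val joined = dfA.join(broadcast(dfB), dfA("key") === dfB("key"))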

On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> It is basically a Cartesian join, like in an RDBMS.
>
> Example:
>
> SELECT * FROM FinancialCodes,  FinancialData
>
> The result of this query matches every row in the FinancialCodes table
> with every row in the FinancialData table.  Each row consists of all
> columns from the FinancialCodes table followed by all columns from the
> FinancialData table.
>
>
> Not very useful
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:
>
>> Hi All,
>>
>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>> cartesian operation?
>>
>> I am using Spark version 1.6.0.
>>
>> Regards,
>> Padma Ch
>>
>
>


-- 
---
Takeshi Yamamuro

Re: Cartesian join on RDDs taking too much time

Posted by Mich Talebzadeh <mi...@gmail.com>.
It is basically a Cartesian join, like in an RDBMS.

Example:

SELECT * FROM FinancialCodes,  FinancialData

The result of this query matches every row in the FinancialCodes table
with every row in the FinancialData table. Each row consists of all
columns from the FinancialCodes table followed by all columns from the
FinancialData table.


Not very useful


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 25 May 2016 at 08:05, Priya Ch <le...@gmail.com> wrote:

> Hi All,
>
>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB in size.
> A.cartesian(B) is taking too much time. Is there any bottleneck in the
> cartesian operation?
>
> I am using Spark version 1.6.0.
>
> Regards,
> Padma Ch
>