You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Maciej Bryński (JIRA)" <ji...@apache.org> on 2015/10/23 14:59:27 UTC
[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

     [ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Bryński updated SPARK-11282:
-----------------------------------
    Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

	spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
	Creating test tables...
	Joining tables...
	Joined table schema:
	root
	 |-- id: long (nullable = true)
	 |-- val: long (nullable = true)
	 |-- id2: long (nullable = true)
	 |-- val2: long (nullable = true)

	Selecting data for id = 5...
	[Row(id=5, val=5, id2=5, val2=5)]
	spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
	Creating test tables...
	Joining tables...
	Joined table schema:
	root
	 |-- id: long (nullable = true)
	 |-- val: long (nullable = true)
	 |-- id2: long (nullable = true)
	 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -------------------------------------
>
>                 Key: SPARK-11282
>                 URL: https://issues.apache.org/jira/browse/SPARK-11282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.1
>            Reporter: Maciej Bryński
>            Priority: Critical
>         Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> 	spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
> 	Creating test tables...
> 	Joining tables...
> 	Joined table schema:
> 	root
> 	 |-- id: long (nullable = true)
> 	 |-- val: long (nullable = true)
> 	 |-- id2: long (nullable = true)
> 	 |-- val2: long (nullable = true)
> 	Selecting data for id = 5...
> 	[Row(id=5, val=5, id2=5, val2=5)]
> 	spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
> 	Creating test tables...
> 	Joining tables...
> 	Joined table schema:
> 	root
> 	 |-- id: long (nullable = true)
> 	 |-- val: long (nullable = true)
> 	 |-- id2: long (nullable = true)
> 	 |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org