Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2016/12/10 14:43:58 UTC

[jira] [Resolved] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception

     [ https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-17460.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 16175
[https://github.com/apache/spark/pull/16175]

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17460
>                 URL: https://issues.apache.org/jira/browse/SPARK-17460
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>            Reporter: Chris Perluss
>             Fix For: 2.1.0
>
>
> Dataset.joinWith performs a BroadcastJoin on a table that is gigabytes in size because dataset.logicalPlan.statistics.sizeInBytes is negative.
> The root cause is that org.apache.spark.sql.types.ArrayType.defaultSize is an Int.  In my dataset there is an Array column whose estimated size exceeds what an Int can hold, so the size estimate overflows and becomes negative.
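> As an illustration of the wrap-around (the sizes below are made up for the example and are not Spark's actual defaultSize constants), Int arithmetic in a size estimate behaves like this in a REPL:
> // hypothetical per-element size estimate, kept as a 32-bit Int
> val elementSize: Int = 48
> val numElements: Int = 50000000
> // Int * Int wraps around instead of widening to Long
> val estimatedBytes: Int = elementSize * numElements      // -1894967296
> // widening to Long before multiplying gives the intended estimate
> val safeBytes: Long = elementSize.toLong * numElements   // 2400000000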
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 10000).map( i => (i, Seq((i, Seq((i, "This is really not that long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is gigabytes in size, which of course causes the executors to run out of memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because logicalPlan.statistics.sizeInBytes is a large negative number and is therefore still less than the join threshold of -1.
> I've been able to work around this issue by setting autoBroadcastJoinThreshold to a very large negative number.
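> A minimal sketch of that workaround in the spark-shell (assuming the default SparkSession named spark; the exact value just needs to be more negative than any overflowed size estimate):
> // push the broadcast threshold below any negative, overflowed sizeInBytes
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", Int.MinValue.toString)
> // equivalently at submit time: --conf spark.sql.autoBroadcastJoinThreshold=-2147483648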



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org