Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2016/04/01 22:29:11 UTC

Re: What influences the space complexity of Spark operations?

Blocking operators like Sort, Join, or Aggregate will put all of the data
for a whole partition into a hash table or array.  However, if you are
running Spark 1.5+, we should be spilling to disk.  In Spark 1.6, if you
are seeing OOMs for SQL operations you should report them as a bug.

On Thu, Mar 31, 2016 at 9:26 AM, Steve Johnston <sjohnston@algebraixdata.com
> wrote:

> *What we’ve observed*
>
> Increasing the number of partitions (and thus decreasing the partition
> size) seems to reliably help avoid OOM errors. To demonstrate this, we used
> a single executor and loaded a small table into a DataFrame, persisted it
> with MEMORY_AND_DISK, repartitioned it and joined it to itself. Varying the
> number of partitions identifies a threshold between completing the join and
> incurring an OOM error.
>
>
> # converter, schema, and partition_count are defined earlier (not shown)
> from pyspark import StorageLevel
>
> lineitem = sc.textFile('lineitem.tbl').map(converter)
> lineitem = sqlContext.createDataFrame(lineitem, schema)
> lineitem.persist(StorageLevel.MEMORY_AND_DISK)
> repartitioned = lineitem.repartition(partition_count)
> # no join condition, so this self-join is a Cartesian (cross) join
> joined = repartitioned.join(repartitioned)
> joined.show()
>
>
> *Questions*
>
> Generally, what influences the space complexity of Spark operations? Is it
> the case that a single partition of each operand’s data set + a single
> partition of the resulting data set all need to fit in memory at the same
> time? We can see where the transformations (for, say, joins) are implemented
> in the source code (for the example above, BroadcastNestedLoopJoin), but
> they seem to be based on virtualized iterators; where in the code is the
> partition data for the inputs and outputs actually materialized?
> ------------------------------
> View this message in context: What influences the space complexity of
> Spark operations?
> <http://apache-spark-developers-list.1001551.n3.nabble.com/What-influences-the-space-complexity-of-Spark-operations-tp16944.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
>
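[Editor's sketch, not part of the original thread.] On the iterator question above: pipelined transformations compose lazily, and memory is only committed where a blocking operator drains its input, such as the build side of a hash-based operator. In plain Python (not Spark internals), the same distinction looks like this:

```python
# Lazy pipeline: nothing is materialized while generators compose;
# each element flows through one at a time.
rows = range(1_000_000)
filtered = (r for r in rows if r % 2 == 0)          # 500,000 rows, lazily
keyed = ((r % 10, r) for r in filtered)             # still lazy

# A blocking operator (here a toy hash-aggregate build) is where a whole
# input's worth of state becomes resident at once: the dict below is the
# analogue of the hash table a Spark Aggregate or HashJoin builds.
build = {}
for key, _value in keyed:
    build[key] = build.get(key, 0) + 1

assert sum(build.values()) == 500_000   # all input rows were drained
assert build[0] == 100_000              # even keys 0,2,4,6,8 split evenly
```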

Re: What influences the space complexity of Spark operations?

Posted by Steve Johnston <sj...@algebraixdata.com>.
Submitted: SPARK-14389 - OOM during BroadcastNestedLoopJoin.
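[Editor's sketch, not from the thread.] A DataFrame self-join with no join condition, as in the quoted script, is a Cartesian product, which is why it is planned as BroadcastNestedLoopJoin: one side is held per task while the other is streamed, and every streamed row pairs with every held row. A plain-Python model of the row counts shows the quadratic blow-up:

```python
# Toy model of an unconditioned self-join (Cartesian product): the
# per-task output is the cross product of the held side with one
# streamed partition.

def task_output_rows(broadcast_rows: int, stream_partition_rows: int) -> int:
    # Every streamed row pairs with every broadcast row.
    return broadcast_rows * stream_partition_rows

# 1,000 broadcast rows against one 250-row streamed partition:
assert task_output_rows(1_000, 250) == 250_000

# Summed over all partitions the output is quadratic in the input size,
# regardless of how the streamed side is partitioned -- smaller partitions
# shrink each task's working set, not the total work.
n = 1_000
assert sum(task_output_rows(n, p) for p in [250, 250, 250, 250]) == n * n
```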



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-influences-the-space-complexity-of-Spark-operations-tp16944p17029.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org