Posted to issues@spark.apache.org by "zenglinxi (JIRA)" <ji...@apache.org> on 2018/07/18 04:10:00 UTC

[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

    [ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547337#comment-16547337 ] 

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:09 AM:
------------------------------------------------------------

[^Spark LongHashedRelation serialization.svg]

I think this is a hidden but critical bug that may cause data errors.

 
{code:scala}
// write() in HashedRelation.scala (LongToUnsafeRowMap)
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  // "used" is derived from cursor rather than from the page contents
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write() function in HashedRelation.scala is called when the executor does not have enough memory to keep the LongToUnsafeRowMap that holds the broadcast table's data. However, on the executor the value of cursor may still be unchanged from its initialization:
{code:scala}
// initial value of cursor in LongToUnsafeRowMap
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
so the value of "used" computed in write() is zero when the map is written to disk. When that on-disk data is later deserialized, the key pointers into the (now empty) page are wrong, and the broadcast join may silently return wrong data.
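
To make the arithmetic concrete, here is a minimal, self-contained sketch (plain Scala, no Spark dependency; LONG_ARRAY_OFFSET stands in for Platform.LONG_ARRAY_OFFSET, whose real value is JVM-specific) showing how an untouched cursor yields used = 0:
{code:scala}
// Sketch only, not Spark code: LONG_ARRAY_OFFSET is a placeholder for
// Platform.LONG_ARRAY_OFFSET (the actual value depends on the JVM).
object UsedArithmeticSketch {
  val LONG_ARRAY_OFFSET: Long = 16L

  // cursor exactly as it looks right after initialization,
  // i.e. before any row has been appended to the page:
  var cursor: Long = LONG_ARRAY_OFFSET

  def main(args: Array[String]): Unit = {
    // the same expression write() uses:
    val used = ((cursor - LONG_ARRAY_OFFSET) / 8).toInt
    // prints "used = 0": the page is serialized as empty, so the key
    // pointers read back during deserialization reference missing data
    println(s"used = $used")
  }
}
{code}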


was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think this is a hidden but critical bug that may cause data errors.

> Serializing LongHashedRelation in executor may result in data error
> -------------------------------------------------------------------
>
>                 Key: SPARK-24809
>                 URL: https://issues.apache.org/jira/browse/SPARK-24809
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>         Environment: Spark 2.2.1
> hadoop 2.7.1
>            Reporter: Lijia Liu
>            Priority: Critical
>         Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is long or int in a broadcast join, Spark uses LongHashedRelation as the broadcast value (details in SPARK-14419). But if the broadcast value is abnormally big, the executor will serialize it to disk, and data is lost during that serialization.
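
For context, a minimal sketch (illustrative only; the table names and sizes are assumptions, not the reporter's actual workload) of the kind of query that makes Spark build a LongHashedRelation for the broadcast side, since the join key is long-typed:
{code:scala}
// Illustrative only: a broadcast hash join on a long/int key builds a
// LongHashedRelation for the small side (see SPARK-14419).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LongHashedRelation-sketch").getOrCreate()

    val big   = spark.range(0L, 10000000L).toDF("id") // long-typed join key
    val small = spark.range(0L, 1000L).toDF("id")

    // broadcast() forces a broadcast hash join, so "small" is turned into
    // a LongHashedRelation keyed by the long column "id"
    val joined = big.join(broadcast(small), "id")
    println(joined.count())

    spark.stop()
  }
}
{code}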