Posted to issues@spark.apache.org by "Giambattista (JIRA)" <ji...@apache.org> on 2017/03/01 14:06:46 UTC

[jira] [Comment Edited] (SPARK-17931) taskScheduler has some unneeded serialization

    [ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890226#comment-15890226 ] 

Giambattista edited comment on SPARK-17931 at 3/1/17 2:06 PM:
--------------------------------------------------------------

I just wanted to report that after this change, Spark fails to execute long SQL statements (in my case, long INSERT INTO table statements).
The problem I was facing, DataOutputStream.writeUTF's 64KB limit on the encoded string, is well described in this article: https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get the statements working again with the change below.

{noformat}
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
{noformat}
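
For context, here is a minimal, self-contained sketch (mine, not part of the original report) of the limit the article describes: DataOutputStream.writeUTF stores the encoded length in an unsigned 16-bit field, so any string whose modified-UTF-8 encoding exceeds 65535 bytes is rejected with a UTFDataFormatException.

{noformat}
import java.io.{ByteArrayOutputStream, DataOutputStream, UTFDataFormatException}

object WriteUtfLimitDemo {
  def main(args: Array[String]): Unit = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    val longSql = "x" * 70000  // stands in for a very long INSERT INTO statement
    try {
      out.writeUTF(longSql)
    } catch {
      // writeUTF prefixes the string with an unsigned 16-bit length,
      // so encodings longer than 65535 bytes cannot be written.
      case e: UTFDataFormatException =>
        println(s"writeUTF rejected the string: ${e.getMessage}")
    }
  }
}
{noformat}

Note that the workaround above avoids the exception by truncating each property value to 65*1024/4 = 16640 characters (the divisor 4 conservatively bounds the bytes-per-character of the modified UTF-8 encoding), so any value longer than that is silently clipped.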




> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized.
> - The serialized Task is copied into a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While two layers of serialization are necessary, so that the JAR,
> file, and Property info can be deserialized prior to deserializing
> the Task object, the third layer is unnecessary (a result of
> SPARK-2521). We should eliminate one layer of serialization by
> moving the JARs, files, and Properties into the TaskDescription class.
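
As an illustration of the two-layer scheme described above, here is a hypothetical, simplified sketch (the names TaskDescriptionSketch, encode, decode, and writeLongString are mine, not Spark's): metadata and Properties are written with a DataOutputStream, followed by the opaque, already-serialized task bytes, so an executor can read the properties without deserializing the Task object first. Long values are written as an explicit 4-byte length plus raw UTF-8 bytes, so they are not subject to writeUTF's 64KB limit.

{noformat}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

object TaskDescriptionSketch {
  // Writes a string as a 4-byte length plus raw UTF-8 bytes,
  // avoiding writeUTF's 64KB limit on the encoded form.
  private def writeLongString(out: DataOutputStream, s: String): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  private def readLongString(in: DataInputStream): String = {
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }

  def encode(taskId: Long,
             properties: Map[String, String],
             taskBytes: Array[Byte]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeLong(taskId)
    out.writeInt(properties.size)
    properties.foreach { case (key, value) =>
      out.writeUTF(key)             // keys are short, so writeUTF is safe here
      writeLongString(out, value)   // values (e.g. SQL text) may exceed 64KB
    }
    // The already-serialized Task stays an opaque byte block; it is not
    // re-serialized, only appended after the metadata.
    out.writeInt(taskBytes.length)
    out.write(taskBytes)
    out.flush()
    buffer.toByteArray
  }

  def decode(data: Array[Byte]): (Long, Map[String, String], Array[Byte]) = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val taskId = in.readLong()
    val properties = (0 until in.readInt()).map { _ =>
      (in.readUTF(), readLongString(in))
    }.toMap
    val taskBytes = new Array[Byte](in.readInt())
    in.readFully(taskBytes)
    (taskId, properties, taskBytes)
  }
}
{noformat}

Writing an explicit length for each value avoids both the UTFDataFormatException and the silent truncation of the workaround above.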


