Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2015/08/05 04:16:06 UTC

[jira] [Commented] (SPARK-9604) Unsafe ArrayData and MapData is very very slow

    [ https://issues.apache.org/jira/browse/SPARK-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654696#comment-14654696 ] 

Wenchen Fan commented on SPARK-9604:
------------------------------------

There is a known issue: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L156-L161

As a quick solution, I convert unsafe arrays/maps to safe ones, which is very inefficient. The reason is that arrays and maps may contain UnsafeRows, and we have to convert those to their safe versions because `toSeq` is currently not supported in UnsafeRow.

I'm working on changing `toSeq` to `toSeq(schema: StructType)` so that it can work on UnsafeRow. After that we can remove the unsafe->safe conversion and solve this issue.
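
To illustrate why `toSeq` needs the schema, here is a toy sketch in Python. It is not Spark's code or its actual memory layout: `BinaryRow`, `pack`, `to_seq`, `to_safe`, and the two field types are hypothetical names made up for this example. It only assumes that an unsafe row is a run of fixed-width slots with no per-field type tags.

```python
import struct

# Hypothetical field types; these stand in for Spark SQL's DataType objects.
INT, DOUBLE = "int", "double"
_FORMATS = {INT: "<q", DOUBLE: "<d"}  # assumption: every field fills one 8-byte slot


class BinaryRow:
    """Toy stand-in for an unsafe row: raw bytes, no per-field type tags."""

    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def pack(cls, schema, values):
        # Encode each value into its own fixed-width 8-byte slot.
        slots = (struct.pack(_FORMATS[t], v) for t, v in zip(schema, values))
        return cls(b"".join(slots))

    def to_seq(self, schema):
        # The bytes alone do not say whether a slot holds an int or a double,
        # so the caller has to supply the schema to decode them.
        return [
            struct.unpack_from(_FORMATS[t], self.data, 8 * i)[0]
            for i, t in enumerate(schema)
        ]


def to_safe(row, schema):
    # The eager "unsafe -> safe" copy: materialize every field up front.
    return tuple(row.to_seq(schema))
```

In this model, `BinaryRow.pack([INT, DOUBLE], [7, 2.5]).to_seq([INT, DOUBLE])` decodes the original values, while a schema-less `to_seq()` would have no way to interpret the slots. `to_safe` plays the role of the current workaround, which copies every nested row whether or not its fields are read.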

> Unsafe ArrayData and MapData is very very slow
> ----------------------------------------------
>
>                 Key: SPARK-9604
>                 URL: https://issues.apache.org/jira/browse/SPARK-9604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Wenchen Fan
>            Priority: Blocker
>
> After the unsafe ArrayData and MapData were merged in, this test became very slow (from less than 1 second to more than 35 seconds).
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3157/testReport/org.apache.spark.sql.columnar/InMemoryColumnarQuerySuite/test_different_data_types/history/
> I tried disabling the cache, but it's still very slow (almost the same). Once ArrayData and ArrayMap are removed, it becomes much faster (though it still takes about 10 seconds).
> Related changes: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3148/changes
> Also, the duration of the Hive tests increased from 32 min to 45 min:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3154/testReport/junit/org.apache.spark.sql.hive.execution/history/
> cc [~rxin]


