Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/01 20:51:58 UTC

[jira] [Updated] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

     [ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16196:
--------------------------------
    Target Version/s: 2.2.0  (was: 2.1.0)

> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
>                 Key: SPARK-16196
>                 URL: https://issues.apache.org/jira/browse/SPARK-16196
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the existing in-memory scan implementation:
> {code}
> val N = 20 << 20  // hypothetical row count; N is left unspecified in the original
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 10000) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> Caching is slow for several reasons. The biggest is that compression takes a long time. The second is the large number of virtual function calls on this hot code path, since the rows are processed through iterators. Further, the rows are converted to and from ByteBuffers, which are generally slow to read.
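> A minimal, self-contained sketch (hypothetical names, not Spark's actual internals) contrasting the two access patterns described above: the iterator-based scan pays virtual calls once per row, while a columnar scan pays one call per batch and then runs a tight loop over a primitive array:
> {code}
> object ScanPatterns {
>   // Iterator-based: hasNext()/next() are virtual calls paid once per row.
>   def sumViaIterator(rows: Iterator[Array[Long]], col: Int): Long = {
>     var sum = 0L
>     while (rows.hasNext) {
>       sum += rows.next()(col)
>     }
>     sum
>   }
>
>   // Columnar: one call per batch, then a tight loop over a primitive array
>   // that the JIT can optimize (and potentially vectorize).
>   def sumViaColumn(batches: Iterator[Array[Long]]): Long = {
>     var sum = 0L
>     while (batches.hasNext) {
>       val col = batches.next()
>       var i = 0
>       while (i < col.length) { sum += col(i); i += 1 }
>     }
>     sum
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Two-column rows: (id, k), mirroring the benchmark above.
>     val rows = Array.tabulate(1 << 20)(i => Array(i.toLong, i % 10000L))
>     println(sumViaIterator(rows.iterator, 1))
>     println(sumViaColumn(Iterator(rows.map(_(1)))))
>   }
> }
> {code}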


