Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2018/02/01 04:07:00 UTC

[jira] [Resolved] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source

     [ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-23247.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20415
[https://github.com/apache/spark/pull/20415]

> combines Unsafe operations and statistics operations in Scan Data Source
> ------------------------------------------------------------------------
>
>                 Key: SPARK-23247
>                 URL: https://issues.apache.org/jira/browse/SPARK-23247
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>            Assignee: caoxuewen
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Currently, when scanning a data source in the execution plan, we first apply the unsafe projection to each row and then traverse the data a second time to count the rows. This second pass is unnecessary overhead. This PR combines the two operations, updating the row-count metric while performing the unsafe projection.
> *Before modified,*
>     val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>       val proj = UnsafeProjection.create(schema)
>       proj.initialize(index)
>       iter.map(proj)
>     }
>     val numOutputRows = longMetric("numOutputRows")
>     unsafeRow.map { r =>
>       numOutputRows += 1
>       r
>     }
> *After modified,*
>     val numOutputRows = longMetric("numOutputRows")
>     rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>       val proj = UnsafeProjection.create(schema)
>       proj.initialize(index)
>       iter.map( r => {
>         numOutputRows += 1
>         proj(r)
>       })
>     }
>  
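The change above can be illustrated outside of Spark with plain Scala iterators. The sketch below uses hypothetical names (`projectAndCount`, a mutable counter standing in for the `numOutputRows` SQLMetric) and is only a minimal model of the fused pattern, not the actual Spark code: the per-row transformation and the row count happen in one `Iterator.map` pass instead of two.

```scala
// Minimal sketch of fusing a per-row projection with row counting in a
// single iterator pass. Names here are illustrative, not Spark's.
object FusedCount {
  // Apply `proj` to each element while invoking `count` once per row.
  // Because Iterator.map is lazy, the counter advances only as rows
  // are actually consumed -- matching Spark's pull-based execution.
  def projectAndCount[A, B](iter: Iterator[A],
                            proj: A => B,
                            count: () => Unit): Iterator[B] =
    iter.map { r =>
      count() // update the metric while projecting, no second traversal
      proj(r)
    }

  def main(args: Array[String]): Unit = {
    var numOutputRows = 0L // stands in for longMetric("numOutputRows")
    val rows = Iterator(1, 2, 3)
    val projected = projectAndCount[Int, Int](rows, _ * 2, () => numOutputRows += 1)
    val out = projected.toList // forces the single pass
    println(out)           // List(2, 4, 6)
    println(numOutputRows) // 3
  }
}
```

Note that the laziness matters: before the `toList`, `numOutputRows` is still 0, which is why the original two-pass version paid for a full extra traversal just to count.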



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org