Posted to reviews@spark.apache.org by shahidki31 <gi...@git.apache.org> on 2018/09/20 19:09:33 UTC

[GitHub] spark pull request #22502: [SPARK-25474][SQL]Size in bytes of the query is c...

GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/22502

    [SPARK-25474][SQL]Size in bytes of the query is coming in EB in case of parquet datasource

    ## What changes were proposed in this pull request?
    For a datasource table backed by CatalogFileIndex, sizeInBytes always comes back as the default size in bytes, which is 8.0 EB. As a result, such a table always prefers SortMergeJoin over BroadcastJoin, even when its size is below the broadcast join threshold.
    In this PR, for a CatalogFileIndex table, if "fallBackToHdfsForStatsEnabled=true" is set, computeStatistics gets sizeInBytes from HDFS, so we get the actual size of the table. Hence, during a join, when the table size is below the broadcast threshold, the planner prefers BroadcastHashJoin instead of SortMergeJoin.
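
To make the "8.0 EB" figure concrete, here is a minimal, Spark-free sketch of the arithmetic and of the size check that decides whether a relation can be broadcast. It assumes the stock defaults (a default size of `Long.MaxValue` bytes and a 10 MB broadcast threshold); the object and method names are hypothetical and only mirror the planner's check in simplified form.

```scala
object DefaultSizeDemo {
  // Assumption: stock defaults for the default size estimate and the
  // broadcast join threshold; both are configurable in a real session.
  val defaultSizeInBytes: Long = Long.MaxValue
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  // Convert a byte count to binary exabytes (2^60 bytes).
  def toExabytes(bytes: Long): Double = bytes.toDouble / math.pow(2, 60)

  // Simplified stand-in for the planner's check: a relation is a broadcast
  // candidate only when its size estimate fits under the threshold.
  def canBroadcast(sizeInBytes: Long): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= autoBroadcastJoinThreshold

  def main(args: Array[String]): Unit = {
    println(toExabytes(defaultSizeInBytes))   // size reported for the table
    println(canBroadcast(defaultSizeInBytes)) // default estimate: no broadcast
    println(canBroadcast(4096L))              // a genuinely small table
  }
}
```

With the default estimate in place, even a table of a few kilobytes fails the check, which is why the join falls back to SortMergeJoin.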
    
    ## How was this patch tested?
    Added a unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark SPARK-25474

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22502
    
----
commit 79d0794d098ed15030dde3a7fea8b65952fa0d72
Author: Shahid <sh...@...>
Date:   2018-09-20T18:58:22Z

    [SPARK-25474][SQL]Size in bytes of the query is coming in EB in case of parquet datasource

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStatsEnable...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    Hi @cloud-fan, could you please review the code?


---


[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStats= true...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    @cloud-fan Thanks. I will check and update the PR.
    



---


[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStats= true...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    @shahidki31 thanks for fixing it!
    
    Do you know where we currently read `fallBackToHdfsForStats`? Could you see if we can have a unified place to do it?


---


[GitHub] spark issue #22502: [SPARK-25474][SQL]Size in bytes of the query is coming i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStatsEnable...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    @dongjoon-hyun, thanks for the comment. I have modified the title.


---


[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStatsEnable...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    cc @cloud-fan 


---


[GitHub] spark pull request #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStat...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22502#discussion_r230734089
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -86,10 +89,28 @@ case class HadoopFsRelation(
       }
     
       override def sizeInBytes: Long = {
    --- End diff --
    
    Maybe you need to implement a rule similar to `DetermineTableStats` for the datasource table?


---


[GitHub] spark pull request #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStat...

Posted by shahidki31 <gi...@git.apache.org>.

Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22502#discussion_r230811926
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
    @@ -86,10 +89,28 @@ case class HadoopFsRelation(
       }
     
       override def sizeInBytes: Long = {
    --- End diff --
    
    Hi @wangyum, the issue here is that CatalogFileIndex always uses the default stats, which never get updated, even if the user enables 'fallBackToHdfsForStats'.
    So, in this fix, if the user enables 'fallBackToHdfsForStats', sizeInBytes is read from the file system rather than relying on the default table stats.
    Thanks
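
The behaviour described in this reply can be sketched roughly as follows. This is a Spark-free illustration, not the code in the diff: `TableInfo`, its fields, and the precedence given to a catalog-recorded size are assumptions made for the example, and the file sizes stand in for what a filesystem listing would return.

```scala
object SizeFallbackSketch {
  // Hypothetical stand-in for the relation's inputs; not Spark's HadoopFsRelation.
  final case class TableInfo(
      catalogSizeInBytes: Option[Long], // size recorded in the catalog, if any
      fileSizesOnDisk: Seq[Long])       // per-file sizes from a filesystem listing

  // Assumed precedence: catalog stats first, then the filesystem when the
  // fallback flag is enabled, and only then the (huge) default size.
  def sizeInBytes(
      table: TableInfo,
      fallBackToHdfsForStats: Boolean,
      defaultSizeInBytes: Long = Long.MaxValue): Long =
    table.catalogSizeInBytes.getOrElse {
      if (fallBackToHdfsForStats) table.fileSizesOnDisk.sum
      else defaultSizeInBytes
    }

  def main(args: Array[String]): Unit = {
    val noStats = TableInfo(catalogSizeInBytes = None, fileSizesOnDisk = Seq(100L, 200L))
    println(sizeInBytes(noStats, fallBackToHdfsForStats = true))  // sum of file sizes
    println(sizeInBytes(noStats, fallBackToHdfsForStats = false)) // default size
  }
}
```

Without the flag, a table that has no recorded stats keeps the default size; with it, the estimate reflects the files actually on disk, so a small table can drop below the broadcast threshold.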


---


[GitHub] spark issue #22502: [SPARK-25474][SQL]Size in bytes of the query is coming i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22502
  
    Can one of the admins verify this patch?


---
