Posted to reviews@spark.apache.org by guoxiaolongzte <gi...@git.apache.org> on 2018/04/11 02:51:11 UTC

[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

GitHub user guoxiaolongzte opened a pull request:

    https://github.com/apache/spark/pull/21036

    [SPARK-23958][CORE] HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.

    ## What changes were proposed in this pull request?
    
    HadoopRdd filters out empty files to avoid generating empty tasks that degrade Spark computing performance.
    
    An empty file has a length of zero.
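
    As a rough sketch of the idea (simplified and hypothetical, not the exact diff), the filtering in `HadoopRDD.getPartitions` could look like:

    ```scala
    // Hypothetical sketch: keep only splits that contain data, so that
    // no tasks are scheduled for zero-length (empty) files.
    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    val inputSplits = allInputSplits.filter(_.getLength > 0) // drop empty splits
    val array = new Array[Partition](inputSplits.length)
    for (i <- inputSplits.indices) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
    ```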
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guoxiaolongzte/spark SPARK-23958

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21036.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21036
    
----
commit e4ccdf913157b45f11efe8b8900d1f805d853278
Author: guoxiaolong <gu...@...>
Date:   2018-04-11T02:48:51Z

    [SPARK-23958][CORE] HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.

----


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Yes, this is already supported in Spark, so this PR seems unnecessary.


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21036#discussion_r180650706
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -86,8 +88,7 @@ private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: Inp
      * @param keyClass Class of the key associated with the inputFormatClass.
      * @param valueClass Class of the value associated with the inputFormatClass.
      * @param minPartitions Minimum number of HadoopRDD partitions (Hadoop Splits) to generate.
    - *
    - * @note Instantiating this class directly is not recommended, please use
    +  * @note Instantiating this class directly is not recommended, please use
    --- End diff --
    
    ditto.


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21036#discussion_r180655799
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -55,7 +56,8 @@ private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: Inp
     
       /**
        * Get any environment variables that should be added to the users environment when running pipes
    -   * @return a Map with the environment variables and corresponding values, it could be empty
    +    *
    +    * @return a Map with the environment variables and corresponding values, it could be empty
    --- End diff --
    
    Thanks.


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21036#discussion_r181963024
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -323,7 +323,7 @@ package object config {
           .internal()
           .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
           .booleanConf
    -      .createWithDefault(false)
    +      .createWithDefault(true)
    --- End diff --
    
    This silently changes the default behavior; I would suggest not doing it.


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Thank you for your comments; I will close this PR.


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21036#discussion_r180652894
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -55,7 +56,8 @@ private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: Inp
     
       /**
        * Get any environment variables that should be added to the users environment when running pipes
    -   * @return a Map with the environment variables and corresponding values, it could be empty
    +    *
    +    * @return a Map with the environment variables and corresponding values, it could be empty
    --- End diff --
    
    What do you mean?


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21036#discussion_r180650637
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -55,7 +56,8 @@ private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: Inp
     
       /**
        * Get any environment variables that should be added to the users environment when running pipes
    -   * @return a Map with the environment variables and corresponding values, it could be empty
    +    *
    +    * @return a Map with the environment variables and corresponding values, it could be empty
    --- End diff --
    
    nit: comment style


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    1. There is no need to loop over the splits twice; a single filter can check whether the length is greater than 0 (see the sketch below).
    2. This feature is meant to improve performance, so the switch should be enabled by default.
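
    For illustration, inside `HadoopRDD` the two styles under discussion might look roughly like this (a hypothetical sketch; the exact code is not quoted in this thread):

    ```scala
    // One traversal: filter and index the splits in a single iterator pipeline.
    val onePass = allInputSplits.iterator
      .filter(_.getLength > 0)
      .zipWithIndex
      .map { case (split, i) => new HadoopPartition(id, i, split) }
      .toArray

    // Two traversals: filter first, then index. One extra pass, arguably clearer.
    val nonEmpty = allInputSplits.filter(_.getLength > 0)
    val twoPass = nonEmpty.zipWithIndex.map { case (split, i) =>
      new HadoopPartition(id, i, split)
    }
    ```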


---



[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte closed the pull request at:

    https://github.com/apache/spark/pull/21036


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    @guoxiaolongzte Have you tried the config `spark.hadoopRDD.ignoreEmptySplits` ?
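
    For reference, enabling that existing flag from an application could look like this (a minimal sketch; the app name and input path are illustrative):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ignore-empty-splits-demo")            // hypothetical app name
      .set("spark.hadoopRDD.ignoreEmptySplits", "true")  // opt in explicitly
    val sc = new SparkContext(conf)

    // With the flag enabled, zero-length splits produce no partitions/tasks.
    val rdd = sc.textFile("hdfs:///tmp/dir-with-empty-files") // hypothetical path
    println(rdd.getNumPartitions)
    ```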


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Nope, it's a radical change that affects many integrations, so I wouldn't enable it by default for now. This is a non-critical path; it's fine to loop twice if it's more readable.


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...

Posted by guoxiaolongzte <gi...@git.apache.org>.
Github user guoxiaolongzte commented on the issue:

    https://github.com/apache/spark/pull/21036
  
    Thanks, I will try to add test cases. @felixcheung 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org