You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/01/25 16:50:46 UTC

[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/20397

    [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory in data source v2

    ## What changes were proposed in this pull request?
    
    Currently we have `ReadTask` in data source v2 reader, while in writer we have `DataWriterFactory`.
    To make the naming consistent and better, renaming `ReadTask` to `DataReaderFactory`.
    
    ## How was this patch tested?
    
    Unit test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20397.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20397
    
----
commit ce1196e2e600cf6382ce3e721a4189c6324756e5
Author: Wang Gengliang <lt...@...>
Date:   2018-01-25T15:54:47Z

    rename ReadTask to DataReaderFactory

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    cc @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/336/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164425827
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java ---
    @@ -22,21 +22,23 @@
     import org.apache.spark.annotation.InterfaceStability;
     
     /**
    - * A read task returned by {@link DataSourceV2Reader#createReadTasks()} and is responsible for
    - * creating the actual data reader. The relationship between {@link ReadTask} and {@link DataReader}
    + * A reader factory returned by {@link DataSourceV2Reader#createDataReaderFactories()} and is
    + * responsible for creating the actual data reader. The relationship between
    + * {@link DataReaderFactory} and {@link DataReader}
      * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
      *
    - * Note that, the read task will be serialized and sent to executors, then the data reader will be
    - * created on executors and do the actual reading. So {@link ReadTask} must be serializable and
    - * {@link DataReader} doesn't need to be.
    + * Note that, the reader factory will be serialized and sent to executors, then the data reader
    + * will be created on executors and do the actual reading. So {@link DataReaderFactory} must be
    + * serializable and {@link DataReader} doesn't need to be.
      */
     @InterfaceStability.Evolving
    -public interface ReadTask<T> extends Serializable {
    +public interface DataReaderFactory<T> extends Serializable {
     
       /**
    -   * The preferred locations where this read task can run faster, but Spark does not guarantee that
    -   * this task will always run on these locations. The implementations should make sure that it can
    -   * be run on any location. The location is a string representing the host name.
    +   * The preferred locations where this data reader returned by this reader factory can run faster,
    --- End diff --
    
    `this data reader` -> `the data reader`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86646/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86678 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86678/testReport)** for PR 20397 at commit [`45131ea`](https://github.com/apache/spark/commit/45131eacfcc84883fd2104ac4ec8eed7edb83761).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86763 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86763/testReport)** for PR 20397 at commit [`e570fca`](https://github.com/apache/spark/commit/e570fca9b16d320372869039305a12519e9b9b62).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86763 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86763/testReport)** for PR 20397 at commit [`e570fca`](https://github.com/apache/spark/commit/e570fca9b16d320372869039305a12519e9b9b62).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    My point is, `ReadTask` is more precise, but `DataReaderFactor` also works(Spark serialize and send it to executors, and ask it to create data reader, it's reasonable to call it a factory). If we keep the name unchanged, we are forcing users to look into this subtle difference, which they don't have to.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/237/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164348810
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java ---
    @@ -63,7 +63,7 @@
       StructType readSchema();
     
       /**
    -   * Returns a list of read tasks. Each task is responsible for outputting data for one RDD
    +   * Returns a list of reader factories. Each task is responsible for outputting data for one RDD
        * partition. That means the number of tasks returned here is same as the number of RDD
    --- End diff --
    
    `the number of factories ...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86644/testReport)** for PR 20397 at commit [`ce1196e`](https://github.com/apache/spark/commit/ce1196e2e600cf6382ce3e721a4189c6324756e5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    This is more confusing, not less. Look at @jiangxb1987's comment above: "We shall create only one DataReaderFactory, and have that create multiple data readers." It is not clear why the API requires a list of factories, instead of using just one. If this is renamed to factory, is it a requirement that the factory can create more than one data reader for the same task?
    
    To the point about serializing and sending to executors, "factory" doesn't imply that any more than "task" does. The fact that these are serialized needs to be clear in documentation.
    
    The read and write side behave differently. They do not need to mirror one another's naming when that makes names less precise. This isn't forcing users to look at a subtle difference. It is just breaking the (wrong) assumption that both read and write sides have the same behavior.
    
    @rxin, any opinion here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86763/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/288/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    I am careful to say this out .. but let me leave my +0 for https://github.com/apache/spark/pull/20397#issuecomment-361345426. One option might be a similar name with `DataWriterFactory` but including what `ReadTask` implies to note that the difference ... 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86646 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86646/testReport)** for PR 20397 at commit [`baa453a`](https://github.com/apache/spark/commit/baa453a6bb800b7136686a8b596542d8ae93ec1d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86644/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86770 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86770/testReport)** for PR 20397 at commit [`06739ca`](https://github.com/apache/spark/commit/06739ca2d5ef741dafb6a3024d72461a41657da6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86770 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86770/testReport)** for PR 20397 at commit [`06739ca`](https://github.com/apache/spark/commit/06739ca2d5ef741dafb6a3024d72461a41657da6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164347650
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    I would also mention this:
    ```
     data reader creation must be done at executor side
    ```
    so it makes it more clear why we shall have a list of `DataReaderFactory`s, but it's only my personal opinions, it's totally fine if we don't include the implementation details in the comment.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164347780
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java ---
    @@ -22,19 +22,19 @@
     import org.apache.spark.annotation.InterfaceStability;
     
     /**
    - * A read task returned by {@link DataSourceV2Reader#createReadTasks()} and is responsible for
    - * creating the actual data reader. The relationship between {@link ReadTask} and {@link DataReader}
    + * A reader factory returned by {@link DataSourceV2Reader#createDataReaderFactories()} and is responsible for
    + * creating the actual data reader. The relationship between {@link DataReaderFactory} and {@link DataReader}
      * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
      *
    - * Note that, the read task will be serialized and sent to executors, then the data reader will be
    - * created on executors and do the actual reading. So {@link ReadTask} must be serializable and
    + * Note that, the reader factory will be serialized and sent to executors, then the data reader will be
    + * created on executors and do the actual reading. So {@link DataReaderFactory} must be serializable and
      * {@link DataReader} doesn't need to be.
      */
     @InterfaceStability.Evolving
    -public interface ReadTask<T> extends Serializable {
    +public interface DataReaderFactory<T> extends Serializable {
     
       /**
    -   * The preferred locations where this read task can run faster, but Spark does not guarantee that
    +   * The preferred locations where this data reader factory can run faster, but Spark does not guarantee that
    --- End diff --
    
    `... where the data reader returned by this reader factory can run faster ...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86708/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164425194
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    We mentioned it in the classdoc of `DataReaderFactory`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164425992
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java ---
    @@ -22,21 +22,23 @@
     import org.apache.spark.annotation.InterfaceStability;
     
     /**
    - * A read task returned by {@link DataSourceV2Reader#createReadTasks()} and is responsible for
    - * creating the actual data reader. The relationship between {@link ReadTask} and {@link DataReader}
    + * A reader factory returned by {@link DataSourceV2Reader#createDataReaderFactories()} and is
    + * responsible for creating the actual data reader. The relationship between
    + * {@link DataReaderFactory} and {@link DataReader}
      * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
      *
    - * Note that, the read task will be serialized and sent to executors, then the data reader will be
    - * created on executors and do the actual reading. So {@link ReadTask} must be serializable and
    - * {@link DataReader} doesn't need to be.
    + * Note that, the reader factory will be serialized and sent to executors, then the data reader
    + * will be created on executors and do the actual reading. So {@link DataReaderFactory} must be
    + * serializable and {@link DataReader} doesn't need to be.
      */
     @InterfaceStability.Evolving
    -public interface ReadTask<T> extends Serializable {
    +public interface DataReaderFactory<T> extends Serializable {
     
       /**
    -   * The preferred locations where this read task can run faster, but Spark does not guarantee that
    -   * this task will always run on these locations. The implementations should make sure that it can
    -   * be run on any location. The location is a string representing the host name.
    +   * The preferred locations where this data reader returned by this reader factory can run faster,
    +   * but Spark does not guarantee that this task will always run on these locations.
    --- End diff --
    
    `not guarantee to always run the data reader on these locations.` 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86772/testReport)** for PR 20397 at commit [`5d6c4fe`](https://github.com/apache/spark/commit/5d6c4fe7bd4b6ffaa9dbf6a0d6eeed169e76228c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    > I think the renaming is worth to remove future confusions.
    
    What future confusion?
    
    I understand that the difference isn't obvious, but making the names less accurate isn't a good fix. The read and write sides don't have to mirror one another. They behave differently and that's okay. Names should be based on what the classes actually do.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164349108
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Distribution.java ---
    @@ -21,9 +21,9 @@
     
     /**
      * An interface to represent data distribution requirement, which specifies how the records should
    - * be distributed among the {@link ReadTask}s that are returned by
    - * {@link DataSourceV2Reader#createReadTasks()}. Note that this interface has nothing to do with
    - * the data ordering inside one partition(the output records of a single {@link ReadTask}).
    + * be distributed among the {@link DataReaderFactory}s that are returned by
    + * {@link DataSourceV2Reader#createDataReaderFactories()}. Note that this interface has nothing to do with
    + * the data ordering inside one partition(the output records of a single {@link DataReaderFactory}).
    --- End diff --
    
    a single DataReader


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/339/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    what do streaming guys think? cc @tdas @jose-torres 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    About your last point, it's mostly my fault that I didn't schedule the work well and missed this one. Since the last RC failed and next RC is not started yet, I think this is a good window to get this change, and hopefully it does not bring too much trouble to the release process.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/331/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164335994
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    `DataReaderFactory` is responsible to do serialization and initialize the actual data readers, so data reader creation must be done at executor side, and before that we need to determine how many RDD partitions we want, which is this method doing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/298/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    The previous commit passed all test, and the last commit just changed some comment and has nothing to do with the failed test, I'm merging it to master/2.3, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/262/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    About the renaming, a lot of people complained to me about why the namings are not consistent, including @rxin . I named it `ReadTask` at the beginning because it really works like a task. But I believe after 2.3 more and more people will complain about the naming inconsistency because the difference between `ReadTask` and `DataWriterFactory` is too subtle: both of them are responsible for serializing information and initializing the actual reader/writer at executor side. The only difference is, we only get one `DataWriterFactor`, serialize and send it to all partitions, which means we implicitly "copy" the writer factory to all partitions. While for `ReakTask`, we get many of them, and send each one to its corresponding partition, which means there is no "copy". I think the renaming is worth to remove future confusions.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164248501
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    We shall create only one `DataReaderFactory`, and have that create multiple data readers.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/235/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164348145
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java ---
    @@ -63,7 +63,7 @@
       StructType readSchema();
     
       /**
    -   * Returns a list of read tasks. Each task is responsible for outputting data for one RDD
    +   * Returns a list of reader factories. Each task is responsible for outputting data for one RDD
    --- End diff --
    
    `Each factory is ...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r163921087
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala ---
    @@ -22,24 +22,24 @@ import scala.reflect.ClassTag
     
     import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
     import org.apache.spark.rdd.RDD
    -import org.apache.spark.sql.sources.v2.reader.ReadTask
    +import org.apache.spark.sql.sources.v2.reader.DataReaderFactory
     
    -class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])
    +class DataSourceRDDPartition[T : ClassTag](val index: Int, val readerFactory: DataReaderFactory[T])
       extends Partition with Serializable
     
     class DataSourceRDD[T: ClassTag](
         sc: SparkContext,
    -    @transient private val readTasks: java.util.List[ReadTask[T]])
    +    @transient private val readerFactories: java.util.List[DataReaderFactory[T]])
       extends RDD[T](sc, Nil) {
     
       override protected def getPartitions: Array[Partition] = {
    -    readTasks.asScala.zipWithIndex.map {
    +    readerFactories.asScala.zipWithIndex.map {
           case (readTask, index) => new DataSourceRDDPartition(index, readTask)
    --- End diff --
    
    readTask -> readerFactory


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86772/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    This is replacing the original name so LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86708/testReport)** for PR 20397 at commit [`ce98e09`](https://github.com/apache/spark/commit/ce98e09f72a18776a6fa4c659ea3fa3c6a94801b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86754/testReport)** for PR 20397 at commit [`b7fbd9d`](https://github.com/apache/spark/commit/b7fbd9d54c6d1725905700a9f8cfbab4f550be73).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86772/testReport)** for PR 20397 at commit [`5d6c4fe`](https://github.com/apache/spark/commit/5d6c4fe7bd4b6ffaa9dbf6a0d6eeed169e76228c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164349078
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Distribution.java ---
    @@ -21,9 +21,9 @@
     
     /**
      * An interface to represent data distribution requirement, which specifies how the records should
    - * be distributed among the {@link ReadTask}s that are returned by
    - * {@link DataSourceV2Reader#createReadTasks()}. Note that this interface has nothing to do with
    - * the data ordering inside one partition(the output records of a single {@link ReadTask}).
    + * be distributed among the {@link DataReaderFactory}s that are returned by
    --- End diff --
    
    `distributed among the data partition(one DataReader outputs data for one partition).`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20397


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86754/testReport)** for PR 20397 at commit [`b7fbd9d`](https://github.com/apache/spark/commit/b7fbd9d54c6d1725905700a9f8cfbab4f550be73).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86754/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    @cloud-fan, thanks for pinging me on this.
    
    -1: I don't think there's a compelling benefit to justify this change, and I think it makes the API more confusing. I think we should revert this.
    
    This class doesn't actually behave as a factory and is used more like an Iterable: it is only used to instantiate one DataReader and carries no explicit guarantee that it can be reused. In addition, the piece of work that each one represents is a task, which becomes an actual task when the stage runs. I would much rather keep the ReadTask name to make that connection clear.
    
    The write side does behave like a factory, so the name is appropriate there. There is little value to uniform names if the names actually make the API more confusing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    I thought the idea behind having a different name was to differentiate from
    the writer factory due to the difference in semantics - ReadTask will
    create only 1 DataReader in a task, whereas a Writer factory can create
    many DataWriters in a task.
    
    On Jan 25, 2018 10:09 PM, "UCB AMPLab" <no...@github.com> wrote:
    
    > Test PASSed.
    > Refer to this link for build results (access rights to CI server needed):
    > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86678/
    > Test PASSed.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/20397#issuecomment-360693080>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AAoerKtRwn6I4_Kny4-L_48tQ2jveqN5ks5tOWwBgaJpZM4RtJ0P>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86708/testReport)** for PR 20397 at commit [`ce98e09`](https://github.com/apache/spark/commit/ce98e09f72a18776a6fa4c659ea3fa3c6a94801b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164347542
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ClusteredDistribution.java ---
    @@ -22,7 +22,7 @@
     /**
      * A concrete implementation of {@link Distribution}. Represents a distribution where records that
      * share the same values for the {@link #clusteredColumns} will be produced by the same
    - * {@link ReadTask}.
    + * {@link DataReaderFactory}.
    --- End diff --
    
    actually `DataReader` is more precise here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/323/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164348008
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java ---
    @@ -50,7 +50,7 @@
       }
     
       /**
    -   * Returns a data reader to do the actual reading work for this read task.
    +   * Returns a data reader to do the actual reading work for this data reader factory.
    --- End diff --
    
    we can follow the `DataWriterFactory` and just say `... to do the actual reading work`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r167393259
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala ---
    @@ -139,15 +139,15 @@ class RateStreamMicroBatchReader(options: DataSourceV2Options)
             outTimeMs += msPerPartitionBetweenRows
           }
     
    -      RateStreamBatchTask(packedRows).asInstanceOf[ReadTask[Row]]
    +      RateStreamBatchTask(packedRows).asInstanceOf[DataReaderFactory[Row]]
         }.toList.asJava
       }
     
       override def commit(end: Offset): Unit = {}
       override def stop(): Unit = {}
     }
     
    -case class RateStreamBatchTask(vals: Seq[(Long, Long)]) extends ReadTask[Row] {
    +case class RateStreamBatchTask(vals: Seq[(Long, Long)]) extends DataReaderFactory[Row] {
    --- End diff --
    
    and should we rename `RateStreamBatchTask` too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86644/testReport)** for PR 20397 at commit [`ce1196e`](https://github.com/apache/spark/commit/ce1196e2e600cf6382ce3e721a4189c6324756e5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    One last point: should significant changes to public APIs like this go in just before or just after a release? 2.3.0 candidates have used ReadTask up to now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    @cloud-fan, can we revert this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r167392663
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceV2Suite.scala ---
    @@ -78,7 +78,7 @@ class RateSourceV2Suite extends StreamTest {
         val reader = new RateStreamMicroBatchReader(
           new DataSourceV2Options(Map("numPartitions" -> "11", "rowsPerSecond" -> "33").asJava))
         reader.setOffsetRange(Optional.empty(), Optional.empty())
    -    val tasks = reader.createReadTasks()
    +    val tasks = reader.createDataReaderFactories()
    --- End diff --
    
    nit: seems we should rename the variable too ..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164345010
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    I think already did, in `DataSourceV2Reader`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86721/testReport)** for PR 20397 at commit [`ce98e09`](https://github.com/apache/spark/commit/ce98e09f72a18776a6fa4c659ea3fa3c6a94801b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    cc @tdas This PR might resolve your original comment about the inconsistency between v2.reader and v2.writer


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r163920956
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDDIter.scala ---
    @@ -39,14 +39,14 @@ import org.apache.spark.util.{SystemClock, ThreadUtils}
     class ContinuousDataSourceRDD(
         sc: SparkContext,
         sqlContext: SQLContext,
    -    @transient private val readTasks: java.util.List[ReadTask[UnsafeRow]])
    +    @transient private val readerFactories: java.util.List[DataReaderFactory[UnsafeRow]])
       extends RDD[UnsafeRow](sc, Nil) {
     
       private val dataQueueSize = sqlContext.conf.continuousStreamingExecutorQueueSize
       private val epochPollIntervalMs = sqlContext.conf.continuousStreamingExecutorPollIntervalMs
     
       override protected def getPartitions: Array[Partition] = {
    -    readTasks.asScala.zipWithIndex.map {
    +    readerFactories.asScala.zipWithIndex.map {
           case (readTask, index) => new DataSourceRDDPartition(index, readTask)
    --- End diff --
    
    readTask -> readerFactory


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    I'm kind of convinced but maybe it's because I wrote this `ReadTask`. Let's get feedback from more people, cc @RussellSpitzer @VincentPoncet @HyukjinKwon @wzhfy @dongjoon-hyun @j-baker 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    LGTM except a few comments. also cc @rdblue 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86770/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86646/testReport)** for PR 20397 at commit [`baa453a`](https://github.com/apache/spark/commit/baa453a6bb800b7136686a8b596542d8ae93ec1d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    **[Test build #86678 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86678/testReport)** for PR 20397 at commit [`45131ea`](https://github.com/apache/spark/commit/45131eacfcc84883fd2104ac4ec8eed7edb83761).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20397#discussion_r164342995
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -30,21 +30,21 @@
     @InterfaceStability.Evolving
     public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
       @Override
    -  default List<ReadTask<Row>> createReadTasks() {
    +  default List<DataReaderFactory<Row>> createDataReaderFactories() {
    --- End diff --
    
    I see, make sense. Maybe update the comment to explain why we need a list of `DataReaderFactory`s?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    In general I have very weak opinions on what classes are named :)
    
    I agree that readers and writers are very different in the DataSourceV2 API, and they're even more so in streaming-land. So a change emphasizing parallelism that isn't really there is more confusing than helpful. The only real commonality I see between ReadTask/DataReaderFactory and DataWriterFactory is that they are serializable intermediate representations.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20397
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86678/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org