You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/04/02 18:50:04 UTC

[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20963

    [SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource

    ## What changes were proposed in this pull request?
    
    Proposed tests checks that only subset of input dataset is touched during schema inferring.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-sampling-tests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20963
    
----
commit 76e38a8670303afe20b530fc2a837a1363a01974
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T13:28:32Z

    Adding samplingRation tests for json

commit 1acc3ece4121ffc3209fb65a20526dec574820e5
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T13:34:06Z

    Removing debug code

commit 0d5fcfb59e112cc989753404813d32432257a0f6
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T18:45:18Z

    Adding the ticket ref to test titles

commit a664465d8a3042f06cf7a866283c34a674b24939
Author: Maxim Gekk <ma...@...>
Date:   2018-04-02T18:48:56Z

    Making Scala style checker happy

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    **[Test build #88828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88828/testReport)** for PR 20963 at commit [`a664465`](https://github.com/apache/spark/commit/a664465d8a3042f06cf7a866283c34a674b24939).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r179934815
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    Yes. The seed is also given. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r180535344
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    It seems specifying only `spark.sql.files.maxPartitionBytes` is not enough. Please, look at the [formula](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L406) and [slicing input files](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L415):
    
    ```
    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    ```
    
    Is ok if I just check that file size is less than `maxSplitBytes`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r178691639
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    @MaxGekk, wouldn't this test be flaky?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r179934861
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    --- End diff --
    
    Not need to have so many elements in this set. Please combine the tests in your CSV PR.
    
    Instead of calling `json()`, we can do it using `format("json")`. Then, you can combine the test cases for both CSV and Json. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r180605579
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    yup, please set the appropriate numbers. I think it's fine if it has some comments so that we read and fix the tests if it's broken.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r179257489
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    It could be if the partitionIndex is flaky here: https://github.com/apache/spark/blob/2ce37b50fc01558f49ad22f89c8659f50544ffec/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L320 but in both tests there is only one partition with stable index `0`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r180610436
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    ^ It's based upon actual experience before. There was a similar case that the test was broken due to the number of partitions and it took me a while to debug it, https://issues.apache.org/jira/browse/SPARK-13728


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20963#discussion_r179937139
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           assert(df.schema === expectedSchema)
         }
       }
    +
    +  test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
    +    val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
    +      57, 62, 68, 72)
    +    withTempPath { path =>
    +      val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
    +        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
    +      for (i <- 0 until 100) {
    +        if (predefinedSample.contains(i)) {
    +          writer.write(s"""{"f1":${i.toString}}""" + "\n")
    +        } else {
    +          writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
    +        }
    +      }
    +      writer.close()
    +
    +      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    --- End diff --
    
    OK but I don't think we are guaranteed to have always one partition here though. Shall we at least explicitly set `spark.sql.files.maxPartitionBytes` big enough with some comments?
    
    I think we shouldn't encourage this way because it should likely be easy to be broken IMHO. I am fine with it anyway as I can't think of a better way on the other hand.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    LGTM. Thanks! Merged to master/


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    **[Test build #88828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88828/testReport)** for PR 20963 at commit [`a664465`](https://github.com/apache/spark/commit/a664465d8a3042f06cf7a866283c34a674b24939).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88828/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20963
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20963


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org