You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/04/02 18:50:04 UTC
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/20963
[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource
## What changes were proposed in this pull request?
Proposed tests checks that only subset of input dataset is touched during schema inferring.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 json-sampling-tests
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20963.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20963
----
commit 76e38a8670303afe20b530fc2a837a1363a01974
Author: Maxim Gekk <ma...@...>
Date: 2018-04-02T13:28:32Z
Adding samplingRation tests for json
commit 1acc3ece4121ffc3209fb65a20526dec574820e5
Author: Maxim Gekk <ma...@...>
Date: 2018-04-02T13:34:06Z
Removing debug code
commit 0d5fcfb59e112cc989753404813d32432257a0f6
Author: Maxim Gekk <ma...@...>
Date: 2018-04-02T18:45:18Z
Adding the ticket ref to test titles
commit a664465d8a3042f06cf7a866283c34a674b24939
Author: Maxim Gekk <ma...@...>
Date: 2018-04-02T18:48:56Z
Making Scala style checker happy
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20963
**[Test build #88828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88828/testReport)** for PR 20963 at commit [`a664465`](https://github.com/apache/spark/commit/a664465d8a3042f06cf7a866283c34a674b24939).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r179934815
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
Yes. The seed is also given.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r180535344
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
It seems specifying only `spark.sql.files.maxPartitionBytes` is not enough. Please, look at the [formula](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L406) and [slicing input files](https://github.com/apache/spark/blob/400a1d9e25c1196f0be87323bd89fb3af0660166/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L415):
```
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```
Is ok if I just check that file size is less than `maxSplitBytes`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r178691639
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
@MaxGekk, wouldn't this test be flaky?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20963
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r179934861
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
--- End diff --
Not need to have so many elements in this set. Please combine the tests in your CSV PR.
Instead of calling `json()`, we can do it using `format("json")`. Then, you can combine the test cases for both CSV and Json.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r180605579
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
yup, please set the appropriate numbers. I think it's fine if it has some comments so that we read and fix the tests if it's broken.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r179257489
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
It could be if the partitionIndex is flaky here: https://github.com/apache/spark/blob/2ce37b50fc01558f49ad22f89c8659f50544ffec/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L320 but in both tests there is only one partition with stable index `0`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r180610436
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
^ It's based upon actual experience before. There was a similar case that the test was broken due to the number of partitions and it took me a while to debug it, https://issues.apache.org/jira/browse/SPARK-13728
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20963#discussion_r179937139
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2127,4 +2127,39 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(df.schema === expectedSchema)
}
}
+
+ test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
+ val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
+ 57, 62, 68, 72)
+ withTempPath { path =>
+ val writer = Files.newBufferedWriter(Paths.get(path.getAbsolutePath),
+ StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)
+ for (i <- 0 until 100) {
+ if (predefinedSample.contains(i)) {
+ writer.write(s"""{"f1":${i.toString}}""" + "\n")
+ } else {
+ writer.write(s"""{"f1":${(i.toDouble + 0.1).toString}}""" + "\n")
+ }
+ }
+ writer.close()
+
+ val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
--- End diff --
OK but I don't think we are guaranteed to have always one partition here though. Shall we at least explicitly set `spark.sql.files.maxPartitionBytes` big enough with some comments?
I think we shouldn't encourage this way because it should likely be easy to be broken IMHO. I am fine with it anyway as I can't think of a better way on the other hand.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20963
LGTM. Thanks! Merged to master/
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20963
**[Test build #88828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88828/testReport)** for PR 20963 at commit [`a664465`](https://github.com/apache/spark/commit/a664465d8a3042f06cf7a866283c34a674b24939).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20963
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88828/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20963: [SPARK-23849][SQL] Tests for the samplingRatio option of...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20963
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20963: [SPARK-23849][SQL] Tests for the samplingRatio op...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20963
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org