You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/08/20 08:40:36 UTC
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22152
[SPARK-25159][SQL] json schema inference should only trigger one job
## What changes were proposed in this pull request?
This fixes a perf regression caused by https://github.com/apache/spark/pull/21376 .
We should not use `RDD#toLocalIterator`, which triggers one Spark job per RDD partition. This is very bad for RDDs with a lot of small partitions.
To fix it, this PR introduces a way to access SQLConf in the scheduler event loop thread, so that we don't need to use `RDD#toLocalIterator` anymore in `JsonInferSchema`.
## How was this patch tested?
a new test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark conf
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22152.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22152
----
commit cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9
Author: Wenchen Fan <we...@...>
Date: 2018-08-17T04:30:31Z
allow accessing SQLConf in the scheduler event loop thread
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95077/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95022/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2373/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2417/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22152
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2317/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211860311
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
yes, makes sense, thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95077/testReport)** for PR 22152 at commit [`23dfcda`](https://github.com/apache/spark/commit/23dfcda279d0a854b0e64263a109dfd8d0b98b93).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95077/testReport)** for PR 22152 at commit [`23dfcda`](https://github.com/apache/spark/commit/23dfcda279d0a854b0e64263a109dfd8d0b98b93).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211585144
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
just a question, wouldn't:
```
val partitionsResult = json.sparkContext.runJob(mergedTypesFromPartitions, foldPartition)
partitionsResult.fold(typeMerger)
```
do the same without requiring these changes?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r212183703
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
Same question was in my mind. thanks for clarification.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2409/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211621168
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
it would contain one result per partition, do you think this is enough to cause GC problems?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #94952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94952/testReport)** for PR 22152 at commit [`cf13d71`](https://github.com/apache/spark/commit/cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211626457
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2528,4 +2529,27 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
checkAnswer(aggPlusFilter1, aggPlusFilter2.collect())
}
}
+
+ test("SPARK-25159: json schema inference should only trigger one job") {
+ withTempPath { path =>
+ // This test is to prove that the `JsonInferSchema` does not use `RDD#toLocalIterator` which
+ // triggers one Spark job per RDD partition.
+ Seq(1 -> "a", 2 -> "b").toDF("i", "p")
+ // The data set has 2 partitions, so Spark will write at least 2 json files.
+ // Use a non-splittable compression (gzip), to make sure the json scan RDD has at lease 2
--- End diff --
nit: `at least`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95068/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22152
Thanks! Merged to master.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211189624
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
--- End diff --
Need to do `sc.clean(typeMerger)` manually here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211584402
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -116,16 +129,24 @@ object SQLConf {
if (TaskContext.get != null) {
new ReadOnlySQLConf(TaskContext.get())
} else {
- if (Utils.isTesting && SparkContext.getActive.isDefined) {
+ val isSchedulerEventLoopThread = SparkContext.getActive
+ .map(_.dagScheduler.eventProcessLoop.eventThread)
+ .exists(_.getId == Thread.currentThread().getId)
+ if (isSchedulerEventLoopThread) {
// DAGScheduler event loop thread does not have an active SparkSession, the `confGetter`
- // will return `fallbackConf` which is unexpected. Here we prevent it from happening.
- val schedulerEventLoopThread =
- SparkContext.getActive.get.dagScheduler.eventProcessLoop.eventThread
- if (schedulerEventLoopThread.getId == Thread.currentThread().getId) {
+ // will return `fallbackConf` which is unexpected. Here we requires the caller to get the
--- End diff --
nit: we require
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/22152
retest this please.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22152
cc @gatorsmile @hvanhovell @kiszk @viirya @mgaido91
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95022/testReport)** for PR 22152 at commit [`95ec4d7`](https://github.com/apache/spark/commit/95ec4d7f196a20a0b6461244523a9418021677f6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211815985
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
the schema can be very complex (e.g. very wide and deep schema).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #94952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94952/testReport)** for PR 22152 at commit [`cf13d71`](https://github.com/apache/spark/commit/cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211607005
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+ val mergeResult = (index: Int, taskResult: DataType) => {
+ rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --
ah good point!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94952/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95022/testReport)** for PR 22152 at commit [`95ec4d7`](https://github.com/apache/spark/commit/95ec4d7f196a20a0b6461244523a9418021677f6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22152
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95068/testReport)** for PR 22152 at commit [`95ec4d7`](https://github.com/apache/spark/commit/95ec4d7f196a20a0b6461244523a9418021677f6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211496317
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
--- End diff --
This closure is defined by us and I don't think we leak outer reference here. If we do, it's a bug and we should fix it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22152#discussion_r211599957
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
}.reduceOption(typeMerger).toIterator
}
- // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
- // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
- // active SparkSession and `SQLConf.get` may point to the wrong configs.
- val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+ // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+ // the fold functions in the scheduler event loop thread.
+ val existingConf = SQLConf.get
+ var rootType: DataType = StructType(Nil)
+ val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
--- End diff --
Yeah, agreed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22152: [SPARK-25159][SQL] json schema inference should only tri...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22152
**[Test build #95068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95068/testReport)** for PR 22152 at commit [`95ec4d7`](https://github.com/apache/spark/commit/95ec4d7f196a20a0b6461244523a9418021677f6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org