You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by koertkuipers <gi...@git.apache.org> on 2018/11/28 20:05:16 UTC
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
GitHub user koertkuipers opened a pull request:
https://github.com/apache/spark/pull/23173
[SPARK-26208][SQL] add headers to empty csv files when header=true
## What changes were proposed in this pull request?
Add headers to empty csv files when header=true, because otherwise these files are invalid when reading.
## How was this patch tested?
Added test for roundtrip of empty dataframe to csv file with headers and back in CSVSuite
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tresata-opensource/spark feat-empty-csv-with-header
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23173.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23173
----
commit 3192656ad91d326824360f4d4890dc1f6c3f6393
Author: Koert Kuipers <ko...@...>
Date: 2018-11-28T19:03:20Z
write headers to empty csv files when header=true
commit aad5f09710d4b6d4aafa810307b3cae9c965babf
Author: Koert Kuipers <ko...@...>
Date: 2018-11-28T20:00:22Z
Merge branch 'master' into feat-empty-csv-with-header
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99519/testReport)** for PR 23173 at commit [`6165e1a`](https://github.com/apache/spark/commit/6165e1a6c65601b57a06905c497cc18369d74354).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99397/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99467/testReport)** for PR 23173 at commit [`258a1e4`](https://github.com/apache/spark/commit/258a1e437ce6fdf89bbed79b58175b3b1044d2c6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99519/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5543/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237579324
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
* executor side. This instance is used to persist rows to this single output file.
*/
abstract class OutputWriter {
+ /** Initializes before writing any rows. Invoked on executor size. */
+ def init(): Unit
--- End diff --
yeah makes sense, i will make that change
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237663865
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
--- End diff --
that doesnt seem to be what is happening.
if i do a .repartition(4) on empty dataframe it still only writes one part file with header
if i do a .repartition(4) on a dataframe with 2 elements then it writes 2 part files with headers
so it seems empty partitions get pruned, except when all partitions are empty then it writes a single partition thanks to SPARK-23271
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99553/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237634432
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
+ df1.printSchema
+ df1.write.format("csv").option("header", true).save(path.getAbsolutePath)
+ val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath)
--- End diff --
I would set the `inferSchema` option to `false` explicitly because your test assumes that.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5618/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238058594
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
Wait .. is this going to create `UnivocityGenerator` for each record?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5612/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238064338
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
I see the problem. OrcFileFormat uses a flag approach. For instance:
```
private var isGeneratorInitiated = false
lazy val univocityGenerator = {
val charset = Charset.forName(params.charset)
val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
new UnivocityGenerator(dataSchema, os, params)
}
if (isGeneratorInitiated) {
univocityGenerator.close()
}
```
Should be okay to stick to it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99405/testReport)** for PR 23173 at commit [`bfadbf9`](https://github.com/apache/spark/commit/bfadbf9ae5df8e4a1e84ffff5c82dddc91d5082f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237687091
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
--- End diff --
i can do that, but i think when i write it out and read it back in it will come back in as 1 partition (one part file with header) because of SPARK-23271. is that worth checking for?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99405/testReport)** for PR 23173 at commit [`bfadbf9`](https://github.com/apache/spark/commit/bfadbf9ae5df8e4a1e84ffff5c82dddc91d5082f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237883918
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ private def getOrCreateGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+ if (params.headerFlag) {
--- End diff --
My only nit here is that this is part of the constructor, but lives between two methods, which is a little less clear. Maybe move it just after the member declarations?
Also you could inline this to things like `getOrCreateGen().writeHeaders()`, but doesn't matter.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99477/testReport)** for PR 23173 at commit [`088c710`](https://github.com/apache/spark/commit/088c7106b8837ada52ad3a5b1d356c3572ff0af7).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5471/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99477/testReport)** for PR 23173 at commit [`088c710`](https://github.com/apache/spark/commit/088c7106b8837ada52ad3a5b1d356c3572ff0af7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99485/testReport)** for PR 23173 at commit [`6f498a0`](https://github.com/apache/spark/commit/6f498a043a2347f6f391257d04e6d7bf5f98470d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class CSVInferSchema(options: CSVOptions) extends Serializable `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238059569
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with headers") {
+ withTempPath { path =>
+ val df1 = spark.range(10).repartition(2).filter(_ < 0).map(_.toString).toDF
+ // we have 2 partitions but they are both empty and will be filtered out upon writing
+ // thanks to SPARK-23271 one new empty partition will be inserted
+ df1.write.format("csv").option("header", true).save(path.getAbsolutePath)
+ val df2 = spark.read.format("csv").option("header", true).option("inferSchema", false)
+ .load(path.getAbsolutePath)
+ assert(df1.rdd.getNumPartitions == 2)
+ assert(df2.rdd.getNumPartitions == 1)
--- End diff --
nit: I wouldn't check number of partition here since it is implementation specific and doesn't matter for behavior checked in the test.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/23173
It seems this is similar to @HyukjinKwon PR: https://github.com/apache/spark/pull/13252
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5552/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99467/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99519/testReport)** for PR 23173 at commit [`6165e1a`](https://github.com/apache/spark/commit/6165e1a6c65601b57a06905c497cc18369d74354).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99397/testReport)** for PR 23173 at commit [`aad5f09`](https://github.com/apache/spark/commit/aad5f09710d4b6d4aafa810307b3cae9c965babf).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5474/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99486/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99400/testReport)** for PR 23173 at commit [`1f897be`](https://github.com/apache/spark/commit/1f897be372e1ffca5445a36d2c13c8ef2a489e2e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5551/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99559/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237633638
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
+ df1.printSchema
--- End diff --
`printSchema` should be removed. This is not necessary for test.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238073218
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
Yeah we have two different approaches, both of which are fine IMHO. I think it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237633755
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
+ df1.printSchema
+ df1.write.format("csv").option("header", true).save(path.getAbsolutePath)
+ val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath)
+ df2.printSchema
--- End diff --
Please, remove this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99559/testReport)** for PR 23173 at commit [`9d5cb7b`](https://github.com/apache/spark/commit/9d5cb7be9a8cb97bd54dd1e938ba819ed3066351).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99477/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238059480
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
@HyukjinKwon Do you mean creating generator in lazy val?
```
lazy val univocityGenerator = {
val charset = Charset.forName(params.charset)
val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
new UnivocityGenerator(dataSchema, os, params)
}
```
The problem is in the close method, you will have to call `univocityGenerator.close()` in the method. If the lazy val wasn't instantiated before (empty partition and the `header` option is `false`), the generator will be created and closed immediately. And as a result, you will get an empty file for the empty partition. That's why I prefer the approach with `Option[UnivocityGenerator]` in https://github.com/apache/spark/pull/23052.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237561245
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
* executor side. This instance is used to persist rows to this single output file.
*/
abstract class OutputWriter {
+ /** Initializes before writing any rows. Invoked on executor size. */
+ def init(): Unit
--- End diff --
Rather than make subclasses implement as a no-op, just provide that no-op impl here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99400/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on the issue:
https://github.com/apache/spark/pull/23173
i was not aware of SPARK-15473. thanks. let me look at @HyukjinKwon pullreq and mark my jira as a duplicate.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99485/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99485/testReport)** for PR 23173 at commit [`6f498a0`](https://github.com/apache/spark/commit/6f498a043a2347f6f391257d04e6d7bf5f98470d).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99467/testReport)** for PR 23173 at commit [`258a1e4`](https://github.com/apache/spark/commit/258a1e437ce6fdf89bbed79b58175b3b1044d2c6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237635061
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
--- End diff --
This will create a dataframe with one empty partition. I would check the case when there are more than 2 empty partitions, and each saved file has a header.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/23173
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237679872
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
assert(errMsg2.contains("'lineSep' can contain only 1 character"))
}
+ test("SPARK-26208: write and read empty data to csv file with header") {
+ withTempPath { path =>
+ val df1 = Seq.empty[(String, String)].toDF("x", "y")
--- End diff --
You can do something like this:
```
val df = spark.range(10).repartition(2).filter(_ < 0)
```
```
scala> df.rdd.getNumPartitions
res0: Int = 2
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99486/testReport)** for PR 23173 at commit [`29fc6b8`](https://github.com/apache/spark/commit/29fc6b89094841ba2a28827247305e4fa6c01520).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5535/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238077135
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
i will revert this change to lazy val for now since it doesnt have anything to do wit this pullreq or jira: the Option approach was created in another pullreq.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237716913
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
* executor side. This instance is used to persist rows to this single output file.
*/
abstract class OutputWriter {
+ /** Initializes before writing any rows. Invoked on executor size. */
+ def init(): Unit = {}
--- End diff --
do the init logic in the constructor for CsvOutputWriter instead?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99553/testReport)** for PR 23173 at commit [`238efa5`](https://github.com/apache/spark/commit/238efa59fe3b63f88a36b1bb7634dd868e2f2097).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238058669
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
Ah, it's `getOrElse`. Okay but still can we simplify this logic? Looks a bit confusing. For instance, I think we can do this with lazy val.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237711250
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
* executor side. This instance is used to persist rows to this single output file.
*/
abstract class OutputWriter {
+ /** Initializes before writing any rows. Invoked on executor size. */
+ def init(): Unit = {}
--- End diff --
I don't think we need to add init only because one case CSV needs it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99405/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23173
Looks fine to me too.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5478/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238091149
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
OK. Shouldn't be a big deal.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238070641
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
Frankly speaking I don't see any reasons for this. For now we have 2 flags actually - `isGeneratorInitiated` and another one inside of lazy val. And 2 slightly different approaches - with the `Option` type in Json and Text, and lazy val + flag in Orc and CSV.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99400/testReport)** for PR 23173 at commit [`1f897be`](https://github.com/apache/spark/commit/1f897be372e1ffca5445a36d2c13c8ef2a489e2e).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r237884151
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ private def getOrCreateGen(): UnivocityGenerator = univocityGenerator.getOrElse {
--- End diff --
This is a really small thing, I don't feel strongly about, but what about just `getGen()`? the caller doesn't care whether it's created.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on a diff in the pull request:
https://github.com/apache/spark/pull/23173#discussion_r238068538
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
private var univocityGenerator: Option[UnivocityGenerator] = None
- override def write(row: InternalRow): Unit = {
- val gen = univocityGenerator.getOrElse {
- val charset = Charset.forName(params.charset)
- val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
- val newGen = new UnivocityGenerator(dataSchema, os, params)
- univocityGenerator = Some(newGen)
- newGen
- }
+ if (params.headerFlag) {
+ val gen = getGen()
+ gen.writeHeaders()
+ }
+ private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+ val charset = Charset.forName(params.charset)
+ val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+ val newGen = new UnivocityGenerator(dataSchema, os, params)
+ univocityGenerator = Some(newGen)
+ newGen
+ }
+
+ override def write(row: InternalRow): Unit = {
+ val gen = getGen()
--- End diff --
ok i changed it to lazy val and flag
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5583/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99559/testReport)** for PR 23173 at commit [`9d5cb7b`](https://github.com/apache/spark/commit/9d5cb7be9a8cb97bd54dd1e938ba819ed3066351).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23173
Merged to master.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99486/testReport)** for PR 23173 at commit [`29fc6b8`](https://github.com/apache/spark/commit/29fc6b89094841ba2a28827247305e4fa6c01520).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99397/testReport)** for PR 23173 at commit [`aad5f09`](https://github.com/apache/spark/commit/aad5f09710d4b6d4aafa810307b3cae9c965babf).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23173
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23173
**[Test build #99553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99553/testReport)** for PR 23173 at commit [`238efa5`](https://github.com/apache/spark/commit/238efa59fe3b63f88a36b1bb7634dd868e2f2097).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org