Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/11/15 21:14:03 UTC

[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/23052

    [SPARK-26081][SQL] Prevent empty files for empty partitions in Text datasources

    ## What changes were proposed in this pull request?
    
    In this PR, I propose to postpone creation of `OutputStream`/`Univocity`/`JacksonGenerator` until the first row is written. This prevents the creation of empty files for empty partitions, so there is no need to open and read such files back when loading data from the location.
    
    ## How was this patch tested?
    
    Added tests for the Text, JSON and CSV datasources in which an empty dataset is written and must not produce any files.
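
The deferred creation described above can be pictured with a minimal sketch. It assumes a plain `java.nio` writer rather than Spark's actual classes: `LazyTextWriter` and `LazyTextWriterDemo` are hypothetical names introduced only for illustration, and the `Option` plus `getOrElse` shape merely mirrors the pattern visible in the diff further down this thread.

```scala
import java.io.{BufferedWriter, File}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical stand-in for a per-task output writer; not Spark's actual class.
class LazyTextWriter(path: File) {
  // Nothing is opened at construction time, so no file exists yet.
  private var writer: Option[BufferedWriter] = None

  def write(line: String): Unit = {
    // Open the file only when the first row arrives.
    val w = writer.getOrElse {
      val created = Files.newBufferedWriter(path.toPath, StandardCharsets.UTF_8)
      writer = Some(created)
      created
    }
    w.write(line)
    w.newLine()
  }

  // Closing is a no-op if nothing was ever written.
  def close(): Unit = writer.foreach(_.close())
}

object LazyTextWriterDemo {
  // Writes `rows` to a fresh path in a temp directory and reports
  // whether a file was produced.
  def producesFile(rows: Seq[String]): Boolean = {
    val dir = Files.createTempDirectory("lazy-writer").toFile
    val out = new File(dir, "part-00000.txt")
    val w = new LazyTextWriter(out)
    try rows.foreach(w.write) finally w.close()
    out.exists()
  }
}
```

Under these assumptions, `LazyTextWriterDemo.producesFile(Seq.empty)` creates no file, while any non-empty input does. The sketch relies on a single thread calling `write`, which is why the `var` is not synchronized.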


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 text-empty-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23052
    
----
commit 3efa7b615f7c37538edb0afca30d4f300ac07aee
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T19:44:47Z

    Added a test for text datasource

commit 80aadf645ab63885ce6f43ac74b0c02871e10883
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T20:11:00Z

    Creating output stream on the first write

commit 0a774ef9e4de987c9f3073b90396215b9f04ca16
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T20:20:27Z

    Test for csv

commit 47b71b7a235ffcdfa79753307f1afcb377a17977
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T20:21:06Z

    Don't produce empty CSV files

commit 040c71f8ea49ca10160cfa242095d6ebd2d76a8d
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T20:22:23Z

    Test for JSON

commit 6f3cb18d5a863f6aded763bdeb5395f6622876ff
Author: Maxim Gekk <ma...@...>
Date:   2018-11-15T20:32:32Z

    Do not produce empty JSON files

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #98887 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98887/testReport)** for PR 23052 at commit [`6f3cb18`](https://github.com/apache/spark/commit/6f3cb18d5a863f6aded763bdeb5395f6622876ff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    > Similar changes were proposed in Parquet few years ago (by me) and reverted.
    
    What was the main reason for reverting it? If possible, could you give me a link to your PR?


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99107 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99107/testReport)** for PR 23052 at commit [`9501d01`](https://github.com/apache/spark/commit/9501d01c6b0239b0da1e6ac1e99e6d5081e65258).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99407/testReport)** for PR 23052 at commit [`083d411`](https://github.com/apache/spark/commit/083d411ec1822986dbb82fbe1896a6c0d846c7d8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99407/
    Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99107/
    Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98887/
    Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by koertkuipers <gi...@git.apache.org>.
Github user koertkuipers commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    it is pretty common for us to write an empty dataframe to parquet and later read it back in
    the same goes for writing to csv with a header and reading it back in (with type inference disabled, we assume all strings)
    
    would this break those behaviors? 


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99380/
    Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    @MaxGekk, this is actually a fairly important behaviour change. It basically means we're unable to read the empty files back. Similar changes were proposed in Parquet few years ago (by me) and reverted.
    
    We had better first investigate and match the behaviours across datasources. IIRC, ORC does not create files (unless that has changed since I checked long ago), but Parquet does.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    @MaxGekk I didn't mean to block this PR. Since we're going ahead for 3.0, it would be good to match and fix the behaviours across data sources. For instance, CSV should still be able to read the header. Shall we clarify each data source's behaviour?


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99221/
    Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99361/
    Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Also, Parquet does not always write empty files. It does not write empty files when data frames are created from emptyRDD (the one pointed out in the PR link I gave). We should match this behaviour as well.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99372/
    Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    > seems like a real failure
    
    I am looking at it. It seems the test is not deterministic. 


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r236584201
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
    --- End diff --
    
    We have not observed any race conditions so far. Instances of `UnivocityGenerator`, as well as `OutputStreamWriter`s, are created per task. They share instances of the schema and `CSVOptions`, but we do not modify them while writing. Inside each `UnivocityGenerator` we create an instance of `CsvWriter`, but I am almost absolutely sure they do not share anything internally.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Another related attempt: https://github.com/apache/spark/pull/13252


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged to master


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    I think now is a good time to match the behaviours.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    First of all, sometimes we do need to write "empty" files, so that we can infer the schema of a parquet directory. An empty parquet file is not really empty, as it has a header/footer. https://github.com/apache/spark/pull/20525 guarantees we always write out at least one empty file.
    
    One important thing is that when we write out an empty dataframe to a file and read it back, it should still be an empty dataframe. I'd suggest we skip empty files in text-based data sources, and later on send a followup PR to not write empty text files, as a perf improvement.
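
The read-side half of this suggestion, skipping empty files in text-based sources, can be sketched as follows. This is only an illustration over the local filesystem with `java.io` and `scala.io.Source`; `SkipEmptyFiles.readLines` is a hypothetical helper, not Spark code, and it treats "empty" as meaning zero bytes long.

```scala
import java.io.File
import scala.io.Source

object SkipEmptyFiles {
  // Reads every line from the non-empty regular files in `dir`,
  // visiting files in name order.
  def readLines(dir: File): Seq[String] = {
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File]).toSeq
    files
      .filter(f => f.isFile && f.length() > 0) // zero-byte files are never opened
      .sortBy(_.getName)
      .flatMap { f =>
        val src = Source.fromFile(f, "UTF-8")
        try src.getLines().toList finally src.close()
      }
  }
}
```

With such a filter in place, a directory containing only zero-byte part files reads back as an empty sequence, which matches the round-trip expectation stated above for an empty dataframe.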


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #98887 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98887/testReport)** for PR 23052 at commit [`6f3cb18`](https://github.com/apache/spark/commit/6f3cb18d5a863f6aded763bdeb5395f6622876ff).


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r234211079
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -174,13 +174,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
     
    -  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    -
    -  private val gen = new UnivocityGenerator(dataSchema, writer, params)
    +  override def write(row: InternalRow): Unit = {
    +    val gen = univocityGenerator.getOrElse {
    --- End diff --
    
    I do think it is fine to write only headers if a user wants to have them. Filtering the header out at this level could be slightly difficult.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    retest this please


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    cc @cloud-fan as well


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99361/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    seems like a real failure


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99354/
    Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    jenkins, retest this, please


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r236659185
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
    --- End diff --
    
    > ... but that it could create many generators and writers that aren't closed. 
    
    Writers/generators are created inside of tasks: https://github.com/apache/spark/blob/ab1650d2938db4901b8c28df945d6a0691a19d31/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L228-L256 where `dataWriter.commit()` and `dataWriter.abort()` close writers/generators. So the number of unclosed generators is at most the size of the task thread pool on executors at any moment.
    
    > Unless we know writes will only happen in one thread ...
    
    According to comments below, this is our assumption: https://github.com/apache/spark/blob/e8167768cfebfdb11acd8e0a06fe34ca43c14648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L33-L37
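
The argument that every task closes its writer through either `commit()` or `abort()` can be illustrated with a small sketch. All names here (`DataWriter`, `TrackingWriter`, `TaskRunner`) are hypothetical and only echo the calls mentioned above; this is not the code behind the linked `FileFormatWriter` lines, and it assumes one thread per task.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical minimal writer interface; not Spark's actual API.
trait DataWriter {
  def write(row: String): Unit
  def commit(): Unit
  def abort(): Unit
}

// Tracks how many writers are currently open, for demonstration only.
class TrackingWriter(open: AtomicInteger) extends DataWriter {
  private var opened = false

  def write(row: String): Unit = {
    // Lazily "open" the underlying resource on the first row.
    if (!opened) { opened = true; open.incrementAndGet() }
    if (row == null) throw new IllegalArgumentException("null row")
  }

  private def close(): Unit =
    if (opened) { opened = false; open.decrementAndGet() }

  def commit(): Unit = close()
  def abort(): Unit = close()
}

object TaskRunner {
  // Runs one task: every exit path goes through commit() or abort().
  def runTask(rows: Iterator[String], writer: DataWriter): Unit = {
    try {
      rows.foreach(writer.write)
      writer.commit()
    } catch {
      case t: Throwable =>
        writer.abort()
        throw t
    }
  }
}
```

Because each running task holds at most one open writer and releases it on both the success and the failure path, the open count in this sketch can never exceed the number of tasks running concurrently.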
    
    



---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5438/
    Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99372/testReport)** for PR 23052 at commit [`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    There are two more things to deal with:
    
    The https://github.com/apache/spark/pull/23052#issuecomment-440687200 comment will still be valid. At the least it should be double-checked, because dataframes originating from emptyRDD never write anything.
    
    The other thing is CSV, which differs from the other text-based datasources because it can write out headers. The thing is, the header is currently written when the first row is written. This is what the https://github.com/apache/spark/pull/13252 PR targeted before. I closed it because there was no interest, but we should fix it.
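
The consequence of writing the header together with the first row can be seen in a small sketch. `LazyCsvWriter` and `LazyCsvDemo` are hypothetical names, the column names `a` and `b` are made up, and quoting and escaping are deliberately omitted; this is not how `UnivocityGenerator` is implemented.

```scala
import java.io.{BufferedWriter, File}
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source

// Hypothetical lazy CSV writer; quoting and escaping are deliberately omitted.
class LazyCsvWriter(path: File, header: Seq[String]) {
  private var writer: Option[BufferedWriter] = None

  def write(row: Seq[String]): Unit = {
    val w = writer.getOrElse {
      val created = Files.newBufferedWriter(path.toPath, StandardCharsets.UTF_8)
      // The header goes out together with the first row, never on its own.
      created.write(header.mkString(","))
      created.newLine()
      writer = Some(created)
      created
    }
    w.write(row.mkString(","))
    w.newLine()
  }

  def close(): Unit = writer.foreach(_.close())
}

object LazyCsvDemo {
  // Returns the lines of the produced file, or None if no file was created.
  def roundTrip(rows: Seq[Seq[String]]): Option[List[String]] = {
    val dir = Files.createTempDirectory("lazy-csv").toFile
    val out = new File(dir, "part-00000.csv")
    val w = new LazyCsvWriter(out, Seq("a", "b"))
    try rows.foreach(w.write) finally w.close()
    if (!out.exists()) None
    else {
      val src = Source.fromFile(out, "UTF-8")
      try Some(src.getLines().toList) finally src.close()
    }
  }
}
```

In this sketch an empty input yields no file at all, so the header is lost along with the data. That is the gap the comment above points at: a reader that relies on the header cannot recover column names from an empty partition.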


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Actually, it needs changes similar to those in https://github.com/apache/spark/pull/23130


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r234062564
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -174,13 +174,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
     
    -  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    -
    -  private val gen = new UnivocityGenerator(dataSchema, writer, params)
    +  override def write(row: InternalRow): Unit = {
    +    val gen = univocityGenerator.getOrElse {
    --- End diff --
    
    Also, one thing we should not forget is that CSV _could_ have headers even if there are no records.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99372/testReport)** for PR 23052 at commit [`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99354/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99221/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class RuleSummary(`
      * `class QueryPlanningTracker `
      * `class QueryExecution(`


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99380/testReport)** for PR 23052 at commit [`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99354/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5432/
    Test PASSed.


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r236652952
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
    --- End diff --
    
    I don't mean that it would cause an error, but that it could create many generators and writers that aren't closed. It may not be obvious that it's happening. Unless we know writes will only happen in one thread, what about breaking out and synchronizing the get/create part of this method?
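
A minimal sketch of the suggestion above, under stated assumptions: `SyncLazyWriter` and its `open` parameter are hypothetical stand-ins (not Spark classes), with `open` playing the role of `CodecStreams.createOutputStreamWriter`. It only illustrates breaking the get/create step out into its own synchronized method so that concurrent callers cannot open more than one underlying writer; it is not the change that was merged.

```scala
import java.io.{StringWriter, Writer}

// Hypothetical writer, for illustration only. `open` is an assumed factory
// that stands in for creating the real output stream.
class SyncLazyWriter(open: () => Writer) {
  private var out: Option[Writer] = None

  // The get-or-create step is isolated and synchronized, so at most one
  // underlying writer is ever opened even if several threads call write().
  private def getOrCreate(): Writer = synchronized {
    out.getOrElse {
      val w = open()
      out = Some(w)
      w
    }
  }

  def write(row: String): Unit = {
    val w = getOrCreate()
    // Serialize the row write as well, so rows are not interleaved.
    w.synchronized { w.write(row + "\n") }
  }

  // Closes the writer only if one was actually created.
  def close(): Unit = synchronized { out.foreach(_.close()) }
}
```

If writes are known to happen on a single thread per task, the synchronization is unnecessary overhead and the plain `Option`-based version is enough.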


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r237210777
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
     
    -  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    -
    -  private val gen = new UnivocityGenerator(dataSchema, writer, params)
    +  override def write(row: InternalRow): Unit = {
    +    val gen = univocityGenerator.getOrElse {
    +      val charset = Charset.forName(params.charset)
    +      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    +      new UnivocityGenerator(dataSchema, os, params)
    +    }
    +    univocityGenerator = Some(gen)
    --- End diff --
    
    Doesn't this need to be in the `getOrElse` block? Functionally it doesn't matter, but as written it sets the field to itself on every call, and maybe that's a little bit of overhead to avoid.
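
A minimal sketch of what moving the assignment into the `getOrElse` block could look like. The names `FakeGenerator`, `LazyCsvWriter`, and `open` are hypothetical stand-ins for `UnivocityGenerator`, `CsvOutputWriter`, and the stream factory; this is an illustration of the pattern under those assumptions, not the code from the PR.

```scala
import java.io.{StringWriter, Writer}

// Hypothetical stand-in for UnivocityGenerator: it just appends rows.
class FakeGenerator(out: Writer) {
  def write(row: String): Unit = out.write(row + "\n")
  def close(): Unit = out.close()
}

// Hypothetical writer. `open` is an assumed factory that stands in for
// creating the real output stream.
class LazyCsvWriter(open: () => Writer) {
  private var generator: Option[FakeGenerator] = None

  def write(row: String): Unit = {
    val gen = generator.getOrElse {
      val newGen = new FakeGenerator(open())
      generator = Some(newGen) // assigned once, only when first created
      newGen
    }
    gen.write(row)
  }

  // A writer that never received a row never opened any output.
  def close(): Unit = generator.foreach(_.close())
}
```

With this shape, closing a writer that never saw a row does not call `open` at all, which is the property the PR relies on to avoid empty files.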


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/23052


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99380/testReport)** for PR 23052 at commit [`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99107 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99107/testReport)** for PR 23052 at commit [`9501d01`](https://github.com/apache/spark/commit/9501d01c6b0239b0da1e6ac1e99e6d5081e65258).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99361/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r237287908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
     
    -  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    -
    -  private val gen = new UnivocityGenerator(dataSchema, writer, params)
    +  override def write(row: InternalRow): Unit = {
    +    val gen = univocityGenerator.getOrElse {
    +      val charset = Charset.forName(params.charset)
    +      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    +      new UnivocityGenerator(dataSchema, os, params)
    +    }
    +    univocityGenerator = Some(gen)
    --- End diff --
    
    Yes, that's right.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    retest this please


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    **[Test build #99407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99407/testReport)** for PR 23052 at commit [`083d411`](https://github.com/apache/spark/commit/083d411ec1822986dbb82fbe1896a6c0d846c7d8).


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r236097176
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
    --- End diff --
    
    Do we have a race condition below then where multiple generators can be created?


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    Which should be ... this https://github.com/apache/spark/pull/12855


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    I have read the tickets you pointed out but haven't found anything that could potentially block the changes. One corner case is saving an empty dataframe. In this case, no files would be written, but this is OK for text-based datasources because, in any case, we cannot fully restore the schema from empty files (unlike Parquet files, where we can). So a schema must be provided by the user.


---



[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23052
  
    FYI, one attempt to add some tests for reading/writing empty dataframes was here: https://github.com/apache/spark/pull/13253


---



[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23052#discussion_r237282738
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
         context: TaskAttemptContext,
         params: CSVOptions) extends OutputWriter with Logging {
     
    -  private val charset = Charset.forName(params.charset)
    +  private var univocityGenerator: Option[UnivocityGenerator] = None
     
    -  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    -
    -  private val gen = new UnivocityGenerator(dataSchema, writer, params)
    +  override def write(row: InternalRow): Unit = {
    +    val gen = univocityGenerator.getOrElse {
    +      val charset = Charset.forName(params.charset)
    +      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    +      new UnivocityGenerator(dataSchema, os, params)
    +    }
    +    univocityGenerator = Some(gen)
    --- End diff --
    
    done


---
