You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wayneguow (via GitHub)" <gi...@apache.org> on 2023/02/03 10:55:47 UTC

[GitHub] [spark] wayneguow opened a new pull request, #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

wayneguow opened a new pull request, #39878:
URL: https://github.com/apache/spark/pull/39878

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Add a new config for users to restores the legacy behavior that pass '\u0000' character as written comment option to univocity-parsers `CsvFormat`.
   
   
   ### Why are the changes needed?
   In #29516 , in order to fix some bugs, univocity-parsers was upgrade from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Add a new test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415948930

   > But you set it to \u0000 here. The caller can already do that.
   
   With the `isCommentSet ` condition:
   `val isCommentSet = this.comment != '\u0000'`
   when users set comment as '\u0000' explicitly, it can not be passed to univocity-parsers. The default value in univocity-parsers is '#'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415937877

   But you set it to \u0000 here. The caller can already do that. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1423245678

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1420222944

   Gentle ping @srowen 
   I have made some changes for `isCommentSet `, but it will bring some little differences with before. I put the behavior comparison in a table in [SPARK-42335](https://issues.apache.org/jira/browse/SPARK-42335) 's description. I'm not sure it's worth it.
   
   The best way is to add a `else` condition like parsers after this pr [univocity-parsers#518](https://github.com/uniVocity/univocity-parsers/pull/518) is merged, but univocity-parsers seems to be inactive, there is no any update of for a long time.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on a diff in pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on code in PR #39878:
URL: https://github.com/apache/spark/pull/39878#discussion_r1095705128


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -283,6 +286,8 @@ class CSVOptions(
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
     if (isCommentSet) {
       format.setComment(comment)
+    } else if (legacyDefaultUnicodeNullAsWrittenComment) {

Review Comment:
   If [univocity-parsers#518](https://github.com/uniVocity/univocity-parsers/pull/518) is merged, the code can be changed to:
   `else {
       writerSettings.setCommentProcessingEnabled(false)
   }`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415899027

   Wait, if the comment char is # then isn't quoting needed to avoid ambiguity? Or why not just set the comment char to something else on write if desired?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen closed pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen closed pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource
URL: https://github.com/apache/spark/pull/39878


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415931589

   > Wait, if the comment char is # then isn't quoting needed to avoid ambiguity? Or why not just set the comment char to something else on write if desired?
   
   @srowen Actually, the real goal is that we don't want any character to be used as a comment when writing CSV files, and just keeping output as the original. If set to another char, it would cause that the fist column of rows starting with the new char will be quoted, the problem still exists.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415982838

   That seems simpler yeah, and means these semantics are available now, in non-'legacy' uses


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on a diff in pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on code in PR #39878:
URL: https://github.com/apache/spark/pull/39878#discussion_r1095705128


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -283,6 +286,8 @@ class CSVOptions(
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
     if (isCommentSet) {
       format.setComment(comment)
+    } else if (legacyDefaultUnicodeNullAsWrittenComment) {

Review Comment:
   If [univocity-parsers#518](https://github.com/uniVocity/univocity-parsers/pull/518) is merged, the code can be like:
   `else {
       writerSettings.setQuoteCommentStartingFirstColumnEnabled(false)
   }`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415743405

   cc @srowen @HyukjinKwon @manuzhang @koertkuipers 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1421476632

   > Hm is there a way now to not set any comment char in univocity?
   
   As I know, there is no way to not set it currently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415952377

   I see, can we fix that instead and adopt this behavior? So I can set the comment char in Spark with similar semantics? It's just not clear why we need a different flag for this vs letting users select the comment, like Univocity does


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1415965277

   How do you think of fixing that if users set comment as '\u0000' explicitly, Spark passes it to univocity-parsers?
   If you aggree with it, I would change the related issue description and recommit some code for this pr.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1421080962

   LGTM. The behavior changes are more like bug fixes. Where someone has \u0000 in data they can pick another comment char that isn't used. Arguably we can achieve this while keeping the special case behavior for \u0000 for setCommentChar, but I actually like not special casing this. Hm is there a way now to not set any comment char in univocity?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wayneguow commented on a diff in pull request #39878: [SPARK-42335][SQL] Add a legacy config for restoring written comment option behavior in CSV dataSource

Posted by "wayneguow (via GitHub)" <gi...@apache.org>.
wayneguow commented on code in PR #39878:
URL: https://github.com/apache/spark/pull/39878#discussion_r1095705128


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -283,6 +286,8 @@ class CSVOptions(
     charToEscapeQuoteEscaping.foreach(format.setCharToEscapeQuoteEscaping)
     if (isCommentSet) {
       format.setComment(comment)
+    } else if (legacyDefaultUnicodeNullAsWrittenComment) {

Review Comment:
   If [univocity-parsers#518](https://github.com/uniVocity/univocity-parsers/pull/518) is merged, the code can be changed to:
   `else {
       writerSettings.setQuoteCommentStartingFirstColumnEnabled(false)
   }`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org