You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/24 10:08:30 UTC

[GitHub] [spark] izchen opened a new pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

izchen opened a new pull request #29862:
URL: https://github.com/apache/spark/pull/29862


   ### What changes were proposed in this pull request?
   In [SPARK-16896](https://issues.apache.org/jira/browse/SPARK-16896), generate some new column headers to replace duplicate column headers or empty string column headers in the CSV DataSource.
   
   In this PR, when the newly generated column header is duplicated with the existing column header, a new column header is generated again using the method in SPARK-16896.
   
   
   ### Why are the changes needed?
   When the CSV data source has duplicate column headers, Spark will generate some new column headers based on the original column headers with the index as a suffix.
   
   When the newly generated column header is duplicated with the existing column header, Spark will throw an exception message that is difficult for users to understand.
   
   For example, the CSV column header is `a,a,a,a1`.
   
   > AnalysisException: Found duplicate column(s) in the data schema: a1
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a unit test case


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494215490



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Thanks for your review.
   
   R uses`.` as the delimiter and a non-repeated increasing number as the suffix.
   For example, the header is `a, a, a, a, a.2`
   ```R
   > read.csv("x.csv", header = TRUE, sep = ",")
   [1] a   a.1 a.3 a.4 a.2
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Current behavior of Spark and R for CSV headers:
   
   | CSV             | SPARK             | R               |
   | --------------- | ----------------- | --------------- |
   | `a,a,a,a`       | `a0,a1,a2,a3`     | `a,a.1,a.2,a.3` |
   | `a,,,`          | `a,_c1,_c2,_c3`   | `a,X,X.1,X.2`   |
   | *header: false* | `_c0,_c1,_c2,_c3` | `V1,V2,V3,V4`   |
   
   If we follow R's behavior, we will introduce a user-facing change. This may cause errors in the user's legacy code. 
   Maybe we should keep the behavior of Spark.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251280


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #29862:
URL: https://github.com/apache/spark/pull/29862


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494649445



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -76,7 +76,7 @@ object CSVUtils {
         headerNames.diff(headerNames.distinct).distinct
       }
 
-      row.zipWithIndex.map { case (value, index) =>
+      val header = row.zipWithIndex.map { case (value, index) =>

Review comment:
       I think we can modify the logic here to make sure we generate non-duplicate header in just one run?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-753214532


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251280






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r526722167



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Before that, spark added column index numbers as suffixes for duplicate column names, instead of adding incremental numbers as suffixes.
   In my opinion, following R's behavior here may introduce an unnecessary user-facing change.
   E.g:
   `a0, b, c, a, d, a`
   Follow the original behavior of Spark will be replaced with `a0, b, c, a3, d, a4`
   The behavior that follows R will be replaced with `a0, b, c, a1, d, a2`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251280






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698252565


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-817221835


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494215490



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Thanks for your review.
   
   R uses`.` as the delimiter and a non-repeated increasing number as the suffix.
   For example, the header is `a, a, a, a, a.2`
   ```R
   > read.csv("x.csv", header = TRUE, sep = ",")
   [1] a   a.1 a.3 a.4 a.2
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r526722958



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -76,7 +76,7 @@ object CSVUtils {
         headerNames.diff(headerNames.distinct).distinct
       }
 
-      row.zipWithIndex.map { case (value, index) =>
+      val header = row.zipWithIndex.map { case (value, index) =>

Review comment:
       Thanks for your review, I modified the code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698252565


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen removed a comment on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen removed a comment on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698252565


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494200005



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can you check how R's `read_csv` works in this case? That patch was inspired by R's one.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can you check how R's `read_csv` works in this case? This behaviour was inspired by R's one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251800


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251800


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494649445



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -76,7 +76,7 @@ object CSVUtils {
         headerNames.diff(headerNames.distinct).distinct
       }
 
-      row.zipWithIndex.map { case (value, index) =>
+      val header = row.zipWithIndex.map { case (value, index) =>

Review comment:
       I think we can modify the logic here to make sure we generate non-duplicate header in just one run?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29862:
URL: https://github.com/apache/spark/pull/29862#issuecomment-698251280


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494200005



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can you check how R's `read_csv` works in this case? That patch was inspired by R's one.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can you check how R's `read_csv` works in this case? This behaviour was inspired by R's one.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can we follow this behaviour?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       I mean the numbering. Can. we create a name like `a1 a3 a4 a2` for `a, a, a, a, a.2`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r526726981



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -68,31 +70,33 @@ object CSVUtils {
       caseSensitive: Boolean,
       options: CSVOptions): Array[String] = {
     if (options.headerFlag) {
-      val duplicates = {
-        val headerNames = row.filter(_ != null)
-          // scalastyle:off caselocale
-          .map(name => if (caseSensitive) name else name.toLowerCase)
-        // scalastyle:on caselocale
-        headerNames.diff(headerNames.distinct).distinct
-      }
-
-      row.zipWithIndex.map { case (value, index) =>
+      val header = row.zipWithIndex.map { case (value, index) =>
         if (value == null || value.isEmpty || value == options.nullValue) {
           // When there are empty strings or the values set in `nullValue`, put the
           // index as the suffix.
           s"_c$index"
-          // scalastyle:off caselocale
-        } else if (!caseSensitive && duplicates.contains(value.toLowerCase)) {
-          // scalastyle:on caselocale
-          // When there are case-insensitive duplicates, put the index as the suffix.
-          s"$value$index"
-        } else if (duplicates.contains(value)) {
-          // When there are duplicates, put the index as the suffix.
-          s"$value$index"
         } else {
           value
         }
       }
+      val duplicatesSet = {
+        // scalastyle:off caselocale
+        val headerNames = header.map(name => if (caseSensitive) name else name.toLowerCase)
+        // scalastyle:on caselocale
+        headerNames.diff(headerNames.distinct)

Review comment:
       The initial value of Set here is to keep the user-facing behavior(from SPARK-16896) unchanged.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] izchen commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

izchen commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494817447



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Current behavior of Spark and R for CSV headers:
   
   | CSV             | SPARK             | R               |
   | --------------- | ----------------- | --------------- |
   | `a,a,a,a`       | `a0,a1,a2,a3`     | `a,a.1,a.2,a.3` |
   | `a,,,`          | `a,_c1,_c2,_c3`   | `a,X,X.1,X.2`   |
   | *header: false* | `_c0,_c1,_c2,_c3` | `V1,V2,V3,V4`   |
   
   If we follow R's behavior, we will introduce a user-facing change. This may cause errors in the user's legacy code. 
   Maybe we should keep the behavior of Spark.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494654857



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       Can we follow this behaviour?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29862:
URL: https://github.com/apache/spark/pull/29862#discussion_r494861000



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala
##########
@@ -93,6 +93,12 @@ object CSVUtils {
           value
         }
       }
+      if (header.sameElements(row)) {
+        header
+      } else {
+        // Ensure that the newly generated and existing headers are not duplicated.
+        makeSafeHeader(header, caseSensitive, options)
+      }

Review comment:
       I mean the numbering. Can. we create a name like `a1 a3 a4 a2` for `a, a, a, a, a.2`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org