Posted to commits@spark.apache.org by gu...@apache.org on 2018/05/24 14:19:18 UTC
spark git commit: [SPARK-24329][SQL] Test for skipping multi-space lines

Repository: spark
Updated Branches:
  refs/heads/master 3469f5c98 -> 13bedc05c


[SPARK-24329][SQL] Test for skipping multi-space lines

## What changes were proposed in this pull request?

This PR is a continuation of https://github.com/apache/spark/pull/21380. It tests the cases handled by the following code:
https://github.com/apache/spark/blob/e3de6ab30d52890eb08578e55eb4a5d2b4e7aa35/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L303-L304

In short, the code skips lines that are empty or contain only whitespace, as well as comment lines (see [filterCommentAndEmpty](https://github.com/apache/spark/blob/e3de6ab30d52890eb08578e55eb4a5d2b4e7aa35/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala#L47)):

```scala
iter.filter { line =>
  line.trim.nonEmpty && !line.startsWith(options.comment.toString)
}
```
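
To see what this predicate keeps, here is a minimal standalone sketch that applies the same condition to the lines of the new test file. It runs without Spark. The object name `FilterSketch` and the `comment` parameter (standing in for `options.comment`) are assumptions made for illustration, not part of the patch.

```scala
// Standalone sketch, no Spark required.
// `FilterSketch` and the `comment` parameter are hypothetical names;
// `comment` stands in for options.comment from the snippet above.
object FilterSketch {
  // Keep a line only if it has non-whitespace content and does not
  // start with the comment character.
  def filterCommentAndEmpty(lines: Iterator[String], comment: Char): Iterator[String] =
    lines.filter { line =>
      line.trim.nonEmpty && !line.startsWith(comment.toString)
    }

  def main(args: Array[String]): Unit = {
    // The eight lines of comments-whitespaces.csv from this commit.
    val input = Seq(
      "# The file contains comments, whitespaces and empty lines",
      "colA",
      "# empty line",
      "",
      "# the line with a few whitespaces",
      "   ",
      "# int value with leading and trailing whitespaces",
      " \"a\" "
    )
    val kept = filterCommentAndEmpty(input.iterator, '#').toList
    // Brackets make the preserved leading and trailing spaces visible.
    kept.foreach(line => println(s"[$line]"))
  }
}
```

Only the header line `colA` and the data line ` "a" ` pass the filter. The data line keeps its surrounding spaces because the predicate trims only for the emptiness check and never modifies the line itself.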

Closes #21380

## How was this patch tested?

Added a test for the case described above.

Author: Maxim Gekk <ma...@databricks.com>
Author: Maxim Gekk <ma...@gmail.com>

Closes #21394 from MaxGekk/test-for-multi-space-lines.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13bedc05
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13bedc05
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13bedc05

Branch: refs/heads/master
Commit: 13bedc05c28fcc6e739fb472bd2ee3035fa11648
Parents: 3469f5c
Author: Maxim Gekk <ma...@databricks.com>
Authored: Thu May 24 22:18:58 2018 +0800
Committer: hyukjinkwon <gu...@apache.org>
Committed: Thu May 24 22:18:58 2018 +0800

----------------------------------------------------------------------
 .../resources/test-data/comments-whitespaces.csv     |  8 ++++++++
 .../sql/execution/datasources/csv/CSVSuite.scala     | 15 +++++++++++++++
 2 files changed, 23 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/13bedc05/sql/core/src/test/resources/test-data/comments-whitespaces.csv
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/test-data/comments-whitespaces.csv b/sql/core/src/test/resources/test-data/comments-whitespaces.csv
new file mode 100644
index 0000000..2737978
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/comments-whitespaces.csv
@@ -0,0 +1,8 @@
+# The file contains comments, whitespaces and empty lines
+colA
+# empty line
+
+# the line with a few whitespaces
+   
+# int value with leading and trailing whitespaces
+ "a" 

http://git-wip-us.apache.org/repos/asf/spark/blob/13bedc05/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 07e6c74..2bac1a3 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -1368,4 +1368,19 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
       checkAnswer(computed, expected)
     }
   }
+
+  test("SPARK-24329: skip lines with comments, and one or multiple whitespaces") {
+    val schema = new StructType().add("colA", StringType)
+    val ds = spark
+      .read
+      .schema(schema)
+      .option("multiLine", false)
+      .option("header", true)
+      .option("comment", "#")
+      .option("ignoreLeadingWhiteSpace", false)
+      .option("ignoreTrailingWhiteSpace", false)
+      .csv(testFile("test-data/comments-whitespaces.csv"))
+
+    checkAnswer(ds, Seq(Row(""" "a" """)))
+  }
 }

