Posted to reviews@spark.apache.org by "holdenk (via GitHub)" <gi...@apache.org> on 2023/07/08 23:42:32 UTC

[GitHub] [spark] holdenk commented on a diff in pull request #39907: [SPARK-42359][SQL] Support row skipping when reading CSV files

holdenk commented on code in PR #39907:
URL: https://github.com/apache/spark/pull/39907#discussion_r1257386798


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -378,12 +378,15 @@ private[sql] object UnivocityParser {
   def tokenizeStream(
       inputStream: InputStream,
       shouldDropHeader: Boolean,
+      skipLines: Int,
       tokenizer: CsvParser,
       encoding: String): Iterator[Array[String]] = {
+    val handleSkipLines: () => Unit =
+      () => 1.to(skipLines).foreach(_ => tokenizer.parseNext())

Review Comment:
   What's the behaviour when skipLines is greater than the number of lines in the input file?
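
   To make the question concrete, here is a minimal sketch of a skip loop that stops at end of input instead of always calling `parseNext()` `skipLines` times. It is not the PR's code: `FakeTokenizer` and `skipBounded` are hypothetical names, and the stand-in assumes the tokenizer signals end of input by returning `null` from `parseNext()`.

```scala
// Hypothetical stand-in for the tokenizer (assumption: parseNext()
// returns null once the input is exhausted).
final class FakeTokenizer(rows: Seq[Array[String]]) {
  private val it = rows.iterator
  def parseNext(): Array[String] = if (it.hasNext) it.next() else null
}

object SkipLinesSketch {
  // Skips at most `skipLines` records and stops early at end of input.
  // Returns the number of records actually skipped.
  def skipBounded(tokenizer: FakeTokenizer, skipLines: Int): Int = {
    var skipped = 0
    while (skipped < skipLines && tokenizer.parseNext() != null) {
      skipped += 1
    }
    skipped
  }
}
```

   With a two-record input and `skipLines = 5`, this sketch skips both records and returns 2; whether the reader should then yield an empty result or raise an error is the open question.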



##########
docs/sql-data-sources-csv.md:
##########
@@ -102,6 +102,12 @@ Data source options of CSV can be set via:
     <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
+  <tr>
+    <td><code>skipLines</code></td>
+    <td>0</td>
+    <td>Sets the number of non-empty, uncommented lines to skip before parsing CSV files. If the <code>header</code> option is set to <code>true</code>, the first line after the number of <code>skipLines</code> will be taken as the header.</td>
+    <td>read</td>
+  </tr>

Review Comment:
   Does skipLines apply before or after empty and commented lines are filtered out? For example, if there are 10 empty lines at the top of partition 1, what is the behaviour?
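
   The two possible orderings give different results, as this small sketch on plain Scala collections shows. It is only an illustration, not the PR's implementation: `isData`, `skipThenFilter` and `filterThenSkip` are hypothetical helpers, and `#` is assumed as the comment marker.

```scala
object SkipOrderSketch {
  // Hypothetical filter: keeps lines that are neither blank nor comments
  // (assumes '#' as the comment marker).
  def isData(line: String): Boolean = {
    val trimmed = line.trim
    trimmed.nonEmpty && !trimmed.startsWith("#")
  }

  // Option A: skip raw lines first, then drop empty/commented lines.
  def skipThenFilter(lines: Seq[String], skipLines: Int): Seq[String] =
    lines.drop(skipLines).filter(isData)

  // Option B: drop empty/commented lines first, then skip.
  def filterThenSkip(lines: Seq[String], skipLines: Int): Seq[String] =
    lines.filter(isData).drop(skipLines)
}
```

   For the input `Seq("", "", "# note", "preamble", "id,name", "1,a")` with `skipLines = 1`, option A keeps `preamble` as the first line while option B starts at `id,name`. The documented wording ("non-empty, uncommented lines to skip") reads closest to option B, but it would be good to confirm this, ideally with a test.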



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org