Posted to issues@spark.apache.org by "Thomas Kastl (JIRA)" <ji...@apache.org> on 2018/12/19 10:39:00 UTC

[jira] [Created] (SPARK-26406) Add option to skip rows when reading csv files

Thomas Kastl created SPARK-26406:
------------------------------------

             Summary: Add option to skip rows when reading csv files
                 Key: SPARK-26406
                 URL: https://issues.apache.org/jira/browse/SPARK-26406
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Thomas Kastl


Real-world CSV data can contain multiple header lines, but Spark currently does not offer any way to skip more than one header row when reading a CSV file.

Several workarounds have been proposed on Stack Overflow (manually editing each CSV file to prefix the extra rows with "#" and using the comment option, or filtering after reading), but all of them come with more or less obvious drawbacks and restrictions.
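
For illustration, here is a rough sketch of the "filtering after reading" workaround in PySpark; the file name data.csv and the assumption of two junk lines above the real header row are made up for the example:
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the file as plain text, one line per record.
lines = spark.read.text("data.csv").rdd.map(lambda row: row.value)

# Attach a line index and drop the first two junk lines, keeping the real
# header row so that header=True still works on what is left.
data_lines = (lines.zipWithIndex()
                   .filter(lambda pair: pair[1] >= 2)
                   .map(lambda pair: pair[0]))

# DataFrameReader.csv also accepts an RDD of strings in PySpark.
df = spark.read.csv(data_lines, header=True, inferSchema=True)
{code}
This works, but it forces every user to re-implement line indexing and re-parsing by hand for something that is a one-argument option in other tools.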

The option
{code:java}
header=True{code}
already treats the first row of CSV files differently, so the argument that Spark wants to be row-agnostic does not really hold here, in my opinion. A solution like pandas'
{code:java}
skiprows={code}
would be highly preferable.
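
For comparison, this is how pandas handles the same situation today; the file name and skip count are again illustrative, and the Spark line at the end is purely hypothetical since no such option exists:
{code:java}
import pandas as pd

# pandas can skip an arbitrary number of leading lines before parsing.
pdf = pd.read_csv("data.csv", skiprows=2)

# A hypothetical Spark equivalent (no such option exists at the time of
# writing) might look like:
# df = spark.read.option("header", True).option("skipRows", 2).csv("data.csv")
{code}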


