Posted to issues@spark.apache.org by "chanduhawk (Jira)" <ji...@apache.org> on 2020/08/16 17:38:00 UTC

[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

    [ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178560#comment-17178560 ] 

chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:37 PM:
--------------------------------------------------------------

*Currently, Spark cannot process a row that starts with the null character.*
If one of the rows in the data file (CSV) starts with the null (\u0000) character, like the row below (see the attached screenshot):

*null*,abc,test  (here *null* denotes the \u0000 character)

then Spark will throw the error mentioned in the description, i.e. Spark cannot process any row that starts with the null character. It can only process such a row if we set an option like the one below:

option("comment","a character")

comment - takes the character to be treated as the comment character; Spark will not process any row that starts with this character.

The above is a workaround for processing rows that start with the null character, but it is flawed: it can skip a valid row of data that happens to start with the comment character. In data warehousing we usually have no concept of comment characters, and all rows need to be processed.
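
A minimal sketch of the failure and the workaround, assuming a SparkSession named spark (as in spark-shell) and a local test file whose first row begins with \u0000; the path below is illustrative:

val path = "file:/tmp/testdata.csv"

// This fails during schema inference because \u0000 is the default
// comment character, so a first row beginning with the null character
// trips the univocity parser:
// val broken = spark.read.option("delimiter", ",").csv(path)

// Workaround: declare some other comment character, e.g. '#'.
// Flaw: any valid row that happens to begin with '#' is silently skipped.
val df = spark.read
  .option("delimiter", ",")
  .option("comment", "#")
  .csv(path)
df.show(false)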

So there should be an option in Spark that disables comment-character processing altogether, like the one below:

option("enableProcessingComments", false)

This option would disable all comment-character checking.
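
A sketch of how the proposed option might be used from the reader side; note that enableProcessingComments is the name proposed in this ticket, not an option that exists in released Spark versions:

// Hypothetical usage of the option proposed in this ticket;
// "enableProcessingComments" is NOT an option in released Spark versions.
val df = spark.read
  .option("delimiter", ",")
  .option("enableProcessingComments", false) // proposed: disable comment handling
  .csv("file:/tmp/testdata.csv")
df.show(false)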







> Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32614
>                 URL: https://issues.apache.org/jira/browse/SPARK-32614
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: chanduhawk
>            Assignee: Jeff Evans
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not have comment records, and every line needs to be treated as a valid record even if it starts with the default comment character, \u0000 (the null character). Though the user can set a comment character other than \u0000, there is a chance that an actual record starts with that character.
> Currently, for the piece of code below and the given test data, whose first row starts with the null (\u0000) character, it will throw the error below.
> *eg:*
> val df = spark.read.option("delimiter", ",").csv("file:/E:/Data/Testdata.dat");
> df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
> 	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> 	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
> 	at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
> 	at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the univocity parser, and the workaround is to provide another comment character via .option("comment", "#"), if the actual data starts with that character then the particular row will be discarded.
> I have pushed code to the univocity parser to handle this scenario as part of the PR below:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept this Jira so that we can enable this feature in spark-csv by adding a parameter to Spark's CSV options.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org