You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2018/10/08 18:57:09 UTC

CSV parser - is there a way to find malformed csv record

I am getting `RuntimeException: Malformed CSV record` while parsing csv
record and attaching schema at same time. Most likely there are additional
commas or json data in some field which are not escaped properly. Is there
a way CSV parser tells me which record is malformed?


This is what I am using:

    val df2 = sparkSession.read
      .option("inferSchema", true)
      .option("multiLine", true)
      .schema(headerDF.schema) // this only works without column mismatch
      .csv(dataPath)

Thanks

-- 


 <http://www.xactlycorp.com/email-click/>

 
<https://www.instagram.com/xactlycorp/>   
<https://www.linkedin.com/company/xactly-corporation>   
<https://twitter.com/Xactly>   <https://www.facebook.com/XactlyCorp>   
<http://www.youtube.com/xactlycorporation>

Re: CSV parser - is there a way to find malformed csv record

Posted by Shuporno Choudhury <sh...@gmail.com>.

Hi,
There is a way to way obtain these malformed/rejected records. Rejection
can happen not only because of column number mismatch but also if the data
type of the data does not match the data type mentioned in the schema.
To obtain the rejected records, you can do the following:
1. Add an extra column (eg: CorruptRecCol) to your schema of type
StringType()
2. In the datadrame reader, add the *mode* 'PERMISSIVE' while
simultaneously adding the column CorruptRecCol to
*columnNameOfCorruptRecord*
3. The column CorruptRecCol will contain the complete record if it is
malformed/corrupted. On the other hand, it will be null if the record is
valid. So you can use a filter (CorruptRecCol is NULL) to obtain the
malformed/corrupted record.
You can use any column name to contain the invalid records. I have used
*CorruptRecCol* just for example.
http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
This example is for pyspark. Similar example will exist for Java/Scala also.
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/DataFrameReader.html

On Tue, 9 Oct 2018 at 00:27, Nirav Patel [via Apache Spark User List] <
ml+s1001560n33643h19@n3.nabble.com> wrote:

> I am getting `RuntimeException: Malformed CSV record` while parsing csv
> record and attaching schema at same time. Most likely there are additional
> commas or json data in some field which are not escaped properly. Is there
> a way CSV parser tells me which record is malformed?
>
>
> This is what I am using:
>
>     val df2 = sparkSession.read
>       .option("inferSchema", true)
>       .option("multiLine", true)
>       .schema(headerDF.schema) // this only works without column mismatch
>       .csv(dataPath)
>
> Thanks
>
>
>
> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>
> <https://www.instagram.com/xactlycorp/>
> <https://www.linkedin.com/company/xactly-corporation>
> <https://twitter.com/Xactly>   <https://www.facebook.com/XactlyCorp>
> <http://www.youtube.com/xactlycorporation>
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/CSV-parser-is-there-a-way-to-find-malformed-csv-record-tp33643.html
> To start a new topic under Apache Spark User List, email
> ml+s1001560n1h17@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=c2h1cG9ybm8uY2hvdWRodXJ5QGdtYWlsLmNvbXwxfC0xODI0MTU0MzQ0>
> .
> NAML
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>

-- 
--Thanks,
Shuporno Choudhury

Re: CSV parser - is there a way to find malformed csv record

Posted by Nirav Patel <np...@xactlycorp.com>.

Thanks Shuporno . That mode worked. I found out couple records having
quotes inside quotes which needed to be escaped.



On Tue, Oct 9, 2018 at 1:40 PM Taylor Cox <Ta...@microsoft.com> wrote:

> Hey Nirav,
>
>
>
> Here’s an idea:
>
>
>
> Suppose your file.csv has N records, one for each line. Read the csv
> line-by-line (without spark) and attempt to parse each line. If a record is
> malformed, catch the exception and rethrow it with the line number. That
> should show you where the problematic record(s) can be found.
>
>
>
> *From:* Nirav Patel <np...@xactlycorp.com>
> *Sent:* Monday, October 8, 2018 11:57 AM
> *To:* spark users <us...@spark.apache.org>
> *Subject:* CSV parser - is there a way to find malformed csv record
>
>
>
> I am getting `RuntimeException: Malformed CSV record` while parsing csv
> record and attaching schema at same time. Most likely there are additional
> commas or json data in some field which are not escaped properly. Is there
> a way CSV parser tells me which record is malformed?
>
>
>
>
>
> This is what I am using:
>
>
>
>     val df2 = sparkSession.read
>
>       .option("inferSchema", true)
>
>       .option("multiLine", true)
>
>       .schema(headerDF.schema) // this only works without column mismatch
>
>       .csv(dataPath)
>
>
>
> Thanks
>
>
>
>
> [image: Image removed by sender. What's New with Xactly]
> <https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.xactlycorp.com%2Femail-click%2F&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498409102&sdata=Q648xF6kZthiaWtDpXXsy3jSnKT%2FYVF7DFKSp9Mahtk%3D&reserved=0>
>
> [image: Image removed by sender.]
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.instagram.com%2Fxactlycorp%2F&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498419112&sdata=Rz6ft6lLLRJ9FJVtRMSlKfpKZriwi1yQiiOix0P3PiM%3D&reserved=0>
>   [image: Image removed by sender.]
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fxactly-corporation&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498419112&sdata=htCoZq07XYbOkkB%2Fojwpo4FMTT32LvMsq0%2F8vdp4cD0%3D&reserved=0>
>   [image: Image removed by sender.]
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2FXactly&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498429117&sdata=dx4hY7uwBbthUahdZ%2FlsWPaWBvsBS6zskgOfZj%2BBHCY%3D&reserved=0>
>   [image: Image removed by sender.]
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.facebook.com%2FXactlyCorp&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498429117&sdata=KohVt7EXC9P5GiwKKGUMXxvM507o4ZnNozXofMxvn78%3D&reserved=0>
>   [image: Image removed by sender.]
> <https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.youtube.com%2Fxactlycorporation&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498439126&sdata=wphFwmIuci%2BZlrdWYmRdaSOvynU48UmAs0xEFI2BRh0%3D&reserved=0>
>

-- 


 <http://www.xactlycorp.com/email-click/>

 
<https://www.instagram.com/xactlycorp/>   
<https://www.linkedin.com/company/xactly-corporation>   
<https://twitter.com/Xactly>   <https://www.facebook.com/XactlyCorp>   
<http://www.youtube.com/xactlycorporation>

RE: CSV parser - is there a way to find malformed csv record

Posted by Taylor Cox <Ta...@microsoft.com.INVALID>.

Hey Nirav,

Here’s an idea:

Suppose your file.csv has N records, one for each line. Read the csv line-by-line (without spark) and attempt to parse each line. If a record is malformed, catch the exception and rethrow it with the line number. That should show you where the problematic record(s) can be found.

From: Nirav Patel <np...@xactlycorp.com>
Sent: Monday, October 8, 2018 11:57 AM
To: spark users <us...@spark.apache.org>
Subject: CSV parser - is there a way to find malformed csv record

I am getting `RuntimeException: Malformed CSV record` while parsing csv record and attaching schema at same time. Most likely there are additional commas or json data in some field which are not escaped properly. Is there a way CSV parser tells me which record is malformed?

This is what I am using:

    val df2 = sparkSession.read
      .option("inferSchema", true)
      .option("multiLine", true)
      .schema(headerDF.schema) // this only works without column mismatch
      .csv(dataPath)

Thanks

[Image removed by sender. What's New with Xactly]<https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.xactlycorp.com%2Femail-click%2F&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498409102&sdata=Q648xF6kZthiaWtDpXXsy3jSnKT%2FYVF7DFKSp9Mahtk%3D&reserved=0>

[Image removed by sender.]<https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.instagram.com%2Fxactlycorp%2F&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498419112&sdata=Rz6ft6lLLRJ9FJVtRMSlKfpKZriwi1yQiiOix0P3PiM%3D&reserved=0>  [Image removed by sender.] <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fxactly-corporation&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498419112&sdata=htCoZq07XYbOkkB%2Fojwpo4FMTT32LvMsq0%2F8vdp4cD0%3D&reserved=0>   [Image removed by sender.] <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2FXactly&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498429117&sdata=dx4hY7uwBbthUahdZ%2FlsWPaWBvsBS6zskgOfZj%2BBHCY%3D&reserved=0>   [Image removed by sender.] <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.facebook.com%2FXactlyCorp&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498429117&sdata=KohVt7EXC9P5GiwKKGUMXxvM507o4ZnNozXofMxvn78%3D&reserved=0>   [Image removed by sender.] <https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.youtube.com%2Fxactlycorporation&data=02%7C01%7CTaylor.Cox%40microsoft.com%7C99917500d9d546c8bef308d62d4fe469%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636746218498439126&sdata=wphFwmIuci%2BZlrdWYmRdaSOvynU48UmAs0xEFI2BRh0%3D&reserved=0>