Posted to issues@spark.apache.org by "Lakshminarayan Kamath (JIRA)" <ji...@apache.org> on 2018/10/30 22:20:00 UTC

[jira] [Updated] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

     [ https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lakshminarayan Kamath updated SPARK-25890:
------------------------------------------
    Description: 
Reading a Ctrl-A delimited CSV file ignores rows in which every value is null; a comma-delimited CSV file does not.

*Reproduction in spark-shell:*

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val l = List(List(1, 2), List(null, null), List(2, 3))
val datasetSchema = StructType(List(StructField("colA", IntegerType, true), StructField("colB", IntegerType, true)))
val rdd = sc.parallelize(l).map(item => Row.fromSeq(item.toSeq))
val df = spark.createDataFrame(rdd, datasetSchema)

df.show()

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|null|null|
|   2|   3|
+----+----+

df.write.option("delimiter", "\u0001").option("header", "true").csv("/ctrl-a-separated.csv")
df.write.option("delimiter", ",").option("header", "true").csv("/comma-separated.csv")

val commaDf = spark.read.option("header", "true").option("delimiter", ",").csv("/comma-separated.csv")
commaDf.show

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
|null|null|
+----+----+

val ctrlaDf = spark.read.option("header", "true").option("delimiter", "\u0001").csv("/ctrl-a-separated.csv")
ctrlaDf.show

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
+----+----+
 

As shown above, when the CSV is Ctrl-A delimited, rows containing only null values are dropped on read.
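A possible workaround (a sketch only, not verified against this Spark version) is to write nulls as an explicit token via Spark's documented "nullValue" CSV option, so that an all-null row is serialized as visible markers rather than a line consisting of a single Ctrl-A character. The token "\\N" and the output path are arbitrary choices for illustration:

```scala
// Sketch of a possible workaround: write null cells as an explicit token
// ("\\N" here is an arbitrary marker) instead of empty strings, so an
// all-null row no longer collapses to a bare delimiter line that the
// reader may treat as empty and skip.
df.write
  .option("delimiter", "\u0001")
  .option("header", "true")
  .option("nullValue", "\\N")
  .csv("/ctrl-a-separated-marked.csv")

// Read it back with the same token so the markers round-trip to nulls.
val markedDf = spark.read
  .option("header", "true")
  .option("delimiter", "\u0001")
  .option("nullValue", "\\N")
  .csv("/ctrl-a-separated-marked.csv")
markedDf.show
```

Whether this sidesteps the row-skipping behavior would need to be confirmed on an affected build.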

 

 

 

> Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25890
>                 URL: https://issues.apache.org/jira/browse/SPARK-25890
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 2.3.2
>            Reporter: Lakshminarayan Kamath
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org