Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/09/27 17:54:21 UTC

[jira] [Commented] (SPARK-17681) Empty DataFrame with non-zero rows after using drop

    [ https://issues.apache.org/jira/browse/SPARK-17681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526929#comment-15526929 ] 

Josh Rosen commented on SPARK-17681:
------------------------------------

I don't think that the current behavior is wrong. If {{drop()}} behaved as you suggest then I think we would have some weird anomalies when both adding and dropping columns. For instance, the following two examples currently return equivalent DataFrames:

{code}
scala> val df = Seq((1,2)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.drop("a").drop("b").withColumn("newCol", expr("1")).show()
+------+
|newCol|
+------+
|     1|
+------+

scala> df.withColumn("newCol", expr("1")).drop("a").drop("b").show()
+------+
|newCol|
+------+
|     1|
+------+
{code}

Under your suggested semantics, the first DataFrame would become empty after dropping both columns (collecting it would return zero rows). That would mean either that the two results differ depending on the order of the {{drop}} and {{withColumn}} calls, or that the {{withColumn}} call takes a DataFrame with zero rows and increases the number of rows, which doesn't make sense.

If dropping a column doesn't change the number of rows when going from 2 columns to 1, then for consistency it should also not affect the number of rows when going from 1 column to none.
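
To make the consistency point concrete, here is a minimal spark-shell sketch (same {{df}} as in the example above; the row count stays 1 at every step):

{code}
scala> val df = Seq((1,2)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.drop("a").count  // 2 columns -> 1 column: still 1 row
res0: Long = 1

scala> df.drop("a").drop("b").count  // 1 column -> 0 columns: still 1 row
res1: Long = 1
{code}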

Therefore, I'm inclined to say that this is not an issue, but I'm curious to hear if you have a rationale for why this should behave differently.

> Empty DataFrame with non-zero rows after using drop
> ---------------------------------------------------
>
>                 Key: SPARK-17681
>                 URL: https://issues.apache.org/jira/browse/SPARK-17681
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.6.0, 2.0.0
>            Reporter: Ian Hellstrom
>
> It is possible for a {{DataFrame}} with no columns to report a non-zero number of rows, even though its contents are empty:
> {code}
> val df = Seq((1,2)).toDF("a", "b")
> df.drop("a").drop("b").count
> {code}
> The problem is also present in 2.0.0:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> val conf = new SparkConf()
> val sc = new SparkContext("local", "demo", conf)
> val ss = SparkSession.builder.getOrCreate()
> import ss.implicits._
> case class Data(a: Int, b: Int)
> val rdd = sc.parallelize(List(Data(1,2)))
> val ds = ss.createDataset(rdd)
> ds.drop("a").drop("b").count
> {code}
> In both the pre-2.0 and 2.0 releases, the returned count is 1 instead of 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
