Posted to dev@spark.apache.org by Andrew Vykhodtsev <yo...@gmail.com> on 2015/06/29 08:57:20 UTC

Dataframes filter by count fails with python API

Dear developers,

I found the following behaviour that I think is a minor bug.

If I apply groupBy and count in the Python API, the resulting DataFrame has
the grouped columns plus a field named "count". Filtering on that field with
a string expression does not work, because the parser treats "count" as a
keyword:

x = sc.parallelize(zip(xrange(1000),xrange(1000)))
df = sqlContext.createDataFrame(x)

df.groupBy("_1").count().printSchema()

root
 |-- _1: long (nullable = true)
 |-- count: long (nullable = false)


df.groupBy("_1").count().filter("count > 1")

gives

: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
	at scala.sys.package$.error(package.scala:27)



The following syntax works:

f = df.groupBy("_1").count()
n = f.filter(f["count"] > 1)

In Scala, referring to the column as $"count" works as well.

Please let me know if I should submit a JIRA for this.

Re: Dataframes filter by count fails with python API

Posted by Reynold Xin <rx...@databricks.com>.
Hi Andrew,

Thanks for the email. This is a known bug in the expression parser. We
hope to fix it in 1.5.

The expression parser still reserves a few keywords, and we have already
removed most of them. "count" remains because of how count distinct is
handled, but we plan to remove that as well.


