Posted to user@spark.apache.org by "Young, Matthew T" <ma...@intel.com> on 2015/07/23 00:04:52 UTC

Issue with column named "count" in a DataFrame

I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0, and I have encountered behavior that looks like a bug.

When I try to filter rows of a DataFrame on a column named count, I get a large error message. I would just avoid naming columns count, except that count is the default column name produced by the count() operation on pyspark.sql.GroupedData.

The small example program below demonstrates the issue.

from pyspark.sql import SQLContext
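# sc is the SparkContext that the notebook/PySpark shell already provides;
# the SQLContext is needed for the DataFrame machinery (including toDF) below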
sqlContext = SQLContext(sc)
dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
counts = dataFrame.groupBy('title').count()
counts.filter("title = 'foo'").show() # Works
counts.filter("count > 1").show()     # Errors out


I can even reproduce the issue in a PySpark shell session by entering these commands.

I suspect the error has something to do with Spark wanting to call the count() function in place of reading the count column; in the parse failure below, the parser expects an opening parenthesis right where the > appears.

The error message is as follows:


Py4JJavaError                             Traceback (most recent call last)
<ipython-input-29-62a1b7c71f21> in <module>()
----> 1 counts.filter("count > 1").show() # Errors Out

C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
    774         """
    775         if isinstance(condition, basestring):
--> 776             jdf = self._jdf.filter(condition)
    777         elif isinstance(condition, Column):
    778             jdf = self._jdf.filter(condition._jc)

C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o229.filter.
: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
        at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)



Is there a recommended workaround to the inability to filter on a column named count? Do I have to make a new DataFrame and rename the column just to work around this bug? What's the best way to do that?
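If renaming is the answer, I imagine it looks something like this untested sketch (assuming DataFrame.withColumnRenamed behaves as documented):

renamed = counts.withColumnRenamed("count", "num_rows")  # new DataFrame with the column renamed
renamed.filter("num_rows > 1").show()                    # no reserved word, so the parser accepts it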

Thanks,

-- Matthew Young

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


RE: Issue with column named "count" in a DataFrame

Posted by "Young, Matthew T" <ma...@intel.com>.
Thanks, Michael. Using backticks resolves the issue.
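
For the archives, the working call against the counts DataFrame from my example:

counts.filter("`count` > 1").show()  # backticks make the parser read count as an identifier, not a function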

Shouldn't this fix also go into Spark 1.4.2, or at least be noted as a known limitation in the documentation?




Re: Issue with column named "count" in a DataFrame

Posted by Michael Armbrust <mi...@databricks.com>.
Additionally, have you tried enclosing count in `backticks`?


Re: Issue with column named "count" in a DataFrame

Posted by Michael Armbrust <mi...@databricks.com>.
I believe this will be fixed in Spark 1.5:

https://github.com/apache/spark/pull/7237
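
Until then, passing a Column expression instead of a SQL string sidesteps the string parser entirely. A minimal sketch, assuming the counts DataFrame from the original message:

counts.filter(counts["count"] > 1).show()  # builds a Column expression, so SqlParser is never invoked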
