Posted to user@spark.apache.org by "Young, Matthew T" <ma...@intel.com> on 2015/07/23 00:04:52 UTC
Issue with column named "count" in a DataFrame
I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0 and I have encountered behavior that looks like a bug.
When I try to filter rows of a DataFrame on a column named count I get a large error message. I would just avoid naming things count, except that count is the default column name produced by the count() operation in pyspark.sql.GroupedData.
The small example program below demonstrates the issue.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
counts = dataFrame.groupBy('title').count()
counts.filter("title = 'foo'").show() # Works
counts.filter("count > 1").show() # Errors out
I can even reproduce the issue in a PySpark shell session by entering these commands.
I suspect that the error has something to do with Spark wanting to call the count() function instead of reading the count column.
The error message is as follows:
Py4JJavaError Traceback (most recent call last)
<ipython-input-29-62a1b7c71f21> in <module>()
----> 1 counts.filter("count > 1").show() # Errors Out
C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
774 """
775 if isinstance(condition, basestring):
--> 776 jdf = self._jdf.filter(condition)
777 elif isinstance(condition, Column):
778 jdf = self._jdf.filter(condition._jc)
C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o229.filter.
: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found
count > 1
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
Is there a recommended workaround to the inability to filter on a column named count? Do I have to make a new DataFrame and rename the column just to work around this bug? What's the best way to do that?
Thanks,
-- Matthew Young
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
RE: Issue with column named "count" in a DataFrame
Posted by "Young, Matthew T" <ma...@intel.com>.
Thanks Michael, using backticks resolves the issue.
Wouldn't this fix also be something that should go into Spark 1.4.2, or at least have the limitation noted in the documentation?
Re: Issue with column named "count" in a DataFrame
Posted by Michael Armbrust <mi...@databricks.com>.
Additionally have you tried enclosing count in `backticks`?
Re: Issue with column named "count" in a DataFrame
Posted by Michael Armbrust <mi...@databricks.com>.
I believe this will be fixed in Spark 1.5
https://github.com/apache/spark/pull/7237