Posted to user@spark.apache.org by Perttu Ranta-aho <ra...@iki.fi> on 2016/11/10 19:14:02 UTC
UDF with column value comparison fails with PySpark
Hello,
I want to create a UDF that modifies one column's value depending on the value
of another column, but the Python version of the code always fails at the
column value comparison. Below are simple examples: the Scala version works as
expected, but the Python version throws an exception. Am I missing something
obvious? As can be seen from the PySpark exception, I'm using Spark 2.0.1.
-Perttu
import org.apache.spark.sql.functions.udf
val df = spark.createDataFrame(List(("a", 1), ("b", 2), ("c", 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
def myUdf = udf((name: String, value: Int) => { if (name == "c") { value * 2 } else { value } })
df.withColumn("udf", myUdf(df("name"), df("value"))).show()
+----+-----+---+
|name|value|udf|
+----+-----+---+
|   a|    1|  1|
|   b|    2|  2|
|   c|    3|  6|
+----+-----+---+
from pyspark.sql.types import StringType, IntegerType
import pyspark.sql.functions as F

df = sqlContext.createDataFrame((('a', 1), ('b', 2), ('c', 3)), ('name', 'value'))

def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value

F.udf(my_udf, IntegerType())
df.withColumn("udf", my_udf(df.name, df.value)).show()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-10032e941fc4> in <module>()
----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()

<ipython-input-5-c103a6066373> in my_udf(name, value)
      3
      4 def my_udf(name, value):
----> 5     if name == 'c':
      6         return value * 2
      7     return value

/home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
    425
    426     def __nonzero__(self):
--> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    428                          "'~' for 'not' when building DataFrame boolean expressions.")
    429     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Re: UDF with column value comparison fails with PySpark
Posted by Perttu Ranta-aho <ra...@iki.fi>.
So it was something obvious, thanks!
-Perttu
On Thu, Nov 10, 2016 at 21:19, Davies Liu <da...@databricks.com> wrote:
> On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ra...@iki.fi> wrote:
> > def my_udf(name, value):
> >     if name == 'c':
> >         return value * 2
> >     return value
> > F.udf(my_udf, IntegerType())
>
> udf = F.udf(my_udf, IntegerType())
> df.withColumn("udf", udf(df.name, df.value)).show()
>
> >
> > df.withColumn("udf", my_udf(df.name, df.value)).show()
Re: UDF with column value comparison fails with PySpark
Posted by Davies Liu <da...@databricks.com>.
On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ra...@iki.fi> wrote:
> def my_udf(name, value):
>     if name == 'c':
>         return value * 2
>     return value
> F.udf(my_udf, IntegerType())
udf = F.udf(my_udf, IntegerType())
df.withColumn("udf", udf(df.name, df.value)).show()
>
> df.withColumn("udf", my_udf(df.name, df.value)).show()
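
The fix in the reply works because F.udf does not modify my_udf; it returns a new callable, and only that callable accepts columns. Calling my_udf(df.name, df.value) directly runs the Python body with Column objects, where name == 'c' builds an expression rather than a bool, hence the ValueError. A minimal pure-Python sketch of that mechanism follows. ToyColumn and toy_udf are hypothetical stand-ins invented for illustration; they are not PySpark classes and do not reflect how PySpark is implemented internally.

```python
# Toy model only: ToyColumn and toy_udf are hypothetical stand-ins,
# not PySpark classes or functions.

class ToyColumn:
    """Names a column; holds no row data, like a DataFrame column reference."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Comparison builds another expression object instead of a bool.
        return ToyColumn("({} = {!r})".format(self.name, other))

    def __bool__(self):
        # An expression has no single truth value, so `if` cannot use it.
        raise ValueError("Cannot convert column into bool")


def toy_udf(func):
    """Return a wrapper that defers func until real row values exist."""
    def wrapper(*cols):
        # Calling the wrapper yields a per-row evaluator, not a result.
        return lambda row: func(*(row[c.name] for c in cols))
    return wrapper


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


rows = [
    {"name": "a", "value": 1},
    {"name": "b", "value": 2},
    {"name": "c", "value": 3},
]
name_col, value_col = ToyColumn("name"), ToyColumn("value")

# Direct call: my_udf receives column references, so `if` raises.
try:
    my_udf(name_col, value_col)
    direct_error = None
except ValueError as exc:
    direct_error = str(exc)

# Wrapped call: my_udf runs once per row with plain str and int values.
evaluate = toy_udf(my_udf)(name_col, value_col)
results = [evaluate(row) for row in rows]

print(direct_error)  # Cannot convert column into bool
print(results)       # [1, 2, 6]
```

In the original PySpark snippet the return value of F.udf(my_udf, IntegerType()) was discarded, so the later call still went to the plain function. Assigning the result to a name and calling that name, as in the reply, is what lets the function body see ordinary Python values.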