Posted to user@spark.apache.org by Perttu Ranta-aho <ra...@iki.fi> on 2016/11/10 19:14:02 UTC

UDF with column value comparison fails with PySpark

Hello,

I want to create a UDF which modifies one column's value depending on the
value of another column, but the Python version of the code always fails at
the column value comparison. Below are simple examples: the Scala version
works as expected, but the Python version throws an exception. Am I missing
something obvious? As can be seen from the PySpark exception, I'm using
Spark 2.0.1.

-Perttu

import org.apache.spark.sql.functions.udf
val df = spark.createDataFrame(List(("a",1), ("b",2), ("c",3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
def myUdf = udf((name: String, value: Int) => {if (name == "c") { value * 2 } else { value }})
df.withColumn("udf", myUdf(df("name"), df("value"))).show()
+----+-----+---+
|name|value|udf|
+----+-----+---+
|   a|    1|  1|
|   b|    2|  2|
|   c|    3|  6|
+----+-----+---+


from pyspark.sql.types import StringType, IntegerType
import pyspark.sql.functions as F

df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)),
('name','value'))

def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value
F.udf(my_udf, IntegerType())

df.withColumn("udf", my_udf(df.name, df.value)).show()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-10032e941fc4> in <module>()
----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()

<ipython-input-5-c103a6066373> in my_udf(name, value)
      3
      4 def my_udf(name, value):
----> 5     if name == 'c':
      6         return value * 2
      7     return value

/home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
    425
    426     def __nonzero__(self):
--> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    428                          "'~' for 'not' when building DataFrame boolean expressions.")
    429     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
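
A note on reading the traceback: the frame inside my_udf shows that the function body ran immediately, with df.name and df.value themselves as arguments, rather than once per row with plain values. The sketch below is a pure-Python illustration of that mechanism. FakeColumn is a hypothetical stand-in written for this example, not PySpark's Column class; it only mimics the two behaviours visible in the traceback (comparison builds an expression, and asking for a truth value raises).

```python
class FakeColumn(object):
    """Hypothetical stand-in for a DataFrame column (not the real PySpark class)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparing builds a new expression instead of returning True/False.
        return FakeColumn("(%s = %r)" % (self.expr, other))

    def __bool__(self):
        # An expression has no single truth value, so `if` cannot use it.
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 name for the same hook


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


# With plain Python values the function behaves as intended.
plain_result = my_udf('c', 3)

# With a column-like object, `name == 'c'` yields another FakeColumn, and
# the `if` statement then fails while asking for its truth value.
try:
    my_udf(FakeColumn('name'), FakeColumn('value'))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In other words, the comparison itself is fine; the error appears only because an ordinary Python `if` was applied to a whole column rather than to a single row's value.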

Re: UDF with column value comparison fails with PySpark

Posted by Perttu Ranta-aho <ra...@iki.fi>.
So it was something obvious, thanks!

-Perttu

On Thu, 10 November 2016 at 21:19, Davies Liu <da...@databricks.com>
wrote:

>
> udf = F.udf(my_udf, IntegerType())
> df.withColumn("udf", udf(df.name, df.value)).show()

Re: UDF with column value comparison fails with PySpark

Posted by Davies Liu <da...@databricks.com>.
On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ra...@iki.fi> wrote:
> from pyspark.sql.types import StringType, IntegerType
> import pyspark.sql.functions as F
>
> df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)),
> ('name','value'))
>
> def my_udf(name, value):
>     if name == 'c':
>         return value * 2
>     return value
> F.udf(my_udf, IntegerType())

udf = F.udf(my_udf, IntegerType())
df.withColumn("udf", udf(df.name, df.value)).show()
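
The reason the two lines above help is that F.udf does not modify my_udf in place: it returns a new callable, and that returned object is what has to be applied to the columns. The sketch below illustrates the pattern in pure Python. fake_udf is a hypothetical stand-in written for this example and is not PySpark's implementation; it only records what would be called, where the real wrapper builds a column expression that Spark evaluates row by row.

```python
def fake_udf(func, return_type):
    """Hypothetical stand-in for F.udf: returns a NEW callable, leaves func alone."""
    def wrapper(*columns):
        # The real wrapper would build a deferred column expression; here we
        # only record which function would be applied to which columns.
        return ("udf_call", func.__name__, return_type, columns)
    return wrapper


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


# Result discarded, as in the original post: my_udf is still a plain function.
fake_udf(my_udf, "int")
still_plain = my_udf('c', 3)

# Result kept, as in the fix: the wrapper is what gets applied to columns.
wrapped = fake_udf(my_udf, "int")
expr = wrapped("name_col", "value_col")
```

For a condition as simple as this one, a built-in expression such as F.when(df.name == 'c', df.value * 2).otherwise(df.value) may also do the job without a Python UDF, though that is an alternative and not what was asked.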

