Posted to user@spark.apache.org by lk_spark <lk...@163.com> on 2017/06/16 03:51:05 UTC

how to call udf with parameters

hi, all
     I defined a UDF with multiple parameters, but I don't know how to call it with a DataFrame.
 
UDF:

def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) =>
    val terms = HanLP.segment(sentence).asScala
.....

Call :

scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                 ^
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                      ^
<console>:40: error: type mismatch;
 found   : Int(2)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                           ^

scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
+- Project [_1#2 AS id#5, _2#3 AS text#6]
   +- LocalRelation [_1#2, _2#3]


I need help!!


2017-06-16


lk_spark 

Re: Re: Re: how to call udf with parameters

Posted by lk_spark <lk...@163.com>.
thanks Kumar, that's really helpful!!


2017-06-16 

lk_spark 



From: Pralabh Kumar <pr...@gmail.com>
Sent: 2017-06-16 18:30
Subject: Re: Re: how to call udf with parameters
To: "lk_spark"<lk...@163.com>
Cc: "user.spark"<us...@spark.apache.org>

val getlength=udf((idx1:Int,idx2:Int, data : String)=> data.substring(idx1,idx2))



data.select(getlength(lit(1),lit(2),data("col1"))).collect
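Applied to the ssplit2 UDF from the original question, the same lit trick would look like the sketch below. This has not been run against the thread's data; `input` and `ssplit2` are the names from the question, and `ssplit2With` is a hypothetical variant I introduce to show the closure-based alternative:

```scala
import org.apache.spark.sql.functions.{lit, udf}

// Every argument in a UDF call must be a Column, so wrap the scalar
// flags and the minimum term length with lit():
val output = input.select(
  ssplit2($"text", lit(true), lit(true), lit(2)).as("words"))

// Alternative: pass the scalars as ordinary Scala arguments and let the
// returned UDF capture them in its closure, so the UDF itself takes only
// the text column:
def ssplit2With(delNum: Boolean, delEn: Boolean, minTermLen: Int) =
  udf { sentence: String =>
    // the segmentation logic from the question would go here; the captured
    // delNum/delEn/minTermLen values are visible inside the closure
    sentence.split(" ").filter(_.length >= minTermLen)
  }

val output2 = input.select(ssplit2With(true, true, 2)($"text").as("words"))
```

The closure version avoids shipping constant values through columns on every row; note that the `$"..."` syntax requires `import spark.implicits._` to be in scope.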



On Fri, Jun 16, 2017 at 10:22 AM, Pralabh Kumar <pr...@gmail.com> wrote:

Use lit , give me some time , I'll provide an example


On 16-Jun-2017 10:15 AM, "lk_spark" <lk...@163.com> wrote:

thanks Kumar, I want to know how to call a udf with multiple parameters. Say I make a udf for a substr function: how can I pass the begin and end index parameters? I tried it and got errors. Can udf parameters only be of Column type?

2017-06-16 

lk_spark 



From: Pralabh Kumar <pr...@gmail.com>
Sent: 2017-06-16 17:49
Subject: Re: how to call udf with parameters
To: "lk_spark"<lk...@163.com>
Cc: "user.spark"<us...@spark.apache.org>

sample UDF
val getlength=udf((data:String)=>data.length())

data.select(getlength(data("col1")))



On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:

hi, all
     I defined a UDF with multiple parameters, but I don't know how to call it with a DataFrame.

UDF:

def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) =>
    val terms = HanLP.segment(sentence).asScala
.....

Call :

scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                 ^
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                      ^
<console>:40: error: type mismatch;
 found   : Int(2)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                           ^

scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
+- Project [_1#2 AS id#5, _2#3 AS text#6]
   +- LocalRelation [_1#2, _2#3]


I need help!!


2017-06-16


lk_spark 

Re: Re: how to call udf with parameters

Posted by Pralabh Kumar <pr...@gmail.com>.
val getlength=udf((idx1:Int,idx2:Int, data : String)=> data.substring(idx1,idx2))

data.select(getlength(lit(1),lit(2),data("col1"))).collect

On Fri, Jun 16, 2017 at 10:22 AM, Pralabh Kumar <pr...@gmail.com>
wrote:

> Use lit , give me some time , I'll provide an example
>
> On 16-Jun-2017 10:15 AM, "lk_spark" <lk...@163.com> wrote:
>
>> thanks Kumar, I want to know how to call a udf with multiple parameters.
>> Say I make a udf for a substr function: how can I pass the begin and end
>> index parameters? I tried it and got errors. Can udf parameters only be
>> of Column type?
>>
>> 2017-06-16
>> ------------------------------
>> lk_spark
>> ------------------------------
>>
>> *From:* Pralabh Kumar <pr...@gmail.com>
>> *Sent:* 2017-06-16 17:49
>> *Subject:* Re: how to call udf with parameters
>> *To:* "lk_spark"<lk...@163.com>
>> *Cc:* "user.spark"<us...@spark.apache.org>
>>
>> sample UDF
>> val getlength=udf((data:String)=>data.length())
>> data.select(getlength(data("col1")))
>>
>> On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:
>>
>>> hi, all
>>>      I defined a UDF with multiple parameters, but I don't know how to
>>> call it with a DataFrame.
>>>
>>> UDF:
>>>
>>> def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean,
>>> minTermLen: Int) =>
>>>     val terms = HanLP.segment(sentence).asScala
>>> .....
>>>
>>> Call :
>>>
>>> scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
>>> <console>:40: error: type mismatch;
>>>  found   : Boolean(true)
>>>  required: org.apache.spark.sql.Column
>>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>>                                                  ^
>>> <console>:40: error: type mismatch;
>>>  found   : Boolean(true)
>>>  required: org.apache.spark.sql.Column
>>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>>                                                       ^
>>> <console>:40: error: type mismatch;
>>>  found   : Int(2)
>>>  required: org.apache.spark.sql.Column
>>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>>                                                            ^
>>>
>>> scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
>>> org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
>>> 'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
>>> +- Project [_1#2 AS id#5, _2#3 AS text#6]
>>>    +- LocalRelation [_1#2, _2#3]
>>>
>>> I need help!!
>>>
>>>
>>> 2017-06-16
>>> ------------------------------
>>> lk_spark
>>>
>>
>>

Re: Re: how to call udf with parameters

Posted by Pralabh Kumar <pr...@gmail.com>.
Use lit , give me some time , I'll provide an example

On 16-Jun-2017 10:15 AM, "lk_spark" <lk...@163.com> wrote:

> thanks Kumar, I want to know how to call a udf with multiple parameters.
> Say I make a udf for a substr function: how can I pass the begin and end
> index parameters? I tried it and got errors. Can udf parameters only be
> of Column type?
>
> 2017-06-16
> ------------------------------
> lk_spark
> ------------------------------
>
> *From:* Pralabh Kumar <pr...@gmail.com>
> *Sent:* 2017-06-16 17:49
> *Subject:* Re: how to call udf with parameters
> *To:* "lk_spark"<lk...@163.com>
> *Cc:* "user.spark"<us...@spark.apache.org>
>
> sample UDF
> val getlength=udf((data:String)=>data.length())
> data.select(getlength(data("col1")))
>
> On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:
>
>> hi, all
>>      I defined a UDF with multiple parameters, but I don't know how to
>> call it with a DataFrame.
>>
>> UDF:
>>
>> def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean,
>> minTermLen: Int) =>
>>     val terms = HanLP.segment(sentence).asScala
>> .....
>>
>> Call :
>>
>> scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
>> <console>:40: error: type mismatch;
>>  found   : Boolean(true)
>>  required: org.apache.spark.sql.Column
>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>                                                  ^
>> <console>:40: error: type mismatch;
>>  found   : Boolean(true)
>>  required: org.apache.spark.sql.Column
>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>                                                       ^
>> <console>:40: error: type mismatch;
>>  found   : Int(2)
>>  required: org.apache.spark.sql.Column
>>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>>                                                            ^
>>
>> scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
>> org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
>> 'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
>> +- Project [_1#2 AS id#5, _2#3 AS text#6]
>>    +- LocalRelation [_1#2, _2#3]
>>
>> I need help!!
>>
>>
>> 2017-06-16
>> ------------------------------
>> lk_spark
>>
>
>

Re: Re: how to call udf with parameters

Posted by lk_spark <lk...@163.com>.
thanks Kumar, I want to know how to call a udf with multiple parameters. Say I make a udf for a substr function: how can I pass the begin and end index parameters? I tried it and got errors. Can udf parameters only be of Column type?

2017-06-16 

lk_spark 



From: Pralabh Kumar <pr...@gmail.com>
Sent: 2017-06-16 17:49
Subject: Re: how to call udf with parameters
To: "lk_spark"<lk...@163.com>
Cc: "user.spark"<us...@spark.apache.org>

sample UDF
val getlength=udf((data:String)=>data.length())

data.select(getlength(data("col1")))



On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:

hi, all
     I defined a UDF with multiple parameters, but I don't know how to call it with a DataFrame.

UDF:

def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) =>
    val terms = HanLP.segment(sentence).asScala
.....

Call :

scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                 ^
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                      ^
<console>:40: error: type mismatch;
 found   : Int(2)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                           ^

scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
+- Project [_1#2 AS id#5, _2#3 AS text#6]
   +- LocalRelation [_1#2, _2#3]


I need help!!


2017-06-16


lk_spark 

Re: how to call udf with parameters

Posted by Yong Zhang <ja...@hotmail.com>.
What version of Spark are you using? I cannot reproduce your error:


scala> spark.version
res9: String = 2.1.1
scala> val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
dataset: org.apache.spark.sql.DataFrame = [id: int, text: string]
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

// define a method in similar way like you did
scala> def len = udf { (data: String) => data.length > 0 }
len: org.apache.spark.sql.expressions.UserDefinedFunction

// use it
scala> dataset.select(len($"text").as('length)).show
+------+
|length|
+------+
|  true|
|  true|
+------+
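The multi-argument case from the question also works in the same spark-shell session once the non-Column arguments are wrapped with lit. A sketch continuing Yong's example (`prefix` is a hypothetical UDF name introduced here, not from the thread):

```scala
scala> import org.apache.spark.sql.functions.{udf, lit}

scala> def prefix = udf { (data: String, n: Int) => data.take(n) }
prefix: org.apache.spark.sql.expressions.UserDefinedFunction

// lit(3) turns the Int into a Column, so the call type-checks
scala> dataset.select(prefix($"text", lit(3)).as('pre)).show
+---+
|pre|
+---+
|hel|
|wor|
+---+
```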


Yong



________________________________
From: Pralabh Kumar <pr...@gmail.com>
Sent: Friday, June 16, 2017 12:19 AM
To: lk_spark
Cc: user.spark
Subject: Re: how to call udf with parameters

sample UDF
val getlength=udf((data:String)=>data.length())
data.select(getlength(data("col1")))

On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:
hi, all
     I defined a UDF with multiple parameters, but I don't know how to call it with a DataFrame.

UDF:

def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) =>
    val terms = HanLP.segment(sentence).asScala
.....

Call :

scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                 ^
<console>:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                      ^
<console>:40: error: type mismatch;
 found   : Int(2)
 required: org.apache.spark.sql.Column
       val output = input.select(ssplit2($"text",true,true,2).as('words))
                                                           ^

scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
+- Project [_1#2 AS id#5, _2#3 AS text#6]
   +- LocalRelation [_1#2, _2#3]

I need help!!


2017-06-16
________________________________
lk_spark


Re: how to call udf with parameters

Posted by Pralabh Kumar <pr...@gmail.com>.
sample UDF
val getlength=udf((data:String)=>data.length())
data.select(getlength(data("col1")))

On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk...@163.com> wrote:

> hi, all
>      I defined a UDF with multiple parameters, but I don't know how to
> call it with a DataFrame.
>
> UDF:
>
> def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean,
> minTermLen: Int) =>
>     val terms = HanLP.segment(sentence).asScala
> .....
>
> Call :
>
> scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
> <console>:40: error: type mismatch;
>  found   : Boolean(true)
>  required: org.apache.spark.sql.Column
>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>                                                  ^
> <console>:40: error: type mismatch;
>  found   : Boolean(true)
>  required: org.apache.spark.sql.Column
>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>                                                       ^
> <console>:40: error: type mismatch;
>  found   : Int(2)
>  required: org.apache.spark.sql.Column
>        val output = input.select(ssplit2($"text",true,true,2).as('words))
>                                                            ^
>
> scala> val output = input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
> org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input columns: [id, text];;
> 'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
> +- Project [_1#2 AS id#5, _2#3 AS text#6]
>    +- LocalRelation [_1#2, _2#3]
>
> I need help!!
>
>
> 2017-06-16
> ------------------------------
> lk_spark
>