Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/10/09 21:01:59 UTC

How to calculate percentile of a column of DataFrame?

Hi, how do I calculate the percentile of a column in a DataFrame? I can't find
any percentile_approx function among Spark's aggregation functions. For
example, in Hive we have percentile_approx, and we can use it as follows:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");

I can see an ntile function, but I am not sure how it would give the same
results as the query above. Please guide.
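[Editor's note: there is no single definition of "percentile", and Hive's percentile_approx uses an approximate, histogram-based algorithm, so its output can differ from an exact computation. The following is a minimal plain-Python sketch (not Spark code; function names are my own) of two common exact definitions, for reference on what percentile_approx is approximating.]

```python
import math

def percentile_nearest_rank(values, q):
    """Exact nearest-rank percentile: the ceil(q * n)-th smallest value."""
    s = sorted(values)
    k = max(1, math.ceil(q * len(s)))  # 1-based rank
    return s[k - 1]

def percentile_linear(values, q):
    """Exact percentile with linear interpolation between closest ranks."""
    s = sorted(values)
    pos = q * (len(s) - 1)             # fractional 0-based position
    lo, hi = int(math.floor(pos)), int(math.ceil(pos))
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

# 25th percentile of the sample data used later in this thread, [1, 4, 5]:
print(percentile_nearest_rank([1, 4, 5], 0.25))  # -> 1
print(percentile_linear([1, 4, 5], 0.25))        # -> 2.5
```

The nearest-rank result (1) is consistent with the 1.0 that percentile_approx returns for this data in the spark-shell session quoted further down the thread; the interpolated definition gives a different answer, which is worth keeping in mind when comparing results across tools.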



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
I would suggest using http://search-hadoop.com/ to find literature on the empty
partitions directory problem.

If there is no answer there, please start a new thread with the following
information:

release of Spark
release of Hadoop
code snippet
symptom

Cheers

On Mon, Oct 12, 2015 at 12:08 PM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Ted, thanks much. Are you saying the above code will work only in 1.5.1?
> I tried upgrading to 1.5.1, but I found a potential bug: my Spark job
> creates Hive partitions using hiveContext.sql("insert into partitions"),
> and when I use Spark 1.5.1 I can't see any partition files (ORC files)
> getting created in HDFS; I can only see an empty partitions directory under
> the Hive table, along with many staging files created by Spark.
>
> On Tue, Oct 13, 2015 at 12:34 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> df.select(callUDF("percentile_approx",col("value"),
>> lit(0.25))).show()
>> +------------------------------+
>> |'percentile_approx(value,0.25)|
>> +------------------------------+
>> |                           1.0|
>> +------------------------------+
>>
>> Can you upgrade to 1.5.1 ?
>>
>> Cheers
>>
>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> Sorry, I forgot to mention that I am using Spark 1.4.1, since callUdf is
>>> available in Spark 1.4.0 according to the Javadocs.
>>>
>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ted, thanks much for the detailed answer; I appreciate your efforts.
>>>> Do we need to register Hive UDFs?
>>>>
>>>> sqlContext.udf.register("percentile_approx"); // is this even valid?
>>>>
>>>> I am calling the Hive UDF percentile_approx in the following manner,
>>>> which gives a compilation error:
>>>>
>>>> df.select("col1").groupBy("col1").agg(callUdf("percentile_approx", col("col1"), lit(0.25))); // compile error
>>>>
>>>> // compile error because callUdf() takes String and Column* as arguments.
>>>>
>>>> Please guide. Thanks much.
>>>>
>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>
>>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>> "value")
>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>
>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
>>>>> v + cnst)
>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>
>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>>> +---+--------------------+
>>>>> | id|'simpleUDF(value,25)|
>>>>> +---+--------------------+
>>>>> |id1|                  26|
>>>>> |id2|                  41|
>>>>> |id3|                  50|
>>>>> +---+--------------------+
>>>>>
>>>>> Which Spark release are you using ?
>>>>>
>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have a doubt, Michael: I tried to use callUDF in the following code
>>>>>> and it does not work.
>>>>>>
>>>>>> sourceFrame.agg(callUdf("percentile_approx", col("myCol"), lit(0.25)))
>>>>>>
>>>>>> The above code does not compile, because callUdf() takes only two
>>>>>> arguments: the function name as a String and a Column. Please guide.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> thanks much Michael let me try.
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>> michael@databricks.com> wrote:
>>>>>>>
>>>>>>>> This is confusing because I made a typo...
>>>>>>>>
>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>
>>>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>>>> to be columns that are passed in as arguments.  lit is just saying to make
>>>>>>>> a literal column that always has the value 0.25.
>>>>>>>>
>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, but I mean, this is rather curious: how does def
>>>>>>>>> lit(literal: Any) become part of a percentile function, as in lit(25)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for clarification
>>>>>>>>>
>>>>>>>>> Saif
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>
>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>> DataFrame?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I found it in the 1.3 documentation; lit says something else, not percentile:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> public static Column lit(Object literal)
>>>>>>>>> (https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html)
>>>>>>>>>
>>>>>>>>> Creates a Column of literal value.
>>>>>>>>>
>>>>>>>>> The passed-in object is returned directly if it is already a Column.
>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column as
>>>>>>>>> well. Otherwise, a new Column is created to represent the literal
>>>>>>>>> value.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Where can we find other available functions such as lit()? I
>>>>>>>>> can't find lit in the API.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>> *To:* unk1102
>>>>>>>>> *Cc:* user
>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>> DataFrame?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>> from dataframes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi, how do I calculate the percentile of a column in a DataFrame?
>>>>>>>>> I can't find any percentile_approx function among Spark's
>>>>>>>>> aggregation functions. For example, in Hive we have
>>>>>>>>> percentile_approx, and we can use it as follows:
>>>>>>>>>
>>>>>>>>> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");
>>>>>>>>>
>>>>>>>>> I can see an ntile function, but I am not sure how it would give
>>>>>>>>> the same results as the query above. Please guide.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
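[Editor's note: Ted's quoted spark-shell demo above registers simpleUDF(v, cnst) = v * v + cnst and applies it with a literal constant of 25 via lit(25). The arithmetic behind the table it prints can be sanity-checked in plain Python; this is a sketch of the UDF's logic only, not of Spark's UDF machinery.]

```python
def simple_udf(v, cnst):
    """Mirrors the Scala UDF (v: Int, cnst: Int) => v * v + cnst."""
    return v * v + cnst

rows = [("id1", 1), ("id2", 4), ("id3", 5)]
# lit(25) supplies the same constant column value for every row
results = [(rid, simple_udf(value, 25)) for rid, value in rows]
print(results)  # -> [('id1', 26), ('id2', 41), ('id3', 50)]
```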

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, thanks much. Are you saying the above code will work only in 1.5.1?
I tried upgrading to 1.5.1, but I found a potential bug: my Spark job
creates Hive partitions using hiveContext.sql("insert into partitions"),
and when I use Spark 1.5.1 I can't see any partition files (ORC files)
getting created in HDFS; I can only see an empty partitions directory under
the Hive table, along with many staging files created by Spark.


Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
I am currently dealing with a high priority bug in another project.

Hope to get back to this soon.

On Tue, Oct 13, 2015 at 11:56 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Ted, sorry for asking again. Did you get a chance to look at the
> compilation issue? Thanks much.
>
> Regards.
> On Oct 13, 2015 18:39, "Umesh Kacha" <um...@gmail.com> wrote:
>
>> Hi Ted, I am using the following line of code. I can't paste the entire
>> code, sorry, but this single line doesn't compile in my Spark job:
>>
>>  sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))
>>
>> I am using the IntelliJ editor with Java, and Maven dependencies for
>> spark-core, spark-sql, and spark-hive, version 1.5.1.
>> On Oct 13, 2015 18:21, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> Can you pastebin your Java code and the command you used to compile ?
>>>
>>> Thanks
>>>
>>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>
>>> Hi Ted, if the fix went in after the 1.5.1 release, then how come it
>>> works with the 1.5.1 binary in spark-shell?
>>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>> Looks like the fix went in after 1.5.1 was released.
>>>>
>>>> You may verify using master branch build.
>>>>
>>>> Cheers
>>>>
>>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Ted, thanks much. I tried using percentile_approx in spark-shell as
>>>> you mentioned and it works in 1.5.1, but it doesn't compile in Java
>>>> using the 1.5.1 Maven libraries; it still complains that callUdf can
>>>> take only String and Column types. Please guide.

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, sorry for asking again. Did you get a chance to look at the
compilation issue? Thanks much.

Regards.
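[Editor's note: to summarize what the thread is building toward — a per-group percentile, as in the df.select("col1").groupBy("col1").agg(callUdf("percentile_approx", ...)) call discussed above — here is a plain-Python sketch of that aggregation. It uses an exact linear-interpolation percentile; Hive's percentile_approx is approximate and may differ slightly, and the helper names here are my own.]

```python
import math
from collections import defaultdict

def percentile_linear(values, q):
    """Exact percentile via linear interpolation between closest ranks."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = int(math.floor(pos)), int(math.ceil(pos))
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def percentile_by_group(rows, q):
    """Group (key, value) rows by key, then take the q-th percentile per group,
    mirroring groupBy(col).agg(percentile(...)) in SQL terms."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: percentile_linear(vals, q) for key, vals in groups.items()}

rows = [("a", 1), ("a", 4), ("a", 5), ("b", 10), ("b", 20)]
print(percentile_by_group(rows, 0.25))  # -> {'a': 2.5, 'b': 12.5}
```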
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, thanks much for your help. So the fix is in SPARK-10671 and it is supposed
to be released in Spark 1.6.0, right? Until 1.6.0 is released I won't be able to
invoke callUdf with a string function name and percentile_approx with lit as an
argument, right?
On Oct 14, 2015 03:26, "Ted Yu" <yu...@gmail.com> wrote:

> I modified DataFrameSuite, in master branch, to call percentile_approx
> instead of simpleUDF :
>
> - deprecated callUdf in SQLContext
> - callUDF in SQLContext *** FAILED ***
>   org.apache.spark.sql.AnalysisException: undefined function
> percentile_approx;
>   at
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>   at
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>   at scala.Option.getOrElse(Option.scala:120)
>   at
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>   at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>   at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>
> SPARK-10671 is included.
> For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL
> treats percentile_approx as normal UDF.
>
> Experts can correct me, if there is any misunderstanding.
>
> Cheers
>
> On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha <um...@gmail.com>
> wrote:
>
>> Hi Ted I am using the following line of code I can't paste entire code
>> sorry but the following only line doesn't compile in my spark job
>>
>>  sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))
>>
>> I am using Intellij editor java and maven dependencies of spark core
>> spark sql spark hive version 1.5.1
>> On Oct 13, 2015 18:21, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> Can you pastebin your Java code and the command you used to compile ?
>>>
>>> Thanks
>>>
>>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>
>>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>>> 1.5.1 binary in spark-shell.
>>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>> Looks like the fix went in after 1.5.1 was released.
>>>>
>>>> You may verify using master branch build.
>>>>
>>>> Cheers
>>>>
>>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>>> and column types only. Please guide.
>>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>> "value")
>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>
>>>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>>>> lit(0.25))).show()
>>>>> +------------------------------+
>>>>> |'percentile_approx(value,0.25)|
>>>>> +------------------------------+
>>>>> |                           1.0|
>>>>> +------------------------------+
>>>>>
>>>>> Can you upgrade to 1.5.1 ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>>>>>> available in Spark 1.4.0 as per JAvadocx
>>>>>>
>>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ted thanks much for the detailed answer and appreciate your
>>>>>>> efforts. Do we need to register Hive UDFs?
>>>>>>>
>>>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>>>
>>>>>>> I am calling Hive UDF percentile_approx in the following manner
>>>>>>> which gives compilation error
>>>>>>>
>>>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>>>>> error
>>>>>>>
>>>>>>> //compile error because callUdf() takes String and Column* as
>>>>>>> arguments.
>>>>>>>
>>>>>>> Please guide. Thanks much.
>>>>>>>
>>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>>>
>>>>>>>>
>>>>>>>> SQL context available as sqlContext.
>>>>>>>>
>>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>>>> "value")
>>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>>>
>>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) =>
>>>>>>>> v * v + cnst)
>>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>>>
>>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>>>> lit(25))).show()
>>>>>>>> +---+--------------------+
>>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>>> +---+--------------------+
>>>>>>>> |id1|                  26|
>>>>>>>> |id2|                  41|
>>>>>>>> |id3|                  50|
>>>>>>>> +---+--------------------+
>>>>>>>>
>>>>>>>> Which Spark release are you using ?
>>>>>>>>
>>>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have a doubt Michael I tried to use callUDF in  the following
>>>>>>>>> code it does not work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>>>
>>>>>>>>> Above code does not compile because callUdf() takes only two
>>>>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <
>>>>>>>>> umesh.kacha@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> thanks much Michael let me try.
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>>
>>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>>
>>>>>>>>>>> The first argument is the name of the UDF, all other arguments
>>>>>>>>>>> need to be columns that are passed in as arguments.  lit is just saying to
>>>>>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for clarification
>>>>>>>>>>>>
>>>>>>>>>>>> Saif
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>>>
>>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>>> DataFrame?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I found it in 1.3 documentation lit says something else not
>>>>>>>>>>>> percent
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>>>>
>>>>>>>>>>>> Creates a Column
>>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>>>>> literal value.
>>>>>>>>>>>>
>>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>>> Column
>>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>>>>> can’t find lit in the api.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>>>> *To:* unk1102
>>>>>>>>>>>> *Cc:* user
>>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>>> DataFrame?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>>>> from dataframes.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I
>>>>>>>>>>>> cant find any
>>>>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>>>>> e.g. in Hive
>>>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>>>
>>>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>>>>> myTable);
>>>>>>>>>>>>
>>>>>>>>>>>> I can see ntile function but not sure how it is gonna give
>>>>>>>>>>>> results same as
>>>>>>>>>>>> above query please guide.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
I modified DataFrameSuite, in the master branch, to call percentile_approx
instead of simpleUDF:

- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function
percentile_approx;
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at scala.Option.getOrElse(Option.scala:120)
  at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
  at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)

SPARK-10671 is included in master.
For 1.5.1, I guess the absence of SPARK-10671 means that Spark SQL
treats percentile_approx as a normal UDF.

Experts can correct me if there is any misunderstanding.

Cheers
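
As a side note, for anyone who only needs an exact quantile of a small,
driver-side collection, the quantity percentile_approx estimates can be
sketched in plain Scala. This is an illustration only: the `percentile`
function below is a made-up helper, and it uses the linear-interpolation
rule of Hive's *exact* percentile UDAF. Spark's percentile_approx uses an
approximate, bounded-memory algorithm instead, which is why the shell
session above returned 1.0 rather than an interpolated value for the same
data.

```scala
// Exact quantile of an in-memory collection, using linear interpolation
// between the two nearest order statistics (Hive's exact percentile rule).
// percentile_approx trades this exactness for bounded memory on big columns.
def percentile(values: Seq[Double], p: Double): Double = {
  require(values.nonEmpty, "empty input")
  require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
  val sorted = values.sorted
  val rank   = p * (sorted.length - 1)          // fractional index into sorted data
  val lo     = rank.toInt                       // lower neighbour
  val hi     = math.min(lo + 1, sorted.length - 1) // upper neighbour (clamped)
  val frac   = rank - lo                        // weight of the upper neighbour
  sorted(lo) * (1 - frac) + sorted(hi) * frac
}

// The thread's example column: values 1, 4, 5
println(percentile(Seq(1.0, 4.0, 5.0), 0.25))   // prints 2.5
```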

On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Ted I am using the following line of code I can't paste entire code
> sorry but the following only line doesn't compile in my spark job
>
>  sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))
>
> I am using Intellij editor java and maven dependencies of spark core spark
> sql spark hive version 1.5.1
> On Oct 13, 2015 18:21, "Ted Yu" <yu...@gmail.com> wrote:
>
>> Can you pastebin your Java code and the command you used to compile ?
>>
>> Thanks
>>
>> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:
>>
>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>> 1.5.1 binary in spark-shell.
>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>> and column types only. Please guide.
>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>>> lit(0.25))).show()
>>>> +------------------------------+
>>>> |'percentile_approx(value,0.25)|
>>>> +------------------------------+
>>>> |                           1.0|
>>>> +------------------------------+
>>>>
>>>> Can you upgrade to 1.5.1 ?
>>>>
>>>> Cheers
>>>>
>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>>>>> available in Spark 1.4.0 as per JAvadocx
>>>>>
>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ted thanks much for the detailed answer and appreciate your
>>>>>> efforts. Do we need to register Hive UDFs?
>>>>>>
>>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>>
>>>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>>>> gives compilation error
>>>>>>
>>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>>>> error
>>>>>>
>>>>>> //compile error because callUdf() takes String and Column* as
>>>>>> arguments.
>>>>>>
>>>>>> Please guide. Thanks much.
>>>>>>
>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>>
>>>>>>>
>>>>>>> SQL context available as sqlContext.
>>>>>>>
>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>>> "value")
>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>>
>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>>>>>> * v + cnst)
>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>>
>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>>> lit(25))).show()
>>>>>>> +---+--------------------+
>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>> +---+--------------------+
>>>>>>> |id1|                  26|
>>>>>>> |id2|                  41|
>>>>>>> |id3|                  50|
>>>>>>> +---+--------------------+
>>>>>>>
>>>>>>> Which Spark release are you using ?
>>>>>>>
>>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a doubt Michael I tried to use callUDF in  the following
>>>>>>>> code it does not work.
>>>>>>>>
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>>
>>>>>>>> Above code does not compile because callUdf() takes only two
>>>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.kacha@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> thanks much Michael let me try.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>
>>>>>>>>>> The first argument is the name of the UDF, all other arguments
>>>>>>>>>> need to be columns that are passed in as arguments.  lit is just saying to
>>>>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for clarification
>>>>>>>>>>>
>>>>>>>>>>> Saif
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>>
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I found it in 1.3 documentation lit says something else not
>>>>>>>>>>> percent
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>>>
>>>>>>>>>>> Creates a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>>>> literal value.
>>>>>>>>>>>
>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>> Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>>>> can’t find lit in the api.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> *To:* unk1102
>>>>>>>>>>> *Cc:* user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>>> from dataframes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I
>>>>>>>>>>> cant find any
>>>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>>>> e.g. in Hive
>>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>>
>>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>>>> myTable);
>>>>>>>>>>>
>>>>>>>>>>> I can see ntile function but not sure how it is gonna give
>>>>>>>>>>> results same as
>>>>>>>>>>> above query please guide.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, I am using the following line of code. I can't paste the entire code,
sorry, but this single line is what doesn't compile in my Spark job:

 sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))

I am using the IntelliJ editor, Java, and Maven dependencies of spark-core,
spark-sql, and spark-hive, version 1.5.1.
On Oct 13, 2015 18:21, "Ted Yu" <yu...@gmail.com> wrote:

> Can you pastebin your Java code and the command you used to compile ?
>
> Thanks
>
> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:
>
> Hi Ted if fix went after 1.5.1 release then how come it's working with
> 1.5.1 binary in spark-shell.
> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
>> Looks like the fix went in after 1.5.1 was released.
>>
>> You may verify using master branch build.
>>
>> Cheers
>>
>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>>
>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>> 1.5.1 maven libraries it still complains same that callUdf can have string
>> and column types only. Please guide.
>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>> lit(0.25))).show()
>>> +------------------------------+
>>> |'percentile_approx(value,0.25)|
>>> +------------------------------+
>>> |                           1.0|
>>> +------------------------------+
>>>
>>> Can you upgrade to 1.5.1 ?
>>>
>>> Cheers
>>>
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>>>> available in Spark 1.4.0 as per JAvadocx
>>>>
>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ted thanks much for the detailed answer and appreciate your
>>>>> efforts. Do we need to register Hive UDFs?
>>>>>
>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>
>>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>>> gives compilation error
>>>>>
>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>>> error
>>>>>
>>>>> //compile error because callUdf() takes String and Column* as
>>>>> arguments.
>>>>>
>>>>> Please guide. Thanks much.
>>>>>
>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>
>>>>>>
>>>>>> SQL context available as sqlContext.
>>>>>>
>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>> "value")
>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>
>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>>>>> * v + cnst)
>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>
>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>> lit(25))).show()
>>>>>> +---+--------------------+
>>>>>> | id|'simpleUDF(value,25)|
>>>>>> +---+--------------------+
>>>>>> |id1|                  26|
>>>>>> |id2|                  41|
>>>>>> |id3|                  50|
>>>>>> +---+--------------------+
>>>>>>
>>>>>> Which Spark release are you using ?
>>>>>>
>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>>>> it does not work.
>>>>>>>
>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>
>>>>>>> Above code does not compile because callUdf() takes only two
>>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks much Michael let me try.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>
>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>
>>>>>>>>> The first argument is the name of the UDF, all other arguments
>>>>>>>>> need to be columns that are passed in as arguments.  lit is just saying to
>>>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>>>
>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for clarification
>>>>>>>>>>
>>>>>>>>>> Saif
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>
>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>> DataFrame?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I found it in 1.3 documentation lit says something else not
>>>>>>>>>> percent
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>>
>>>>>>>>>> Creates a Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>>> literal value.
>>>>>>>>>>
>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>> Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>>> can’t find lit in the api.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>> *To:* unk1102
>>>>>>>>>> *Cc:* user
>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>> DataFrame?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>> from dataframes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>>>>> find any
>>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>>> e.g. in Hive
>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>
>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>>> myTable);
>>>>>>>>>>
>>>>>>>>>> I can see ntile function but not sure how it is gonna give
>>>>>>>>>> results same as
>>>>>>>>>> above query please guide.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
Can you pastebin your Java code and the command you used to compile?

Thanks

> On Oct 13, 2015, at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:
> 
> Hi Ted if fix went after 1.5.1 release then how come it's working with 1.5.1 binary in spark-shell.
> 
>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> Looks like the fix went in after 1.5.1 was released. 
>> 
>> You may verify using master branch build. 
>> 
>> Cheers
>> 
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>>> 
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you mentioned it works using 1.5.1 but it doesn't compile in Java using 1.5.1 maven libraries it still complains same that callUdf can have string and column types only. Please guide.
>>> 
>>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>>> SQL context available as sqlContext.
>>>> 
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>> 
>>>> scala> df.select(callUDF("percentile_approx",col("value"), lit(0.25))).show()
>>>> +------------------------------+
>>>> |'percentile_approx(value,0.25)|
>>>> +------------------------------+
>>>> |                           1.0|
>>>> +------------------------------+
>>>> 
>>>> Can you upgrade to 1.5.1 ?
>>>> 
>>>> Cheers
>>>> 
>>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available in Spark 1.4.0 as per JAvadocx
>>>>> 
>>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>>>> Hi Ted thanks much for the detailed answer and appreciate your efforts. Do we need to register Hive UDFs?
>>>>>> 
>>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>> 
>>>>>> I am calling Hive UDF percentile_approx in the following manner which gives compilation error
>>>>>> 
>>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile error
>>>>>> 
>>>>>> //compile error because callUdf() takes String and Column* as arguments.
>>>>>> 
>>>>>> Please guide. Thanks much.
>>>>>> 
>>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>> 
>>>>>>> 
>>>>>>> SQL context available as sqlContext.
>>>>>>> 
>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>> 
>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v + cnst)
>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>> 
>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>>>>> +---+--------------------+
>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>> +---+--------------------+
>>>>>>> |id1|                  26|
>>>>>>> |id2|                  41|
>>>>>>> |id3|                  50|
>>>>>>> +---+--------------------+
>>>>>>> 
>>>>>>> Which Spark release are you using ?
>>>>>>> 
>>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com> wrote:
>>>>>>>> I have a doubt Michael I tried to use callUDF in  the following code it does not work. 
>>>>>>>> 
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>> 
>>>>>>>> Above code does not compile because callUdf() takes only two arguments function name in String and Column class type. Please guide.
>>>>>>>> 
>>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>>>>>>> thanks much Michael let me try. 
>>>>>>>>> 
>>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mi...@databricks.com> wrote:
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>> 
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>> 
>>>>>>>>>> The first argument is the name of the UDF, all other arguments need to be columns that are passed in as arguments.  lit is just saying to make a literal column that always has the value 0.25.
>>>>>>>>>> 
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com> wrote:
>>>>>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for clarification
>>>>>>>>>>> 
>>>>>>>>>>> Saif
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> From: Umesh Kacha [mailto:umesh.kacha@gmail.com] 
>>>>>>>>>>> Sent: Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> To: Ellafi, Saif A.
>>>>>>>>>>> Cc: Michael Armbrust; user
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> public static Column lit(Object literal)
>>>>>>>>>>> Creates a Column of literal value.
>>>>>>>>>>> 
>>>>>>>>>>> The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Where can we find other available functions such as lit() ? I can’t find lit in the api.
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> From: Michael Armbrust [mailto:michael@databricks.com] 
>>>>>>>>>>> Sent: Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> To: unk1102
>>>>>>>>>>> Cc: user
>>>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes.
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant find any
>>>>>>>>>>> percentile_approx function in Spark aggregation functions. For e.g. in Hive
>>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>> 
>>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>>>>>>> 
>>>>>>>>>>> I can see ntile function but not sure how it is gonna give results same as
>>>>>>>>>>> above query please guide.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>>>>> 
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
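The simpleUDF demo quoted above registers a Scala function (v: Int, cnst: Int) => v * v + cnst and invokes it per row through callUDF, with lit(25) supplying the constant. The arithmetic can be sanity-checked outside Spark; the following plain-Python stand-in (illustrative only, not Spark code) reproduces the values in the quoted output table:

```python
# Stand-in for the registered Spark UDF: (v, cnst) => v * v + cnst
def simple_udf(v, cnst):
    return v * v + cnst

# Rows from the quoted spark-shell demo: ("id1", 1), ("id2", 4), ("id3", 5)
rows = [("id1", 1), ("id2", 4), ("id3", 5)]

# callUDF("simpleUDF", $"value", lit(25)) applies the function to each row,
# with lit(25) providing the same literal constant for every row.
result = {rid: simple_udf(value, 25) for rid, value in rows}
print(result)  # {'id1': 26, 'id2': 41, 'id3': 50}
```

The printed values match the quoted spark-shell table (26, 41, 50).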

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
OK, thanks much Ted. Looks like some issue while using Maven dependencies in
Java code for 1.5.1. I am still not able to understand: if the Spark 1.5.1
binary in spark-shell can recognize callUdf, then why does callUdf not
compile when using the Maven build?
On Oct 13, 2015 2:20 PM, "Ted Yu" <yu...@gmail.com> wrote:

> Pardon me.
> I didn't read your previous response clearly.
>
> I will try to reproduce the compilation error on master branch.
> Right now, I have some other high priority task on hand.
>
> BTW I was looking at SPARK-10671
>
> FYI
>
> On Tue, Oct 13, 2015 at 1:42 AM, Umesh Kacha <um...@gmail.com>
> wrote:
>
>> Hi Ted if fix went after 1.5.1 release then how come it's working with
>> 1.5.1 binary in spark-shell.
>> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> Looks like the fix went in after 1.5.1 was released.
>>>
>>> You may verify using master branch build.
>>>
>>> Cheers
>>>
>>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>
>>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>>> 1.5.1 maven libraries it still complains same that callUdf can have string
>>> and column types only. Please guide.
>>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>>> lit(0.25))).show()
>>>> +------------------------------+
>>>> |'percentile_approx(value,0.25)|
>>>> +------------------------------+
>>>> |                           1.0|
>>>> +------------------------------+
>>>>
>>>> Can you upgrade to 1.5.1 ?
>>>>
>>>> Cheers
>>>>
>>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>>>>> available in Spark 1.4.0 as per JAvadocx
>>>>>
>>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ted thanks much for the detailed answer and appreciate your
>>>>>> efforts. Do we need to register Hive UDFs?
>>>>>>
>>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>>
>>>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>>>> gives compilation error
>>>>>>
>>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>>>> error
>>>>>>
>>>>>> //compile error because callUdf() takes String and Column* as
>>>>>> arguments.
>>>>>>
>>>>>> Please guide. Thanks much.
>>>>>>
>>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>>
>>>>>>>
>>>>>>> SQL context available as sqlContext.
>>>>>>>
>>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>>> "value")
>>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>>
>>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>>>>>> * v + cnst)
>>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>>
>>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>>> lit(25))).show()
>>>>>>> +---+--------------------+
>>>>>>> | id|'simpleUDF(value,25)|
>>>>>>> +---+--------------------+
>>>>>>> |id1|                  26|
>>>>>>> |id2|                  41|
>>>>>>> |id3|                  50|
>>>>>>> +---+--------------------+
>>>>>>>
>>>>>>> Which Spark release are you using ?
>>>>>>>
>>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a doubt Michael I tried to use callUDF in  the following
>>>>>>>> code it does not work.
>>>>>>>>
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>>
>>>>>>>> Above code does not compile because callUdf() takes only two
>>>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.kacha@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> thanks much Michael let me try.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>
>>>>>>>>>> The first argument is the name of the UDF, all other arguments
>>>>>>>>>> need to be columns that are passed in as arguments.  lit is just saying to
>>>>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for clarification
>>>>>>>>>>>
>>>>>>>>>>> Saif
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>>
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I found it in 1.3 documentation lit says something else not
>>>>>>>>>>> percent
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>>>
>>>>>>>>>>> Creates a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>>>> literal value.
>>>>>>>>>>>
>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>> Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>>>> can’t find lit in the api.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> *To:* unk1102
>>>>>>>>>>> *Cc:* user
>>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>>> DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>>> from dataframes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I
>>>>>>>>>>> cant find any
>>>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>>>> e.g. in Hive
>>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>>
>>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>>>> myTable);
>>>>>>>>>>>
>>>>>>>>>>> I can see ntile function but not sure how it is gonna give
>>>>>>>>>>> results same as
>>>>>>>>>>> above query please guide.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
Pardon me.
I didn't read your previous response clearly.

I will try to reproduce the compilation error on master branch.
Right now, I have some other high priority task on hand.

BTW I was looking at SPARK-10671

FYI

On Tue, Oct 13, 2015 at 1:42 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Ted if fix went after 1.5.1 release then how come it's working with
> 1.5.1 binary in spark-shell.
> On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
>> Looks like the fix went in after 1.5.1 was released.
>>
>> You may verify using master branch build.
>>
>> Cheers
>>
>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>>
>> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
>> you mentioned it works using 1.5.1 but it doesn't compile in Java using
>> 1.5.1 maven libraries it still complains same that callUdf can have string
>> and column types only. Please guide.
>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> df.select(callUDF("percentile_approx",col("value"),
>>> lit(0.25))).show()
>>> +------------------------------+
>>> |'percentile_approx(value,0.25)|
>>> +------------------------------+
>>> |                           1.0|
>>> +------------------------------+
>>>
>>> Can you upgrade to 1.5.1 ?
>>>
>>> Cheers
>>>
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is
>>>> available in Spark 1.4.0 as per JAvadocx
>>>>
>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ted thanks much for the detailed answer and appreciate your
>>>>> efforts. Do we need to register Hive UDFs?
>>>>>
>>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>>
>>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>>> gives compilation error
>>>>>
>>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>>> error
>>>>>
>>>>> //compile error because callUdf() takes String and Column* as
>>>>> arguments.
>>>>>
>>>>> Please guide. Thanks much.
>>>>>
>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>>
>>>>>>
>>>>>> SQL context available as sqlContext.
>>>>>>
>>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>>> "value")
>>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>>
>>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v
>>>>>> * v + cnst)
>>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>>
>>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value",
>>>>>> lit(25))).show()
>>>>>> +---+--------------------+
>>>>>> | id|'simpleUDF(value,25)|
>>>>>> +---+--------------------+
>>>>>> |id1|                  26|
>>>>>> |id2|                  41|
>>>>>> |id3|                  50|
>>>>>> +---+--------------------+
>>>>>>
>>>>>> Which Spark release are you using ?
>>>>>>
>>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>>>> it does not work.
>>>>>>>
>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>
>>>>>>> Above code does not compile because callUdf() takes only two
>>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks much Michael let me try.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>
>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>
>>>>>>>>> The first argument is the name of the UDF, all other arguments
>>>>>>>>> need to be columns that are passed in as arguments.  lit is just saying to
>>>>>>>>> make a literal column that always has the value 0.25.
>>>>>>>>>
>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for clarification
>>>>>>>>>>
>>>>>>>>>> Saif
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>>
>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>> DataFrame?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I found it in 1.3 documentation lit says something else not
>>>>>>>>>> percent
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>>
>>>>>>>>>> Creates a Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>>> literal value.
>>>>>>>>>>
>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>> Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>>> can’t find lit in the api.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>>> *To:* unk1102
>>>>>>>>>> *Cc:* user
>>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>>> DataFrame?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>> from dataframes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>>>>> find any
>>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>>> e.g. in Hive
>>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>>
>>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>>> myTable);
>>>>>>>>>>
>>>>>>>>>> I can see ntile function but not sure how it is gonna give
>>>>>>>>>> results same as
>>>>>>>>>> above query please guide.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
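For intuition about the percentile_approx result quoted above (1.0 for the 0.25 quantile of the values 1, 4, 5): Hive's percentile_approx builds an approximate histogram internally, but on a dataset this small its answer coincides with an exact nearest-rank percentile. A plain-Python sketch of the nearest-rank definition (illustrative only; it does not reproduce Hive's approximation algorithm):

```python
import math

def nearest_rank_percentile(values, p):
    """Exact nearest-rank percentile: the value at rank ceil(p * n), 1-indexed,
    in the sorted data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))  # clamp so p=0 still yields rank 1
    return float(ordered[rank - 1])

# The quoted demo data: column "value" = 1, 4, 5; quantile 0.25
# ceil(0.25 * 3) = 1, so the result is the smallest value.
print(nearest_rank_percentile([1, 4, 5], 0.25))  # 1.0
```

This matches the 1.0 shown in the quoted spark-shell output.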

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, if the fix went in after the 1.5.1 release, then how come it's working
with the 1.5.1 binary in spark-shell?
On Oct 13, 2015 1:32 PM, "Ted Yu" <yu...@gmail.com> wrote:

> Looks like the fix went in after 1.5.1 was released.
>
> You may verify using master branch build.
>
> Cheers
>
> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
>
> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like
> you mentioned it works using 1.5.1 but it doesn't compile in Java using
> 1.5.1 maven libraries it still complains same that callUdf can have string
> and column types only. Please guide.
> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> df.select(callUDF("percentile_approx",col("value"),
>> lit(0.25))).show()
>> +------------------------------+
>> |'percentile_approx(value,0.25)|
>> +------------------------------+
>> |                           1.0|
>> +------------------------------+
>>
>> Can you upgrade to 1.5.1 ?
>>
>> Cheers
>>
>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>>> in Spark 1.4.0 as per JAvadocx
>>>
>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>>>> Do we need to register Hive UDFs?
>>>>
>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>
>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>> gives compilation error
>>>>
>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>> error
>>>>
>>>> //compile error because callUdf() takes String and Column* as arguments.
>>>>
>>>> Please guide. Thanks much.
>>>>
>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>
>>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>> "value")
>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>
>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
>>>>> v + cnst)
>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>
>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>>> +---+--------------------+
>>>>> | id|'simpleUDF(value,25)|
>>>>> +---+--------------------+
>>>>> |id1|                  26|
>>>>> |id2|                  41|
>>>>> |id3|                  50|
>>>>> +---+--------------------+
>>>>>
>>>>> Which Spark release are you using ?
>>>>>
>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>>> it does not work.
>>>>>>
>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>
>>>>>> Above code does not compile because callUdf() takes only two
>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> thanks much Michael let me try.
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>> michael@databricks.com> wrote:
>>>>>>>
>>>>>>>> This is confusing because I made a typo...
>>>>>>>>
>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>
>>>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>>>> to be columns that are passed in as arguments.  lit is just saying to make
>>>>>>>> a literal column that always has the value 0.25.
>>>>>>>>
>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes but I mean, this is rather curious. How is def
>>>>>>>>> lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for clarification
>>>>>>>>>
>>>>>>>>> Saif
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>>
>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>> DataFrame?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>>
>>>>>>>>> Creates a Column
>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>>> literal value.
>>>>>>>>>
>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>> Column
>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>>> Otherwise, a new Column
>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>>> created to represent the literal value.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Where can we find other available functions such as lit() ? I
>>>>>>>>> can’t find lit in the api.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>>> *To:* unk1102
>>>>>>>>> *Cc:* user
>>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>>> DataFrame?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>> from dataframes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>>>> find any
>>>>>>>>> percentile_approx function in Spark aggregation functions. For
>>>>>>>>> e.g. in Hive
>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>>
>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>>> myTable);
>>>>>>>>>
>>>>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>>>>> same as
>>>>>>>>> above query please guide.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
Looks like the fix went in after 1.5.1 was released. 

You may verify using a master-branch build. 

Cheers

> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <um...@gmail.com> wrote:
> 
> Hi Ted, thanks much I tried using percentile_approx in Spark-shell like you mentioned it works using 1.5.1 but it doesn't compile in Java using 1.5.1 maven libraries it still complains same that callUdf can have string and column types only. Please guide.
> 
>> On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:
>> SQL context available as sqlContext.
>> 
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>> 
>> scala> df.select(callUDF("percentile_approx",col("value"), lit(0.25))).show()
>> +------------------------------+
>> |'percentile_approx(value,0.25)|
>> +------------------------------+
>> |                           1.0|
>> +------------------------------+
>> 
>> Can you upgrade to 1.5.1 ?
>> 
>> Cheers
>> 
>>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com> wrote:
>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available in Spark 1.4.0 as per JAvadocx
>>> 
>>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>> Hi Ted thanks much for the detailed answer and appreciate your efforts. Do we need to register Hive UDFs?
>>>> 
>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>> 
>>>> I am calling Hive UDF percentile_approx in the following manner which gives compilation error
>>>> 
>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile error
>>>> 
>>>> //compile error because callUdf() takes String and Column* as arguments.
>>>> 
>>>> Please guide. Thanks much.
>>>> 
>>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>> 
>>>>> 
>>>>> SQL context available as sqlContext.
>>>>> 
>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>> 
>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v + cnst)
>>>>> res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List())
>>>>> 
>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>>> +---+--------------------+
>>>>> | id|'simpleUDF(value,25)|
>>>>> +---+--------------------+
>>>>> |id1|                  26|
>>>>> |id2|                  41|
>>>>> |id3|                  50|
>>>>> +---+--------------------+
>>>>> 
>>>>> Which Spark release are you using ?
>>>>> 
>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com> wrote:
>>>>>> I have a doubt Michael I tried to use callUDF in  the following code it does not work. 
>>>>>> 
>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>> 
>>>>>> Above code does not compile because callUdf() takes only two arguments function name in String and Column class type. Please guide.
>>>>>> 
>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com> wrote:
>>>>>>> thanks much Michael let me try. 
>>>>>>> 
>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mi...@databricks.com> wrote:
>>>>>>>> This is confusing because I made a typo...
>>>>>>>> 
>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>> 
>>>>>>>> The first argument is the name of the UDF, all other arguments need to be columns that are passed in as arguments.  lit is just saying to make a literal column that always has the value 0.25.
>>>>>>>> 
>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com> wrote:
>>>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) --> becomes a percentile function lit(25)
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> Thanks for clarification
>>>>>>>>> 
>>>>>>>>> Saif
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> From: Umesh Kacha [mailto:umesh.kacha@gmail.com] 
>>>>>>>>> Sent: Friday, October 09, 2015 4:10 PM
>>>>>>>>> To: Ellafi, Saif A.
>>>>>>>>> Cc: Michael Armbrust; user
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> public static Column lit(Object literal)
>>>>>>>>> Creates a Column of literal value.
>>>>>>>>> 
>>>>>>>>> The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com> wrote:
>>>>>>>>> 
>>>>>>>>> Where can we find other available functions such as lit() ? I can’t find lit in the api.
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> From: Michael Armbrust [mailto:michael@databricks.com] 
>>>>>>>>> Sent: Friday, October 09, 2015 4:04 PM
>>>>>>>>> To: unk1102
>>>>>>>>> Cc: user
>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes.
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant find any
>>>>>>>>> percentile_approx function in Spark aggregation functions. For e.g. in Hive
>>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>> 
>>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>>>>> 
>>>>>>>>> I can see ntile function but not sure how it is gonna give results same as
>>>>>>>>> above query please guide.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>>> 
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, thanks much. I tried percentile_approx in spark-shell as you showed
and it works on 1.5.1, but the same call still does not compile in Java
against the 1.5.1 Maven artifacts; the compiler complains that callUdf
accepts only String and Column arguments. Please guide.
On Oct 13, 2015 12:34 AM, "Ted Yu" <yu...@gmail.com> wrote:

> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> df.select(callUDF("percentile_approx",col("value"),
> lit(0.25))).show()
> +------------------------------+
> |'percentile_approx(value,0.25)|
> +------------------------------+
> |                           1.0|
> +------------------------------+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com>
> wrote:
>
>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>> in Spark 1.4.0 as per JAvadocx
>>
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>>> Do we need to register Hive UDFs?
>>>
>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>
>>> I am calling Hive UDF percentile_approx in the following manner which
>>> gives compilation error
>>>
>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>> error
>>>
>>> //compile error because callUdf() takes String and Column* as arguments.
>>>
>>> Please guide. Thanks much.
>>>
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>
>>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
>>>> v + cnst)
>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>
>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>> +---+--------------------+
>>>> | id|'simpleUDF(value,25)|
>>>> +---+--------------------+
>>>> |id1|                  26|
>>>> |id2|                  41|
>>>> |id3|                  50|
>>>> +---+--------------------+
>>>>
>>>> Which Spark release are you using ?
>>>>
>>>> Can you pastebin the full stack trace where you got the error ?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>> it does not work.
>>>>>
>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>
>>>>> Above code does not compile because callUdf() takes only two arguments
>>>>> function name in String and Column class type. Please guide.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> thanks much Michael let me try.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>>
>>>>>>> This is confusing because I made a typo...
>>>>>>>
>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>
>>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>>> to be columns that are passed in as arguments.  lit is just saying to make
>>>>>>> a literal column that always has the value 0.25.
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>>>> --> becomes a percentile function lit(25)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for clarification
>>>>>>>>
>>>>>>>> Saif
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>>> *To:* Ellafi, Saif A.
>>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>>
>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>> DataFrame?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>>
>>>>>>>> Creates a Column
>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>>> literal value.
>>>>>>>>
>>>>>>>> The passed in object is returned directly if it is already a Column
>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>>> Otherwise, a new Column
>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>>> created to represent the literal value.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>>>> find lit in the api.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>>> *To:* unk1102
>>>>>>>> *Cc:* user
>>>>>>>> *Subject:* Re: How to calculate percentile of a column of
>>>>>>>> DataFrame?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>> from dataframes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>>> find any
>>>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>>>> in Hive
>>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>>
>>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>>> myTable);
>>>>>>>>
>>>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>>>> same as
>>>>>>>> above query please guide.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> df.select(callUDF("percentile_approx",col("value"),
lit(0.25))).show()
+------------------------------+
|'percentile_approx(value,0.25)|
+------------------------------+
|                           1.0|
+------------------------------+

Can you upgrade to 1.5.1 ?

Cheers
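
As a plain-Python cross-check of what that 0.25 percentile means on the toy
data above (this sketch uses the exact nearest-rank rule; Hive's
percentile_approx builds an approximate histogram instead, so results can
differ on larger inputs):

```python
import math

def nearest_rank_percentile(values, p):
    """Exact percentile by the nearest-rank rule.

    Reference only: Hive's percentile_approx approximates this with a
    histogram, so large datasets may yield slightly different values.
    """
    if not values:
        raise ValueError("empty input")
    ordered = sorted(values)
    # smallest rank whose cumulative share of the data covers p
    rank = max(1, math.ceil(p * len(ordered)))
    return float(ordered[rank - 1])

# Same values as the spark-shell example: 1, 4, 5
print(nearest_rank_percentile([1, 4, 5], 0.25))  # 1.0, matching the DataFrame output
```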

On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <um...@gmail.com> wrote:

> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
> in Spark 1.4.0 as per JAvadocx
>
> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com>
> wrote:
>
>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>> Do we need to register Hive UDFs?
>>
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>
>> I am calling Hive UDF percentile_approx in the following manner which
>> gives compilation error
>>
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>> error
>>
>> //compile error because callUdf() takes String and Column* as arguments.
>>
>> Please guide. Thanks much.
>>
>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> Using spark-shell, I did the following exercise (master branch) :
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v
>>> + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>
>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>> +---+--------------------+
>>> | id|'simpleUDF(value,25)|
>>> +---+--------------------+
>>> |id1|                  26|
>>> |id2|                  41|
>>> |id3|                  50|
>>> +---+--------------------+
>>>
>>> Which Spark release are you using ?
>>>
>>> Can you pastebin the full stack trace where you got the error ?
>>>
>>> Cheers
>>>
>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>>> does not work.
>>>>
>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>
>>>> Above code does not compile because callUdf() takes only two arguments
>>>> function name in String and Column class type. Please guide.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>>> wrote:
>>>>
>>>>> thanks much Michael let me try.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>>
>>>>>> This is confusing because I made a typo...
>>>>>>
>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>
>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>> to be columns that are passed in as arguments.  lit is just saying to make
>>>>>> a literal column that always has the value 0.25.
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>>> --> becomes a percentile function lit(25)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for clarification
>>>>>>>
>>>>>>> Saif
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>> *To:* Ellafi, Saif A.
>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>
>>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>>
>>>>>>> Creates a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>>> literal value.
>>>>>>>
>>>>>>> The passed in object is returned directly if it is already a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>>> Otherwise, a new Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>>> created to represent the literal value.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>>> find lit in the api.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>> *To:* unk1102
>>>>>>> *Cc:* user
>>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>>>> dataframes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>> find any
>>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>>> in Hive
>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>
>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>> myTable);
>>>>>>>
>>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>>> same as
>>>>>>> above query please guide.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Sorry, I forgot to mention that I am using Spark 1.4.1, since callUdf has
been available since Spark 1.4.0 according to the Javadocs.

On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Ted thanks much for the detailed answer and appreciate your efforts. Do
> we need to register Hive UDFs?
>
> sqlContext.udf.register("percentile_approx");???//is it valid?
>
> I am calling Hive UDF percentile_approx in the following manner which
> gives compilation error
>
> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
> error
>
> //compile error because callUdf() takes String and Column* as arguments.
>
> Please guide. Thanks much.
>
> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Using spark-shell, I did the following exercise (master branch) :
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v
>> + cnst)
>> res0: org.apache.spark.sql.UserDefinedFunction =
>> UserDefinedFunction(<function2>,IntegerType,List())
>>
>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>> +---+--------------------+
>> | id|'simpleUDF(value,25)|
>> +---+--------------------+
>> |id1|                  26|
>> |id2|                  41|
>> |id3|                  50|
>> +---+--------------------+
>>
>> Which Spark release are you using ?
>>
>> Can you pastebin the full stack trace where you got the error ?
>>
>> Cheers
>>
>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>> does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two arguments
>>> function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>>> wrote:
>>>
>>>> thanks much Michael let me try.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> This is confusing because I made a typo...
>>>>>
>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>
>>>>> The first argument is the name of the UDF, all other arguments need to
>>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>>> literal column that always has the value 0.25.
>>>>>
>>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com> wrote:
>>>>>
>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>> --> becomes a percentile function lit(25)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for clarification
>>>>>>
>>>>>> Saif
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Umesh Kacha [mailto:umesh.kacha@gmail.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>> *To:* Ellafi, Saif A.
>>>>>> *Cc:* Michael Armbrust; user
>>>>>>
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>
>>>>>>
>>>>>>
>>>>>> public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)
>>>>>>
>>>>>> Creates a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of
>>>>>> literal value.
>>>>>>
>>>>>> The passed in object is returned directly if it is already a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also.
>>>>>> Otherwise, a new Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is
>>>>>> created to represent the literal value.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, <Sa...@wellsfargo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>> find lit in the api.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Michael Armbrust [mailto:michael@databricks.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>> *To:* unk1102
>>>>>> *Cc:* user
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>>> dataframes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <um...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>> find any
>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>> in Hive
>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>
>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>>
>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>> same as
>>>>>> above query please guide.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, thanks much for the detailed answer; I appreciate your efforts. Do
we need to register Hive UDFs first?

sqlContext.udf.register("percentile_approx"); // is this valid?

I am calling the Hive UDF percentile_approx in the following manner, which
gives a compilation error:

df.select("col1").groupBy("col1").agg(callUdf("percentile_approx", col("col1"), lit(0.25))); // compile error

// compile error because callUdf() takes String and Column* as arguments.

Please guide. Thanks much.
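
To illustrate the intended result of that groupBy/agg, independent of the
compile issue, here is a plain-Python sketch of a per-group lower quartile.
The group keys and values are made up for illustration, and the exact
nearest-rank rule stands in for Hive's approximation:

```python
import math
from collections import defaultdict

def lower_quartile(values):
    # exact 0.25 percentile by nearest rank; percentile_approx approximates this
    ordered = sorted(values)
    rank = max(1, math.ceil(0.25 * len(ordered)))
    return float(ordered[rank - 1])

# Hypothetical (group_key, value) rows standing in for the DataFrame columns
rows = [("a", 1), ("a", 4), ("a", 5), ("b", 2), ("b", 8)]

groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# analogue of groupBy(...).agg(percentile_approx(..., 0.25))
per_group = {key: lower_quartile(vals) for key, vals in groups.items()}
print(per_group)  # {'a': 1.0, 'b': 2.0}
```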

On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yu...@gmail.com> wrote:

> Using spark-shell, I did the following exercise (master branch) :
>
>
> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v +
> cnst)
> res0: org.apache.spark.sql.UserDefinedFunction =
> UserDefinedFunction(<function2>,IntegerType,List())
>
> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
> +---+--------------------+
> | id|'simpleUDF(value,25)|
> +---+--------------------+
> |id1|                  26|
> |id2|                  41|
> |id3|                  50|
> +---+--------------------+
>
> Which Spark release are you using ?
>
> Can you pastebin the full stack trace where you got the error ?
>
> Cheers
>
> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com> wrote:
>
>> I have a doubt Michael I tried to use callUDF in  the following code it
>> does not work.
>>
>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>
>> Above code does not compile because callUdf() takes only two arguments
>> function name in String and Column class type. Please guide.
>>
>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> thanks much Michael let me try.
>>>
>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> This is confusing because I made a typo...
>>>>
>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>
>>>> The first argument is the name of the UDF, all other arguments need to
>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>> literal column that always has the value 0.25.
>>>>
>>>> On Fri, Oct 9, 2015 at 12:16 PM, <Sa...@wellsfargo.com> wrote:
>>>>
>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>> --> becomes a percentile function lit(25)
>>>>>
>>>>>
>>>>>
>>>>> Thanks for clarification
>>>>>
>>>>> Saif
>>>>>
>>>>>
>>>>>

Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
Using spark-shell, I did the following exercise (master branch):


SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v +
cnst)
res0: org.apache.spark.sql.UserDefinedFunction =
UserDefinedFunction(<function2>,IntegerType,List())

scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
+---+--------------------+
| id|'simpleUDF(value,25)|
+---+--------------------+
|id1|                  26|
|id2|                  41|
|id3|                  50|
+---+--------------------+

Which Spark release are you using ?

Can you pastebin the full stack trace where you got the error ?

Cheers
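For reference, here is a sketch of how the call from this thread should look on a build where callUDF accepts a function name plus varargs Columns (1.5/master, as in the exercise above). The names sourceFrame and myCol are the ones used earlier in the thread, and a HiveContext is assumed so that the Hive UDAF percentile_approx can be resolved; treat this as untested against any particular release:

```scala
// Sketch only: assumes Spark 1.5+/master, where callUDF(name, cols*) exists,
// and a HiveContext so percentile_approx (a Hive UDAF) resolves by name.
import org.apache.spark.sql.functions.{callUDF, col, lit}

// 25th percentile of myCol; lit(0.25) wraps the fraction as a literal Column.
val quartile = sourceFrame.agg(
  callUDF("percentile_approx", col("myCol"), lit(0.25)))
quartile.show()
```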

On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Richard Eggert <ri...@gmail.com>.
I think the problem may be that callUDF takes a DataType indicating the
return type of the UDF as its second argument.
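If that is the overload being hit, the mismatch would explain the compile error: the second parameter slot wants a DataType, not a Column. A hedged sketch of that typed variant (signature as it appears in the 1.3/1.4 functions API; verify against the release actually in use):

```scala
// Hedged sketch: 1.3/1.4 expose overloads like callUDF(f, returnType, args*),
// where the second argument is the UDF's return DataType. Passing col(...)
// in that position will not compile.
import org.apache.spark.sql.functions.{callUDF, col}
import org.apache.spark.sql.types.DoubleType

// A Scala function lifted into a UDF call with an explicit return type:
val squared = callUDF((x: Double) => x * x, DoubleType, col("myCol"))
```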
On Oct 12, 2015 9:27 AM, "Umesh Kacha" <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Ted, thanks. If I don't pass the lit function, how can I tell
percentile_approx to give me the 25th or 50th percentile, the way we do in
Hive with percentile_approx(mycol, 0.25)?

Regards
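Until the DataFrame-side call compiles, one workaround implied by the original post is to fall back to HiveQL, where the fraction is passed directly. A sketch, reusing the hiveContext and the table/column names from the thread:

```scala
// Sketch: register the frame and let HiveQL resolve percentile_approx itself.
sourceFrame.registerTempTable("myTable")
val quartile = hiveContext.sql(
  "SELECT percentile_approx(myCol, 0.25) FROM myTable")
quartile.show()
```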

On Mon, Oct 12, 2015 at 7:20 PM, Ted Yu <yu...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
Umesh:
Have you tried calling callUdf without the lit() parameter?

Cheers

On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, any help would be great as I am stuck. I don't know how to fix the
compilation error in callUdf when we pass three parameters: the function name
as a string, the column name via col, and the lit function. Please guide.
On Oct 11, 2015 1:05 AM, "Umesh Kacha" <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, any idea? How do I call percentile_approx using callUdf()? Please guide.

On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
I have a doubt, Michael. I tried to use callUDF in the following code and it
does not work.

sourceFrame.agg(callUdf("percentile_approx", col("myCol"), lit(0.25)))

The above code does not compile because callUdf() takes only two arguments:
the function name as a String and a Column. Please guide.
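Another hedged alternative for releases where callUdf rejects the extra argument: push the whole expression through selectExpr, which parses a SQL expression string. This assumes sourceFrame comes from a HiveContext, as elsewhere in the thread, so the Hive UDAF is visible to the parser:

```scala
// Sketch: selectExpr parses the expression, sidestepping the callUdf overloads.
val quartile = sourceFrame.selectExpr("percentile_approx(myCol, 0.25)")
```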

On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <um...@gmail.com> wrote:


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Thanks much, Michael, let me try.


Re: How to calculate percentile of a column of DataFrame?

Posted by Michael Armbrust <mi...@databricks.com>.
This is confusing because I made a typo...

callUDF("percentile_approx", col("mycol"), lit(0.25))

The first argument is the name of the UDF; all other arguments need to be
columns that are passed in as arguments to it. lit just makes a literal
column that always has the value 0.25.
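
For readers skimming the thread, here is the corrected call in context. This is an untested sketch, not a verified program: it assumes a Spark 1.5-era setup with a DataFrame named df holding a numeric column mycol (both names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// Approximate 25th percentile of "mycol" via the Hive UDAF,
// roughly equivalent to:
//   hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")
val q25 = df.agg(callUDF("percentile_approx", col("mycol"), lit(0.25)))
q25.show()
```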


RE: How to calculate percentile of a column of DataFrame?

Posted by Sa...@wellsfargo.com.
Yes, but I mean, this is rather curious: how does def lit(literal: Any) become a percentile function, as in lit(0.25)?

Thanks for clarification
Saif


Re: How to calculate percentile of a column of DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
I found it in the 1.3 documentation; lit says something else, not percentile:

public static Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> lit(Object literal)

Creates a Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> of literal value.

The passed in object is returned directly if it is already a Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>. If the object is a Scala Symbol, it is converted into a Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> also. Otherwise, a new Column <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html> is created to represent the literal value.
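
So lit is not percentile-related at all; it just wraps a constant as a Column so it can be passed where the API expects column arguments (lit lives in org.apache.spark.sql.functions, alongside col and callUDF). A hedged sketch of the idea, not run against a cluster (df and mycol are illustrative names):

```scala
import org.apache.spark.sql.functions.{col, lit}

// lit(0.25) builds a Column whose value is the constant 0.25 on every row;
// passed to callUDF, it supplies the percentile argument to the Hive UDAF.
df.select(col("mycol"), lit(0.25).as("p"))
```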


RE: How to calculate percentile of a column of DataFrame?

Posted by Sa...@wellsfargo.com.
Where can we find other available functions such as lit()? I can’t find lit in the API.

Thanks



Re: How to calculate percentile of a column of DataFrame?

Posted by Michael Armbrust <mi...@databricks.com>.
You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
dataframes.
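
As background on what the thread is computing: a percentile is a rank statistic, and percentile_approx approximates it with a bounded-size summary so it can aggregate a huge column in one pass. For comparison, here is an exact nearest-rank percentile in plain Scala — a sketch of the definition being approximated, not of Spark's or Hive's actual algorithm:

```scala
// Exact nearest-rank percentile: the smallest value v such that at least
// p * n of the observations are <= v. Hive's percentile_approx trades this
// exactness (and the full sort) for bounded memory on large columns.
def percentile(values: Seq[Double], p: Double): Double = {
  require(values.nonEmpty && p > 0.0 && p <= 1.0, "need data and 0 < p <= 1")
  val sorted = values.sorted
  val rank = math.ceil(p * sorted.length).toInt  // 1-based nearest rank
  sorted(rank - 1)
}

val data = (1 to 100).map(_.toDouble)
println(percentile(data, 0.25))  // 25.0
println(percentile(data, 0.5))   // 50.0
```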
