You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Tzahi File <tz...@ironsrc.com> on 2019/11/11 14:45:41 UTC

Using Percentile in Spark SQL

Hi,

Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
percentile function. I'm trying to improve this job by moving it to run
with spark SQL.

Any suggestions on how to use a percentile function in Spark?


Thanks,
-- 
Tzahi File
Data Engineer
[image: ironSource] <http://www.ironsrc.com/>

email tzahi.file@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com <http://www.ironsrc.com/>
[image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
twitter] <https://twitter.com/ironsource>[image: facebook]
<https://www.facebook.com/ironSource>[image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.

Re: Using Percentile in Spark SQL

Posted by Jerry Vinokurov <gr...@gmail.com>.

I don't think the Spark configuration is what you want to focus on. It's
hard to say without knowing the specifics of the job or the data volume,
but you should be able to accomplish this with the percent_rank function in
SparkSQL and a smart partitioning of the data. If your data has a lot of
skew, you can end up with a situation in which some executors are waiting
around to do work while others are stuck with processing larger partitions,
so you'll need to take a look at the actual stats of your data and figure
out if there's a more efficient partitioning strategy that you can use.

On Mon, Nov 11, 2019 at 10:34 AM Tzahi File <tz...@ironsrc.com> wrote:

> Currently, I'm using the percentile approx function with Hive.
> I'm looking for a better way to run this function or another way to get
> the same result with spark, but faster and not using gigantic instances..
>
> I'm trying to optimize this job by changing the Spark configuration. If
> you have any ideas how to approach this, it would be great (like instance
> type, number of instances, number of executers etc.)
>
>
> On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy <pm...@dstillery.com>
> wrote:
>
>> Depending on your tolerance for error you could also use
>> percentile_approx().
>>
>> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <gr...@gmail.com>
>> wrote:
>>
>>> Do you mean that you are trying to compute the percent rank of some
>>> data? You can use the SparkSQL percent_rank function for that, but I don't
>>> think that's going to give you any improvement over calling the percentRank
>>> function on the data frame. Are you currently using a user-defined function
>>> for this task? Because I bet that's what's slowing you down.
>>>
>>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tz...@ironsrc.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>>> percentile function. I'm trying to improve this job by moving it to run
>>>> with spark SQL.
>>>>
>>>> Any suggestions on how to use a percentile function in Spark?
>>>>
>>>>
>>>> Thanks,
>>>> --
>>>> Tzahi File
>>>> Data Engineer
>>>> [image: ironSource] <http://www.ironsrc.com/>
>>>>
>>>> email tzahi.file@ironsrc.com
>>>> mobile +972-546864835
>>>> fax +972-77-5448273
>>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>>> ironsrc.com <http://www.ironsrc.com/>
>>>> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
>>>> twitter] <https://twitter.com/ironsource>[image: facebook]
>>>> <https://www.facebook.com/ironSource>[image: googleplus]
>>>> <https://plus.google.com/+ironsrc>
>>>> This email (including any attachments) is for the sole use of the
>>>> intended recipient and may contain confidential information which may be
>>>> protected by legal privilege. If you are not the intended recipient, or the
>>>> employee or agent responsible for delivering it to the intended recipient,
>>>> you are hereby notified that any use, dissemination, distribution or
>>>> copying of this communication and/or its content is strictly prohibited. If
>>>> you are not the intended recipient, please immediately notify us by reply
>>>> email or by telephone, delete this email and destroy any copies. Thank you.
>>>>
>>>
>>>
>>> --
>>> http://www.google.com/profiles/grapesmoker
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
>
> --
> Tzahi File
> Data Engineer
> [image: ironSource] <http://www.ironsrc.com/>
>
> email tzahi.file@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com <http://www.ironsrc.com/>
> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> twitter] <https://twitter.com/ironsource>[image: facebook]
> <https://www.facebook.com/ironSource>[image: googleplus]
> <https://plus.google.com/+ironsrc>
> This email (including any attachments) is for the sole use of the intended
> recipient and may contain confidential information which may be protected
> by legal privilege. If you are not the intended recipient, or the employee
> or agent responsible for delivering it to the intended recipient, you are
> hereby notified that any use, dissemination, distribution or copying of
> this communication and/or its content is strictly prohibited. If you are
> not the intended recipient, please immediately notify us by reply email or
> by telephone, delete this email and destroy any copies. Thank you.
>


-- 
http://www.google.com/profiles/grapesmoker

Re: Using Percentile in Spark SQL

Posted by Tzahi File <tz...@ironsrc.com>.

Currently, I'm using the percentile approx function with Hive.
I'm looking for a better way to run this function or another way to get the
same result with spark, but faster and not using gigantic instances..

I'm trying to optimize this job by changing the Spark configuration. If you
have any ideas how to approach this, it would be great (like instance type,
number of instances, number of executers etc.)

On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy <pm...@dstillery.com>
wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <gr...@gmail.com>
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tz...@ironsrc.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>> [image: ironSource] <http://www.ironsrc.com/>
>>>
>>> email tzahi.file@ironsrc.com
>>> mobile +972-546864835
>>> fax +972-77-5448273
>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>> ironsrc.com <http://www.ironsrc.com/>
>>> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
>>> twitter] <https://twitter.com/ironsource>[image: facebook]
>>> <https://www.facebook.com/ironSource>[image: googleplus]
>>> <https://plus.google.com/+ironsrc>
>>> This email (including any attachments) is for the sole use of the
>>> intended recipient and may contain confidential information which may be
>>> protected by legal privilege. If you are not the intended recipient, or the
>>> employee or agent responsible for delivering it to the intended recipient,
>>> you are hereby notified that any use, dissemination, distribution or
>>> copying of this communication and/or its content is strictly prohibited. If
>>> you are not the intended recipient, please immediately notify us by reply
>>> email or by telephone, delete this email and destroy any copies. Thank you.
>>>
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>

-- 
Tzahi File
Data Engineer
[image: ironSource] <http://www.ironsrc.com/>

email tzahi.file@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com <http://www.ironsrc.com/>
[image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
twitter] <https://twitter.com/ironsource>[image: facebook]
<https://www.facebook.com/ironSource>[image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.

Re: Using Percentile in Spark SQL

Posted by Muthu Jayakumar <ba...@gmail.com>.

If you would require higher precision, you may have to write a custom udaf.
In my case, I ended up storing the data as a key-value ordered list of
histograms.

Thanks
Muthu

On Mon, Nov 11, 2019, 20:46 Patrick McCarthy
<pm...@dstillery.com.invalid> wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <gr...@gmail.com>
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tz...@ironsrc.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>> [image: ironSource] <http://www.ironsrc.com/>
>>>
>>> email tzahi.file@ironsrc.com
>>> mobile +972-546864835
>>> fax +972-77-5448273
>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>> ironsrc.com <http://www.ironsrc.com/>
>>> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
>>> twitter] <https://twitter.com/ironsource>[image: facebook]
>>> <https://www.facebook.com/ironSource>[image: googleplus]
>>> <https://plus.google.com/+ironsrc>
>>> This email (including any attachments) is for the sole use of the
>>> intended recipient and may contain confidential information which may be
>>> protected by legal privilege. If you are not the intended recipient, or the
>>> employee or agent responsible for delivering it to the intended recipient,
>>> you are hereby notified that any use, dissemination, distribution or
>>> copying of this communication and/or its content is strictly prohibited. If
>>> you are not the intended recipient, please immediately notify us by reply
>>> email or by telephone, delete this email and destroy any copies. Thank you.
>>>
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>

Re: Using Percentile in Spark SQL

Posted by Patrick McCarthy <pm...@dstillery.com.INVALID>.

Depending on your tolerance for error you could also use
percentile_approx().

On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <gr...@gmail.com>
wrote:

> Do you mean that you are trying to compute the percent rank of some data?
> You can use the SparkSQL percent_rank function for that, but I don't think
> that's going to give you any improvement over calling the percentRank
> function on the data frame. Are you currently using a user-defined function
> for this task? Because I bet that's what's slowing you down.
>
> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tz...@ironsrc.com> wrote:
>
>> Hi,
>>
>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>> percentile function. I'm trying to improve this job by moving it to run
>> with spark SQL.
>>
>> Any suggestions on how to use a percentile function in Spark?
>>
>>
>> Thanks,
>> --
>> Tzahi File
>> Data Engineer
>> [image: ironSource] <http://www.ironsrc.com/>
>>
>> email tzahi.file@ironsrc.com
>> mobile +972-546864835
>> fax +972-77-5448273
>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>> ironsrc.com <http://www.ironsrc.com/>
>> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
>> twitter] <https://twitter.com/ironsource>[image: facebook]
>> <https://www.facebook.com/ironSource>[image: googleplus]
>> <https://plus.google.com/+ironsrc>
>> This email (including any attachments) is for the sole use of the
>> intended recipient and may contain confidential information which may be
>> protected by legal privilege. If you are not the intended recipient, or the
>> employee or agent responsible for delivering it to the intended recipient,
>> you are hereby notified that any use, dissemination, distribution or
>> copying of this communication and/or its content is strictly prohibited. If
>> you are not the intended recipient, please immediately notify us by reply
>> email or by telephone, delete this email and destroy any copies. Thank you.
>>
>
>
> --
> http://www.google.com/profiles/grapesmoker
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Re: Using Percentile in Spark SQL

Posted by Jerry Vinokurov <gr...@gmail.com>.

Do you mean that you are trying to compute the percent rank of some data?
You can use the SparkSQL percent_rank function for that, but I don't think
that's going to give you any improvement over calling the percentRank
function on the data frame. Are you currently using a user-defined function
for this task? Because I bet that's what's slowing you down.

On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tz...@ironsrc.com> wrote:

> Hi,
>
> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
> percentile function. I'm trying to improve this job by moving it to run
> with spark SQL.
>
> Any suggestions on how to use a percentile function in Spark?
>
>
> Thanks,
> --
> Tzahi File
> Data Engineer
> [image: ironSource] <http://www.ironsrc.com/>
>
> email tzahi.file@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com <http://www.ironsrc.com/>
> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> twitter] <https://twitter.com/ironsource>[image: facebook]
> <https://www.facebook.com/ironSource>[image: googleplus]
> <https://plus.google.com/+ironsrc>
> This email (including any attachments) is for the sole use of the intended
> recipient and may contain confidential information which may be protected
> by legal privilege. If you are not the intended recipient, or the employee
> or agent responsible for delivering it to the intended recipient, you are
> hereby notified that any use, dissemination, distribution or copying of
> this communication and/or its content is strictly prohibited. If you are
> not the intended recipient, please immediately notify us by reply email or
> by telephone, delete this email and destroy any copies. Thank you.
>


-- 
http://www.google.com/profiles/grapesmoker