You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Grega Kešpret <gr...@celtra.com> on 2015/06/10 23:53:44 UTC

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

I have some time to work on it now. What's a good way to continue the
discussions before coding it?

This e-mail list, JIRA or something else?

On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rx...@databricks.com> wrote:

> I think those are great to have. I would put them in the DataFrame API
> though, since this is applying to structured data. Many of the advanced
> functions on the PairRDDFunctions should really go into the DataFrame API
> now we have it.
>
> One thing that would be great to understand is what state-of-the-art
> alternatives are out there. I did a quick google scholar search using the
> keyword "approximate quantile" and found some older papers. Just the
> first few I found:
>
> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>  by Bruce Lindsay, IBM
>
> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>
>
>
>
>
> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> Hi!
>>
>> I'd like to get community's opinion on implementing a generic quantile
>> approximation algorithm for Spark that is O(n) and requires limited memory.
>> I would find it useful and I haven't found any existing implementation. The
>> plan was basically to wrap t-digest
>> <https://github.com/tdunning/t-digest>, implement the
>> serialization/deserialization boilerplate and provide
>>
>> def cdf(x: Double): Double
>> def quantile(q: Double): Double
>>
>>
>> on RDD[Double] and RDD[(K, Double)].
>>
>> Let me know what you think. Any other ideas/suggestions also welcome!
>>
>> Best,
>> Grega
>> --
>> [image: Inline image 1]*Grega Kešpret*
>> Senior Software Engineer, Analytics
>>
>> Skype: gregakespret
>> celtra.com <http://www.celtra.com/> | @celtramobile
>> <http://www.twitter.com/celtramobile>
>>
>>
>

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Posted by Reynold Xin <rx...@databricks.com>.
One way for this to happen is to have the intermediate data for the
aggregate function be a byte array operated using Unsafe -- that plays very
nicely with the binary data processing we are doing (i.e. fast
serialization, no gc).

The downside is that we'd need to re-implement whatever algorithm is out
there in order for them to be usable with Unsafe/byte arrays.



On Thu, Jun 18, 2015 at 1:22 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> If it's going into the DataFrame API (which it probably should rather than
> in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT)
> which would mean it doesn't have to implement Serializable, as it appears
> that serialization is taken care of in the UDT def (e.g.
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
> )
>
> If I understand correctly UDT SerDe correctly?
>
> On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <
> rortigas@linkedin.com.invalid> wrote:
>
>> Hi Grega and Reynold,
>>
>> Grega, if you still want to use t-digest, I filed this PR because I
>> thought your t-digest suggestion was a good idea.
>>
>> https://github.com/tdunning/t-digest/pull/56
>>
>> If it is helpful feel free to do whatever with it.
>>
>> Regards,
>> Ray
>>
>>
>> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> This email is good. Just one note -- a lot of people are swamped right
>>> before Spark Summit, so you might not get prompt responses this week.
>>>
>>>
>>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> I have some time to work on it now. What's a good way to continue the
>>>> discussions before coding it?
>>>>
>>>> This e-mail list, JIRA or something else?
>>>>
>>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> I think those are great to have. I would put them in the DataFrame API
>>>>> though, since this is applying to structured data. Many of the advanced
>>>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>>>> now we have it.
>>>>>
>>>>> One thing that would be great to understand is what state-of-the-art
>>>>> alternatives are out there. I did a quick google scholar search using the
>>>>> keyword "approximate quantile" and found some older papers. Just the
>>>>> first few I found:
>>>>>
>>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>>>>>
>>>>>
>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>>>  by Bruce Lindsay, IBM
>>>>>
>>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com>
>>>>> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I'd like to get community's opinion on implementing a generic
>>>>>> quantile approximation algorithm for Spark that is O(n) and requires
>>>>>> limited memory. I would find it useful and I haven't found any existing
>>>>>> implementation. The plan was basically to wrap t-digest
>>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>>> serialization/deserialization boilerplate and provide
>>>>>>
>>>>>> def cdf(x: Double): Double
>>>>>> def quantile(q: Double): Double
>>>>>>
>>>>>>
>>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>>
>>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>>
>>>>>> Best,
>>>>>> Grega
>>>>>> --
>>>>>> [image: Inline image 1]*Grega Kešpret*
>>>>>> Senior Software Engineer, Analytics
>>>>>>
>>>>>> Skype: gregakespret
>>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>>> <http://www.twitter.com/celtramobile>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Posted by Nick Pentreath <ni...@gmail.com>.
If it's going into the DataFrame API (which it probably should rather than
in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT)
which would mean it doesn't have to implement Serializable, as it appears
that serialization is taken care of in the UDT def (e.g.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
)

If I understand correctly UDT SerDe correctly?

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas <ro...@linkedin.com.invalid>
wrote:

> Hi Grega and Reynold,
>
> Grega, if you still want to use t-digest, I filed this PR because I
> thought your t-digest suggestion was a good idea.
>
> https://github.com/tdunning/t-digest/pull/56
>
> If it is helpful feel free to do whatever with it.
>
> Regards,
> Ray
>
>
> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> This email is good. Just one note -- a lot of people are swamped right
>> before Spark Summit, so you might not get prompt responses this week.
>>
>>
>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>>
>>> I have some time to work on it now. What's a good way to continue the
>>> discussions before coding it?
>>>
>>> This e-mail list, JIRA or something else?
>>>
>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> I think those are great to have. I would put them in the DataFrame API
>>>> though, since this is applying to structured data. Many of the advanced
>>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>>> now we have it.
>>>>
>>>> One thing that would be great to understand is what state-of-the-art
>>>> alternatives are out there. I did a quick google scholar search using the
>>>> keyword "approximate quantile" and found some older papers. Just the
>>>> first few I found:
>>>>
>>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>>>>
>>>>
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>>  by Bruce Lindsay, IBM
>>>>
>>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I'd like to get community's opinion on implementing a generic quantile
>>>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>>>> I would find it useful and I haven't found any existing implementation. The
>>>>> plan was basically to wrap t-digest
>>>>> <https://github.com/tdunning/t-digest>, implement the
>>>>> serialization/deserialization boilerplate and provide
>>>>>
>>>>> def cdf(x: Double): Double
>>>>> def quantile(q: Double): Double
>>>>>
>>>>>
>>>>> on RDD[Double] and RDD[(K, Double)].
>>>>>
>>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>>
>>>>> Best,
>>>>> Grega
>>>>> --
>>>>> [image: Inline image 1]*Grega Kešpret*
>>>>> Senior Software Engineer, Analytics
>>>>>
>>>>> Skype: gregakespret
>>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>>> <http://www.twitter.com/celtramobile>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Posted by Ray Ortigas <ro...@linkedin.com.INVALID>.
Hi Grega and Reynold,

Grega, if you still want to use t-digest, I filed this PR because I thought
your t-digest suggestion was a good idea.

https://github.com/tdunning/t-digest/pull/56

If it is helpful feel free to do whatever with it.

Regards,
Ray


On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin <rx...@databricks.com> wrote:

> This email is good. Just one note -- a lot of people are swamped right
> before Spark Summit, so you might not get prompt responses this week.
>
>
> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> I have some time to work on it now. What's a good way to continue the
>> discussions before coding it?
>>
>> This e-mail list, JIRA or something else?
>>
>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> I think those are great to have. I would put them in the DataFrame API
>>> though, since this is applying to structured data. Many of the advanced
>>> functions on the PairRDDFunctions should really go into the DataFrame API
>>> now we have it.
>>>
>>> One thing that would be great to understand is what state-of-the-art
>>> alternatives are out there. I did a quick google scholar search using the
>>> keyword "approximate quantile" and found some older papers. Just the
>>> first few I found:
>>>
>>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>>>
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>>  by Bruce Lindsay, IBM
>>>
>>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I'd like to get community's opinion on implementing a generic quantile
>>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>>> I would find it useful and I haven't found any existing implementation. The
>>>> plan was basically to wrap t-digest
>>>> <https://github.com/tdunning/t-digest>, implement the
>>>> serialization/deserialization boilerplate and provide
>>>>
>>>> def cdf(x: Double): Double
>>>> def quantile(q: Double): Double
>>>>
>>>>
>>>> on RDD[Double] and RDD[(K, Double)].
>>>>
>>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>>
>>>> Best,
>>>> Grega
>>>> --
>>>> [image: Inline image 1]*Grega Kešpret*
>>>> Senior Software Engineer, Analytics
>>>>
>>>> Skype: gregakespret
>>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>>> <http://www.twitter.com/celtramobile>
>>>>
>>>>
>>>
>>
>

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

Posted by Reynold Xin <rx...@databricks.com>.
This email is good. Just one note -- a lot of people are swamped right
before Spark Summit, so you might not get prompt responses this week.


On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret <gr...@celtra.com> wrote:

> I have some time to work on it now. What's a good way to continue the
> discussions before coding it?
>
> This e-mail list, JIRA or something else?
>
> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I think those are great to have. I would put them in the DataFrame API
>> though, since this is applying to structured data. Many of the advanced
>> functions on the PairRDDFunctions should really go into the DataFrame API
>> now we have it.
>>
>> One thing that would be great to understand is what state-of-the-art
>> alternatives are out there. I did a quick google scholar search using the
>> keyword "approximate quantile" and found some older papers. Just the
>> first few I found:
>>
>> http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs
>>
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
>>  by Bruce Lindsay, IBM
>>
>> http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
>>
>>
>>
>>
>>
>> On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>
>>> Hi!
>>>
>>> I'd like to get community's opinion on implementing a generic quantile
>>> approximation algorithm for Spark that is O(n) and requires limited memory.
>>> I would find it useful and I haven't found any existing implementation. The
>>> plan was basically to wrap t-digest
>>> <https://github.com/tdunning/t-digest>, implement the
>>> serialization/deserialization boilerplate and provide
>>>
>>> def cdf(x: Double): Double
>>> def quantile(q: Double): Double
>>>
>>>
>>> on RDD[Double] and RDD[(K, Double)].
>>>
>>> Let me know what you think. Any other ideas/suggestions also welcome!
>>>
>>> Best,
>>> Grega
>>> --
>>> [image: Inline image 1]*Grega Kešpret*
>>> Senior Software Engineer, Analytics
>>>
>>> Skype: gregakespret
>>> celtra.com <http://www.celtra.com/> | @celtramobile
>>> <http://www.twitter.com/celtramobile>
>>>
>>>
>>
>