Posted to user@spark.apache.org by Uri Laserson <la...@cloudera.com> on 2014/02/01 01:33:23 UTC

Distributed streaming quantiles with PySpark

Hi everyone,

I implemented a version of distributed streaming quantiles for PySpark.  It
uses a count-min sketch approach.  You can find the code here:

https://github.com/laserson/dsq
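
For readers who have not seen the technique, here is a minimal sketch of how quantiles can be estimated from count-min sketches. It is not the code in the repository above and not its API: the class names, the dyadic-range layout, the integer domain [0, 2**bits), and the two in-memory lists standing in for RDD partitions are all assumptions made for illustration. Each partition builds its own sketch, and sketches are merged by adding their counter tables, which is what makes the approach distributable.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Fixed-size frequency table; estimates can overcount but never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting the key with the row number.
        digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))

    def merge(self, other):
        # Sketches with identical shape and hashes merge by element-wise addition.
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
        return self


class QuantileSketch:
    """Hypothetical helper: one count-min sketch per dyadic level over [0, 2**bits)."""

    def __init__(self, bits=8):
        self.bits = bits
        self.total = 0
        self.levels = [CountMinSketch() for _ in range(bits)]

    def add(self, value):
        self.total += 1
        for level in range(self.bits):
            # At a given level, value >> level identifies a range of 2**level values.
            self.levels[level].add(value >> level)

    def merge(self, other):
        self.total += other.total
        for mine, theirs in zip(self.levels, other.levels):
            mine.merge(theirs)
        return self

    def quantile(self, q):
        # Walk down the dyadic tree, going left when the left half holds the target rank.
        target = q * self.total
        node = 0
        for level in range(self.bits - 1, -1, -1):
            left = node * 2
            left_count = self.levels[level].estimate(left)
            if target <= left_count:
                node = left
            else:
                target -= left_count
                node = left + 1
        return node


def build(partition):
    sketch = QuantileSketch()
    for value in partition:
        sketch.add(value)
    return sketch


# Two lists stand in for two partitions of a distributed dataset.
partitions = [range(0, 256, 2), range(1, 256, 2)]
merged = reduce(lambda a, b: a.merge(b), map(build, partitions))
median = merged.quantile(0.5)
print(median)  # an estimate; the exact answer for this data is 127
```

In PySpark the same two steps would presumably map onto building one sketch per partition (for example with `mapPartitions`) and combining them with `reduce`; memory use depends on the sketch dimensions rather than on the number of records.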

Thought it might be of interest...

Uri

-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com

Re: Distributed streaming quantiles with PySpark

Posted by Nick Pentreath <ni...@gmail.com>.
Thanks Uri, I came across that and took a quick look, seems interesting.


On a related note, it would be quite cool to have a port of Algebird (or at least count-min, top-k, and HLL, perhaps also Bloom filter) to Python, written monoid-style for use in PySpark...
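
To make "monoid-style" concrete: the structure needs an empty value and an associative combine operation, so that partial results from different partitions can be merged in any grouping. Below is a minimal sketch using a Bloom filter. The class and method names (`BloomFilterMonoid`, `zero`, `create`, `plus`) are hypothetical, loosely modelled on how Algebird names things, and are not an existing Python library.

```python
import hashlib
from functools import reduce


class BloomFilterMonoid:
    """Hypothetical monoid: filters are plain ints used as bit sets, combined with OR."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def zero(self):
        # Identity element: the empty filter.
        return 0

    def create(self, item):
        # A filter containing exactly one item.
        bits = 0
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
            bits |= 1 << (int(digest, 16) % self.num_bits)
        return bits

    def plus(self, left, right):
        # Associative and commutative, so the merge order does not matter.
        return left | right

    def contains(self, bits, item):
        # May return false positives, never false negatives.
        probe = self.create(item)
        return (bits & probe) == probe


bf = BloomFilterMonoid()

# Two lists stand in for two partitions of a distributed dataset.
partitions = [["spark", "pyspark"], ["algebird", "sketch"]]
partials = [reduce(bf.plus, map(bf.create, part), bf.zero()) for part in partitions]
merged_bits = reduce(bf.plus, partials, bf.zero())

print(bf.contains(merged_bits, "pyspark"))
```

With an RDD, the same three pieces would fit the shape of `rdd.aggregate(zeroValue, seqOp, combOp)`: `zero()` as the zero value, a sequence op that adds `create(item)` to the running filter, and `plus` as the combine op.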
