Posted to user@spark.apache.org by Uri Laserson <la...@cloudera.com> on 2014/02/01 01:33:23 UTC
Distributed streaming quantiles with PySpark
Hi everyone,
I implemented a version of distributed streaming quantiles for PySpark. It
uses a count-min sketch approach. You can find the code here:
https://github.com/laserson/dsq
Thought it might be of interest...
Uri
--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com
Re: Distributed streaming quantiles with PySpark
Posted by Nick Pentreath <ni...@gmail.com>.
Thanks Uri. I came across that and took a quick look; it seems interesting.
On a related note, it would be quite cool to have a port of Algebird (or at least count-min, top-k and HLL, and perhaps a Bloom filter) to Python, with monoid-style structures for use in PySpark...
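
To make "monoid-style" concrete, here is a rough sketch of what a count-min sketch with an identity element and an associative merge might look like in plain Python. This is not code from dsq or Algebird; the class name CountMinSketch, the helper sketch_partition, and the width/depth defaults are all made up for illustration. It hashes with hashlib rather than the built-in hash() so that separate worker processes agree on bucket positions.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Hypothetical count-min sketch; not the dsq or Algebird implementation."""

    def __init__(self, width=1000, depth=5, table=None):
        self.width = width
        self.depth = depth
        # A fresh sketch is all zeros, which makes it the identity for merge().
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the item with the row number to get one hash function per row.
        key = ("%d:%s" % (row, item)).encode("utf-8")
        return int(hashlib.md5(key).hexdigest(), 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count
        return self

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum never underestimates.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other):
        # Element-wise addition: associative and commutative.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        table = [[a + b for a, b in zip(r1, r2)]
                 for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)


def sketch_partition(items, width=1000, depth=5):
    """Build one sketch from an iterable of items (one partition's worth)."""
    sketch = CountMinSketch(width, depth)
    for item in items:
        sketch.add(item)
    return sketch


# Local stand-in for partitions of an RDD, including an empty one.
partitions = [["a", "b", "a"], ["a", "c"], []]
merged = reduce(lambda x, y: x.merge(y),
                [sketch_partition(p) for p in partitions],
                CountMinSketch())
```

In PySpark the same merge could presumably be driven by something like rdd.mapPartitions(lambda it: [sketch_partition(it)]).reduce(lambda x, y: x.merge(y)), assuming the class is importable on the workers.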