Posted to user@spark.apache.org by Franco Barrientos <fr...@exalitica.com> on 2014/11/27 16:28:27 UTC

Percentile

Hi folks!,

 

Does anyone know how I can calculate the percentile of each element of a
variable in an RDD? I tried to calculate it through Spark SQL with
subqueries, but I think that is impossible in Spark SQL. Any idea will be
welcome.

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrientos@exalitica.com

www.exalitica.com

 


Re: Percentile

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Franco,

The Hive percentile UDAF has been added in the master branch. You can have a
look at it. I think it would work like "select percentile(col_name, 1) from
sigmoid_logs"
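(Editor's note: for intuition, here is a small plain-Scala sketch of what a percentile UDAF computes for a single column. It is exact and in-memory, so it only suits small data, and the linear-interpolation convention is an assumption rather than a guarantee about Hive's implementation.)

```scala
// Exact percentile of a sequence, using linear interpolation between
// the two nearest ranks when p does not land exactly on an element.
def percentile(values: Seq[Double], p: Double): Double = {
  val sorted = values.sorted
  val pos = p * (sorted.length - 1)  // fractional rank for p in [0, 1]
  val lo = math.floor(pos).toInt
  val hi = math.ceil(pos).toInt
  sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
}

percentile(Seq(1.0, 2.0, 3.0, 4.0), 0.5)  // 2.5 (the median)
```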

Thanks
Best Regards

On Thu, Nov 27, 2014 at 8:58 PM, Franco Barrientos <
franco.barrientos@exalitica.com> wrote:

> Hi folks!,
>
>
>
> Anyone known how can I calculate for each elements of a variable in a RDD
> its percentile? I tried to calculate trough Spark SQL with subqueries but I
> think that is imposible in Spark SQL. Any idea will be welcome.
>
>
>
> Thanks in advance,
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrientos@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>

Re: Percentile

Posted by Imran Rashid <im...@therashids.com>.
Hi Franco,

As a fast approximate way to get probability distributions, you might be
interested in t-digests:

https://github.com/tdunning/t-digest

In one pass, you could make a t-digest for each variable, to get its
distribution.  And after that, you could make another pass to map each data
point to its percentile in the distribution.

To create the t-digests, you would do something like this:

import java.nio.ByteBuffer
import com.tdunning.math.stats.{ArrayDigest, TDigest}

val myDataRDD = ...

myDataRDD.mapPartitions{ itr =>
  // build one digest per variable, locally within each partition
  val xDistribution = TDigest.createArrayDigest(32, 100)
  val yDistribution = TDigest.createArrayDigest(32, 100)
  ...
  itr.foreach{ data =>
    xDistribution.add(data.x)
    yDistribution.add(data.y)
    ...
  }

  Seq(
    "x" -> xDistribution,
    "y" -> yDistribution
  ).toIterator.map{ case (k, v) =>
    // serialize each digest to a byte array so it can be shuffled
    val arr = new Array[Byte](v.byteSize)
    v.asBytes(ByteBuffer.wrap(arr))
    k -> arr
  }
}.reduceByKey{ case (t1Arr, t2Arr) =>
  // deserialize both digests, merge them, and re-serialize the result
  val merged = ArrayDigest.fromBytes(ByteBuffer.wrap(t1Arr))
  merged.add(ArrayDigest.fromBytes(ByteBuffer.wrap(t2Arr)))
  val arr = new Array[Byte](merged.byteSize)
  merged.asBytes(ByteBuffer.wrap(arr))
  arr
}


(The complication there is just that t-digests are not directly
serializable, so I need to do the manual work of converting to and from an
array of bytes.)
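
(Editor's note: the snippet above covers only the first pass, which builds and merges the digests; the second pass would look up each value's percentile, e.g. via the digest's cdf. As a self-contained plain-Scala stand-in for that lookup, here is the exact empirical version of the same idea, with illustrative names; a t-digest just answers the same query approximately.)

```scala
// Map each element to its percentile rank (the fraction of values <= it,
// scaled to [0, 100]) -- the exact analogue of querying a digest's cdf.
def percentileRanks(values: Seq[Double]): Seq[(Double, Double)] = {
  val sorted = values.sorted
  values.map { v =>
    val rank = sorted.count(_ <= v).toDouble / sorted.length
    (v, rank * 100.0)
  }
}

percentileRanks(Seq(10.0, 20.0, 30.0, 40.0))
// → Seq((10.0,25.0), (20.0,50.0), (30.0,75.0), (40.0,100.0))
```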


On Thu, Nov 27, 2014 at 9:28 AM, Franco Barrientos <
franco.barrientos@exalitica.com> wrote:

> Hi folks!,
>
>
>
> Anyone known how can I calculate for each elements of a variable in a RDD
> its percentile? I tried to calculate trough Spark SQL with subqueries but I
> think that is imposible in Spark SQL. Any idea will be welcome.
>
>
>
> Thanks in advance,
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrientos@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>