You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2012/04/20 19:44:14 UTC

Quartiles computation with M/R or Pig (combine function states)

Hello,

There should be some way to compile quartiles in a map/reduce fashion
(i.e. with api similar to Pig's Arithmetic custom function) without
keeping enormous count hash?
There's this countsketch thing that i implemented before on map
reduce, but it is sort of like bloom filter: if it gives a wrong
result, the error is fairly huge (in case of bloom filter, 100%) and
to get good results it still requires quite a bit of memory

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Ted Dunning <te...@gmail.com>.
The basic idea is that you would extend the OnlineSummarize to get more
quantiles.  Then you would combine these OnlineSummarizer estimates
weighted by how much data they represent.  This won't work if the data is
perversely ordered.  Hector's suggestions will give you lower accuracy for
random ordered data, but better accuracy in the worst case.

On Fri, Apr 20, 2012 at 2:40 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you, Ted.
>
> On Fri, Apr 20, 2012 at 2:30 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Look at our OnlineSummarizer.  THis should be roughly parallelizable.
> >
> > On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> Thank you, sir. Let me consider this.
> >>
> >> On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee <he...@gmail.com>
> wrote:
> >> > how about this
> >> >
> >> > http://en.wikipedia.org/wiki/Reservoir_sampling
> >> >
> >> > On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >> >wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> There should be some way to compile quartiles in a map/reduce fashion
> >> >> (i.e. with api similar to Pig's Arithmetic custom function) without
> >> >> keeping enormous count hash?
> >> >> There's this countsketch thing that i implemented before on map
> >> >> reduce, but it is sort of like bloom filter: if it gives a wrong
> >> >> result, the error is fairly huge (in case of bloom filter, 100%) and
> >> >> to get good results it still requires quite a bit of memory
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
> >> > Professional Profile <http://www.linkedin.com/in/yeehector>
> >> > http://hectorgon.blogspot.com/ (tech + travel)
> >> > http://hectorgon.com (book reviews)
> >>
>

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Ted.

On Fri, Apr 20, 2012 at 2:30 PM, Ted Dunning <te...@gmail.com> wrote:
> Look at our OnlineSummarizer.  THis should be roughly parallelizable.
>
> On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Thank you, sir. Let me consider this.
>>
>> On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee <he...@gmail.com> wrote:
>> > how about this
>> >
>> > http://en.wikipedia.org/wiki/Reservoir_sampling
>> >
>> > On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >wrote:
>> >
>> >> Hello,
>> >>
>> >> There should be some way to compile quartiles in a map/reduce fashion
>> >> (i.e. with api similar to Pig's Arithmetic custom function) without
>> >> keeping enormous count hash?
>> >> There's this countsketch thing that i implemented before on map
>> >> reduce, but it is sort of like bloom filter: if it gives a wrong
>> >> result, the error is fairly huge (in case of bloom filter, 100%) and
>> >> to get good results it still requires quite a bit of memory
>> >>
>> >
>> >
>> >
>> > --
>> > Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
>> > Professional Profile <http://www.linkedin.com/in/yeehector>
>> > http://hectorgon.blogspot.com/ (tech + travel)
>> > http://hectorgon.com (book reviews)
>>

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Ted Dunning <te...@gmail.com>.
Look at our OnlineSummarizer.  THis should be roughly parallelizable.

On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you, sir. Let me consider this.
>
> On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee <he...@gmail.com> wrote:
> > how about this
> >
> > http://en.wikipedia.org/wiki/Reservoir_sampling
> >
> > On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> Hello,
> >>
> >> There should be some way to compile quartiles in a map/reduce fashion
> >> (i.e. with api similar to Pig's Arithmetic custom function) without
> >> keeping enormous count hash?
> >> There's this countsketch thing that i implemented before on map
> >> reduce, but it is sort of like bloom filter: if it gives a wrong
> >> result, the error is fairly huge (in case of bloom filter, 100%) and
> >> to get good results it still requires quite a bit of memory
> >>
> >
> >
> >
> > --
> > Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
> > Professional Profile <http://www.linkedin.com/in/yeehector>
> > http://hectorgon.blogspot.com/ (tech + travel)
> > http://hectorgon.com (book reviews)
>

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, sir. Let me consider this.

On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee <he...@gmail.com> wrote:
> how about this
>
> http://en.wikipedia.org/wiki/Reservoir_sampling
>
> On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> Hello,
>>
>> There should be some way to compile quartiles in a map/reduce fashion
>> (i.e. with api similar to Pig's Arithmetic custom function) without
>> keeping enormous count hash?
>> There's this countsketch thing that i implemented before on map
>> reduce, but it is sort of like bloom filter: if it gives a wrong
>> result, the error is fairly huge (in case of bloom filter, 100%) and
>> to get good results it still requires quite a bit of memory
>>
>
>
>
> --
> Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
> Professional Profile <http://www.linkedin.com/in/yeehector>
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Hector Yee <he...@gmail.com>.
how about this

http://en.wikipedia.org/wiki/Reservoir_sampling

On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Hello,
>
> There should be some way to compile quartiles in a map/reduce fashion
> (i.e. with api similar to Pig's Arithmetic custom function) without
> keeping enormous count hash?
> There's this countsketch thing that i implemented before on map
> reduce, but it is sort of like bloom filter: if it gives a wrong
> result, the error is fairly huge (in case of bloom filter, 100%) and
> to get good results it still requires quite a bit of memory
>



-- 
Yee Yang Li Hector <https://plus.google.com/106746796711269457249>
Professional Profile <http://www.linkedin.com/in/yeehector>
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Quartiles computation with M/R or Pig (combine function states)

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thanks in advance .

On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Hello,
>
> There should be some way to compile quartiles in a map/reduce fashion
> (i.e. with api similar to Pig's Arithmetic custom function) without
> keeping enormous count hash?
> There's this countsketch thing that i implemented before on map
> reduce, but it is sort of like bloom filter: if it gives a wrong
> result, the error is fairly huge (in case of bloom filter, 100%) and
> to get good results it still requires quite a bit of memory