Posted to user@mahout.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2013/08/07 13:21:03 UTC

Is OnlineSummarizer mergeable?

Hi,

Is the OnlineSummarizer algo "mergeable"?

Say that we compute a percentile for some metric for time 12:00-12:01
and store that somewhere, then we compute it for 12:01-12:02 and store
that separately, and so on.

Can we then later merge these computed and previously stored
percentile "instances" and get an accurate value?

Thanks,
Otis
--
Performance Monitoring -- http://sematext.com/spm
Solr & ElasticSearch Support -- http://sematext.com/

Re: Is OnlineSummarizer mergeable?

Posted by Ted Dunning <te...@gmail.com>.
Ouch.

You didn't mention accuracy.  I will assume a standard sort of 2-3%
accuracy or better and let you correct me if necessary.

I could meet all but one or two of those requirements several different
ways.

For instance, very high or low quantiles can be handled with stacked min-sets
or max-sets.  The idea is that you keep the highest k values of the raw data,
the highest k values of a 10x downsampled copy of the data, and so on.  This
works pretty well down to about the 90th percentile (or, with min-sets, up to
about the 10th percentile).  This structure merges without loss of accuracy.
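
A minimal sketch of the stacked max-set idea, for illustration only. It is
not code from Mahout or stream-lib; the class name StackedMaxSets, the
parameters k, levels and seed, and the use of random 1-in-10 sampling are all
assumptions. Each level keeps the k largest values seen at its sampling rate,
and merging re-inserts one side's retained values into the other.

```python
import heapq
import random


class StackedMaxSets:
    """Top-k values at several sampling rates: 1, 1/10, 1/100, ... (hypothetical)."""

    def __init__(self, k=100, levels=3, seed=0):
        self.k = k
        self.n = 0                                   # total values seen
        self.heaps = [[] for _ in range(levels)]     # min-heaps holding the largest k
        self.rng = random.Random(seed)

    def _push(self, level, x):
        heap = self.heaps[level]
        if len(heap) < self.k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)               # evict the smallest retained value

    def add(self, x):
        self.n += 1
        u = self.rng.random()
        for level in range(len(self.heaps)):
            if u < 10.0 ** -level:                   # level 0 always, level 1 about 1 in 10, ...
                self._push(level, x)

    def merge(self, other):
        """Fold another structure (same k and levels assumed) into this one."""
        self.n += other.n
        for level, heap in enumerate(other.heaps):
            for x in heap:
                self._push(level, x)

    def quantile(self, q):
        """Estimate a high quantile, e.g. q=0.99."""
        above = (1.0 - q) * self.n                   # values expected above the answer
        for level, heap in enumerate(self.heaps):
            index = int(above / 10 ** level)         # same rank, scaled to this sample
            if index < len(heap):
                return sorted(heap, reverse=True)[index]
        raise ValueError("quantile too low for this k and number of levels")
```

Because the top k of a union is always contained in the union of the two
top-k sets, merging loses nothing relative to feeding all the data to a single
instance that made the same sampling decisions. A min-set version for low
quantiles would mirror this with the smallest k values.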

For quantiles that are fixed in advance, like 25-50-75, the Mahout
OnlineSummarizer is excellent.  You can choose an arbitrary quantile ahead of
time and you can sometimes merge (but perverse data can kill you).

And then there is the QDigest.  It is, by definition, as big as a QDigest, but
it is mergeable and allows any quantile.  Also cool is the fact that you can
pick the quantile late in the process.

Maybe the answer is to make the QDigest structure smaller.  How well is the
stream-lib implementation cranked down?  Is it really tight?




On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

> Hi Ted,
>
> I need percentiles.  Ideally not pre-defined ones, because one person may
> want e.g. 70th pctile, while somebody else might want 75th pctile for the
> same metric.
>
> Deal breakers:
> High memory footprint. ("high" means "higher than QDigest from stream-lib"
> for us.... and we could test and compare with QDigest relatively easily
> with live data)
> Algos that create data structures that cannot be merged
> Loss of accuracy that is not predictably small or configurable
>
> Thank you,
> Otis
> ----
>
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm

Re: Is OnlineSummarizer mergeable?

Posted by Ted Dunning <te...@gmail.com>.
I was about to point you at that pull request.  How droll.

Didn't know it was from you guys.


On Thu, Aug 8, 2013 at 3:35 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

> Hi Ted,
>
> Yes, that's what we did recently, too:
> https://github.com/clearspring/stream-lib/pull/47
>
> ... but it's still a little too phat...which is what made me think of your
> OnlineSummarizer as a possible, slimmer alternative.
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm

Re: Is OnlineSummarizer mergeable?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Ted,

Yes, that's what we did recently, too: https://github.com/clearspring/stream-lib/pull/47

... but it's still a little too phat...which is what made me think of your OnlineSummarizer as a possible, slimmer alternative.

Otis 
----
Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm 




>________________________________
> From: Ted Dunning <te...@gmail.com>
>To: "user@mahout.apache.org" <us...@mahout.apache.org>; Otis Gospodnetic <ot...@yahoo.com> 
>Sent: Thursday, August 8, 2013 8:27 AM
>Subject: Re: Is OnlineSummarizer mergeable?
> 
>
>
>I just looked at the source for QDigest from streamlib.
>
>
>I think that the memory usage could be trimmed substantially, possibly by as much as 5:1 by using more primitive friendly structures.

Re: Is OnlineSummarizer mergeable?

Posted by Ted Dunning <te...@gmail.com>.
I just looked at the source for QDigest from stream-lib.

I think that the memory usage could be trimmed substantially, possibly by
as much as 5:1, by using more primitive-friendly structures.



On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

> Hi Ted,
>
> I need percentiles.  Ideally not pre-defined ones, because one person may
> want e.g. 70th pctile, while somebody else might want 75th pctile for the
> same metric.
>
> Deal breakers:
> High memory footprint. ("high" means "higher than QDigest from stream-lib"
> for us.... and we could test and compare with QDigest relatively easily
> with live data)
> Algos that create data structures that cannot be merged
> Loss of accuracy that is not predictably small or configurable
>
> Thank you,
> Otis
> ----
>
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm

Re: Is OnlineSummarizer mergeable?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Ted,

I need percentiles.  Ideally not pre-defined ones, because one person may want e.g. 70th pctile, while somebody else might want 75th pctile for the same metric.

Deal breakers:
High memory footprint. ("high" means "higher than QDigest from stream-lib" for us.... and we could test and compare with QDigest relatively easily with live data)
Algos that create data structures that cannot be merged
Loss of accuracy that is not predictably small or configurable

Thank you,
Otis
----

Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm 




>________________________________
> From: Ted Dunning <te...@gmail.com>
>To: "user@mahout.apache.org" <us...@mahout.apache.org>; Otis Gospodnetic <ot...@yahoo.com> 
>Sent: Wednesday, August 7, 2013 11:48 PM
>Subject: Re: Is OnlineSummarizer mergeable?
> 
>
>
>Otis,
>
>
>What statistics do you need?
>
>
>What guarantees?

Re: Is OnlineSummarizer mergeable?

Posted by Ted Dunning <te...@gmail.com>.
Otis,

What statistics do you need?

What guarantees?



On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

> Hi Ted,
>
> I'm actually trying to find an alternative to QDigest (the stream-lib impl
> specifically) because even though it seems good, we have to deal with crazy
> volumes of data in SPM (performance monitoring service, see signature)...
> I'm hoping we can find something that has both a lower memory footprint
> than QDigest AND that is mergeable a la QDigest.  Utopia?
>
> Thanks,
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm

Re: Is OnlineSummarizer mergeable?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Ted,

I'm actually trying to find an alternative to QDigest (the stream-lib impl specifically) because even though it seems good, we have to deal with crazy volumes of data in SPM (performance monitoring service, see signature)... I'm hoping we can find something that has both a lower memory footprint than QDigest AND that is mergeable a la QDigest.  Utopia?

Thanks,
Otis
----
Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm 




>________________________________
> From: Ted Dunning <te...@gmail.com>
>To: "user@mahout.apache.org" <us...@mahout.apache.org> 
>Sent: Wednesday, August 7, 2013 4:51 PM
>Subject: Re: Is OnlineSummarizer mergeable?
> 
>
>It isn't as mergeable as I would like.  If you have randomized record
>selection, it should be possible, but perverse ordering can cause serious
>errors.
>
>It would be better to use something like a Q-digest.
>
>http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

Re: Is OnlineSummarizer mergeable?

Posted by Ted Dunning <te...@gmail.com>.
It isn't as mergeable as I would like.  If you have randomized record
selection, it should be possible, but perverse ordering can cause serious
errors.
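
To illustrate the ordering concern with the per-minute scenario from the
question, here is a small sketch. It uses made-up latency numbers and a
hypothetical helper, percentile(); it does not use OnlineSummarizer, and only
shows what can happen when stored percentile values (rather than a mergeable
structure) are combined across windows whose data differ.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 0 to 100."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, pct * len(ordered) // 100)
    return ordered[index]


# Made-up data: a busy, fast minute followed by a quiet, slow one.
window_a = list(range(1, 1001))       # 12:00-12:01, 1000 samples
window_b = list(range(5000, 5010))    # 12:01-12:02, 10 samples

p90_a = percentile(window_a, 90)
p90_b = percentile(window_b, 90)

# Combining the stored per-window results by averaging them...
naive_p90 = (p90_a + p90_b) / 2

# ...versus the percentile of all the underlying data.
true_p90 = percentile(window_a + window_b, 90)

print(p90_a, p90_b, naive_p90, true_p90)
```

With these numbers the averaged value (2955.0) is far from the 90th percentile
of the combined data (910), because the two windows differ in both size and
distribution. Time-ordered data tends to look like this, which is the kind of
non-random split that makes merging risky.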

It would be better to use something like a Q-digest.

http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf




On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> Is OnlineSummarizer algo "mergeable"?
>
> Say that we compute a percentile for some metric for time 12:00-12:01
> and store that somewhere, then we compute it for 12:01-12:02 and store
> that separately, and so on.
>
> Can we then later merge these computed and previously stored
> percentile "instances" and get an accurate value?
>
> Thanks,
> Otis
> --
> Performance Monitoring -- http://sematext.com/spm
> Solr & ElasticSearch Support -- http://sematext.com/
>