Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/04/17 08:53:40 UTC

Misfires in OnlineSummarizer

If you add the Java methods at the bottom to the
org.apache.mahout.stats.OnlineSummarizer and run the main(), a funny
thing prints out:

[(count=200.0),(sd=28.8660),(mean=49.5000),(min=0.0),(25%=34.1312),(median=60.2104),(75%=83.8722),(max=99.0),]

I added the numbers 0-99 twice to the summarizer. I would have
expected 25%=25 +/- 1, median=50 +/- 1, and 75%=75 +/- 1.
Note that the mean is correct.
---------------------------------------------------------------------------

  @Override
  public String toString() {
    return "["
        + pair("count", getCount()) + pair("sd", getSD()) + pair("mean", getMean())
        + pair("min", getMin()) + pair("25%", getQuartile(1))
        + pair("median", getMedian())
        + pair("75%", getQuartile(3)) + pair("max", getMax()) + "]";
  }

  // Formats one "(tag=value)," entry, truncating long values to 7 characters.
  private String pair(String tag, double value) {
    String s = Double.toString(value);
    if (s.length() > 8) {
      s = s.substring(0, 7);
    }
    return "(" + tag + "=" + s + "),";
  }

  public static void main(String[] args) {
    OnlineSummarizer osQ = new OnlineSummarizer();
    // Add the values 0..99 twice, in consecutive order.
    for (int i = 0; i < 200; i++) {
      osQ.add(i % 100);
    }
    System.out.println(osQ.toString());
  }

-- 
Lance Norskog
goksron@gmail.com

Re: Misfires in OnlineSummarizer

Posted by Ted Dunning <te...@gmail.com>.
That is a problem.  Indeed, I don't think it is possible for an online
quantile estimator to be completely correct when the input arrives in sorted
order.  As Lance noted, however, the effect subsides with repetitions.

There is a related issue with all of the online learning and estimation code,
including OnlineAuc, OnlineLogisticRegression, CrossFoldLearner and
AdaptiveLogisticRegression.  There are probably similar issues even in the
k-means code.

It is good practice to randomize the order of the data somewhat.
TrainNewsGroups reads the training data in randomized order.  For very large
training sets, a full shuffle is probably impractical.  For that purpose, I
have sometimes used a buffer that I re-order on the fly.  That provides a
windowed kind of permutation that can help a lot.
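
A rough sketch of such a windowed re-ordering buffer follows. It is not
code from Mahout; the class name WindowedShuffle, the method shuffle, and
the window parameter are made up for illustration. Once the buffer is
full, each incoming item replaces a randomly chosen buffered item, which
is passed downstream; the leftovers are flushed in random order at the end.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical helper, not part of Mahout: streams items to a sink in
// locally randomized order while holding at most `window` items in memory.
public class WindowedShuffle {

  public static <T> void shuffle(Iterator<T> input, int window, Random rng,
                                 Consumer<? super T> sink) {
    if (window < 1) {
      throw new IllegalArgumentException("window must be positive");
    }
    List<T> buffer = new ArrayList<>(window);
    while (input.hasNext()) {
      T item = input.next();
      if (buffer.size() < window) {
        // Still filling the window: nothing is emitted yet.
        buffer.add(item);
      } else {
        // Swap the new item in for a random buffered one and emit the old one.
        int j = rng.nextInt(window);
        sink.accept(buffer.set(j, item));
      }
    }
    // End of input: flush whatever is left, also in random order.
    Collections.shuffle(buffer, rng);
    for (T item : buffer) {
      sink.accept(item);
    }
  }

  public static void main(String[] args) {
    // Same input as the test case in this thread: 0..99 twice, in order.
    List<Integer> in = new ArrayList<>();
    for (int i = 0; i < 200; i++) {
      in.add(i % 100);
    }
    List<Integer> out = new ArrayList<>();
    shuffle(in.iterator(), 50, new Random(42), out::add);
    System.out.println(out);
  }
}
```

With this scheme an item can leave the buffer at most window - 1 positions
ahead of where it arrived, but it may be delayed arbitrarily long, so the
result is only a local permutation and not a uniform shuffle.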

Occasionally, this sensitivity to order can be very helpful, such as when we
want to emphasize recent or older data, but mostly it is a dangerous thing.

On Sun, Apr 17, 2011 at 9:13 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> If I read Lance's code correctly, he indeed gives them consecutively.

Re: Misfires in OnlineSummarizer

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If I read Lance's code correctly, he indeed gives them consecutively.

On Sun, Apr 17, 2011 at 2:25 AM, Ted Dunning <te...@gmail.com> wrote:
> Yeah...
>
> What Sean says.  The inaccuracy surprises me a bit, but it is outside the
> intended usage.
>
> Did you give the values in random order or in consecutive order?  If they
> are consecutive, then I am not worried at all.  If you got this error from
> random ordering, I am a bit more unhappy.

Re: Misfires in OnlineSummarizer

Posted by Ted Dunning <te...@gmail.com>.
Yeah...

What Sean says.  The inaccuracy surprises me a bit, but it is outside the
intended usage.

Did you give the values in random order or in consecutive order?  If they
are consecutive, then I am not worried at all.  If you got this error from
random ordering, I am a bit more unhappy.

On Sun, Apr 17, 2011 at 2:21 AM, Sean Owen <sr...@gmail.com> wrote:

> The implementation is intentionally an approximation which uses
> constant memory, instead of tracking the entire data set, which is
> necessary to get an exact answer. You should find it converges to the
> expected values with more data.

Re: Misfires in OnlineSummarizer

Posted by Sean Owen <sr...@gmail.com>.
The implementation is intentionally an approximation that uses
constant memory instead of tracking the entire data set, which would be
necessary to get an exact answer. You should find it converges to the
expected values with more data.
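
For comparison, here is a minimal sketch of the exact computation, which
has to keep and sort every value. It is not Mahout code; the class
ExactQuantiles and its quantile method are hypothetical, and linear
interpolation between neighbouring order statistics is only one of
several common quantile definitions.

```java
import java.util.Arrays;

// Hypothetical reference implementation, not part of Mahout: exact
// quantiles computed from the full, sorted data set (O(n) memory).
public class ExactQuantiles {

  /** Quantile q in [0, 1] of an ascending sorted array, linearly interpolated. */
  public static double quantile(double[] sorted, double q) {
    if (sorted.length == 0) {
      throw new IllegalArgumentException("no data");
    }
    double pos = q * (sorted.length - 1);          // fractional index
    int lo = (int) Math.floor(pos);
    int hi = Math.min(lo + 1, sorted.length - 1);
    double frac = pos - lo;
    return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
  }

  public static void main(String[] args) {
    // Same data as in the original post: 0..99 added twice.
    double[] data = new double[200];
    for (int i = 0; i < 200; i++) {
      data[i] = i % 100;
    }
    Arrays.sort(data);
    // Prints: 25%=24.75 median=49.5 75%=74.25
    System.out.println("25%=" + quantile(data, 0.25)
        + " median=" + quantile(data, 0.5)
        + " 75%=" + quantile(data, 0.75));
  }
}
```

Under this definition the exact quartiles of the test data are within the
+/- 1 range the original post expected, at the cost of storing all 200
values.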


Re: Misfires in OnlineSummarizer

Posted by Lance Norskog <go...@gmail.com>.
Increasing the count from 200 to 2000 and upwards drives the
25%/median/75% numbers towards 25/50/75.


-- 
Lance Norskog
goksron@gmail.com