You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/03/25 10:19:22 UTC

CompactRunningAverage not quite right

The CompactRunningAverage class is
1) not compact, and
2) not coded right.

The code using 'count' seems to think it is a short. The code assumes
that shorts top out at 65535,
but shorts are signed and thus top out at 32767.
'count' is declared as a 'char' so it will never reach 128, let alone 65536.
The 'count' field uses 4 bytes so it should be an int.
Any object has 16 bytes. The CompactRunningAverage has (at least) 24
bytes. FullRunningAverage has (at least) 32 bytes.

Something that supplies compact running averages should directly
allocate and manage an array of shorts and an array of floats.
MemoryDiffStorage is the only use of CompactRunningAverage in the code
base. It can use "hundreds of thousands" of these counters.
It also needs 8 bytes per counter for each object reference, v.s. 4
bytes per array index.

-- 
Lance Norskog
goksron@gmail.com

Re: CompactRunningAverage not quite right

Posted by Sean Owen <sr...@gmail.com>.
The weird thing is, I get 24 bytes per object in an empirical test for both
just now on a 64-bit machine (not counting the reference size). On a 32-bit
machine it's 16 vs 24 bytes. Counting the reference overhead, the difference
is proportionally smaller. Those aren't the exact numbers I think either of
us are thinking. (It could be some quirk of Apple's JVM too.)

I wouldn't mind just removing CompactRunningAverage if it really makes no
difference on modern machines. It seems obsolete now.

Yes arrays of things are better -- this is why we have PreferenceArray for
instance. But the use case here isn't a "dense" set of averages, it's quite
sparse. There has got to be a way to still save some memory here. It's
probably parallel arrays, plus a mapping from item-item pair to offset into
those parallel arrays. If done carefully I bet it saves memory and would be
a worthwhile upgrade.

On Sat, Mar 26, 2011 at 4:22 AM, Lance Norskog <go...@gmail.com> wrote:

> > I think that's the bit you're missing.
> 8 bits I'm missing, actually. BA-ZING!
>
> 64-bit implementations of Java align on 8-byte boundaries. A 2-byte
> value plus a 4-byte value will be packed into 8 bytes. An 8-byte
> double and a 4-byte int will add 4 useless bytes with alignment,
> giving 16 bytes..
>
> Including 16 bytes per object, Compact v.s. Full is 24 bytes v.s. 32
> bytes. Since there are 2 unused bytes in the Compact object, the
> 16-bit 'char' counter might as well be a 31-positive-bit counter.
> Similarly, the integer counter in FullRunningAverage could be a long
> instead of an int.
>
> For comparison, an empty String is 40 bytes, or 5 8-byte words.
>
> On Fri, Mar 25, 2011 at 2:33 AM, Sean Owen <sr...@gmail.com> wrote:
> > A char is a 2-byte unsigned value in Java, not 1-byte signed -- I
> > think that's the bit you're missing.
> >
> > Ignoring object overhead, there are 2 (char) + 4 (float) = 6 bytes of
> > payload in CompactRunningAverage. There are 4 (int) + 8 (double) = 12
> > bytes of payload in FullRunningAverage.
> >
> > I don't know what you mean that 'count' should be an int.
> >
> > I do agree that it would be more memory-efficient to make a sort of
> > array of these things, in cases where you have a dense sequence of
> > them. However the use case you cite isn't one of those -- those diffs
> > are sparse. You can add into the object the row/col position but then
> > that just eats up the space savings.
> >
> > On Fri, Mar 25, 2011 at 9:19 AM, Lance Norskog <go...@gmail.com>
> wrote:
> >> The CompactRunningAverage class is
> >> 1) not compact, and
> >> 2) not coded right.
> >>
> >> The code using 'count' seems to think it is a short. The code assumes
> >> that shorts top out at 65535,
> >> but shorts are signed and thus top out at 32767.
> >> 'count' is declared as a 'char' so it will never reach 128, let alone
> 65536.
> >> The 'count' field uses 4 bytes so it should be an int.
> >> Any object has 16 bytes. The CompactRunningAverage has (at least) 24
> >> bytes. FullRunningAverage has (at least) 32 bytes.
> >>
> >> Something that supplies compact running averages should directly
> >> allocate and manage an array of shorts and an array of floats.
> >> MemoryDiffStorage is the only use of CompactRunningAverage in the code
> >> base. It can use "hundreds of thousands" of these counters.
> >> It also needs 8 bytes per counter for each object reference, v.s. 4
> >> bytes per array index.
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: CompactRunningAverage not quite right

Posted by Lance Norskog <go...@gmail.com>.
MemoryDiffStorage is the only class in Mahout that uses
CompactRunningAverage. If all of the compact averages are in two
arrays, the benefits of hardware caching go way up. This seems a more
compelling reason to recode MemoryDiffStorage :)

Lance

On Fri, Mar 25, 2011 at 9:22 PM, Lance Norskog <go...@gmail.com> wrote:
>> I think that's the bit you're missing.
> 8 bits I'm missing, actually. BA-ZING!
>
> 64-bit implementations of Java align on 8-byte boundaries. A 2-byte
> value plus a 4-byte value will be packed into 8 bytes. An 8-byte
> double and a 4-byte int will add 4 useless bytes with alignment,
> giving 16 bytes..
>
> Including 16 bytes per object, Compact v.s. Full is 24 bytes v.s. 32
> bytes. Since there are 2 unused bytes in the Compact object, the
> 16-bit 'char' counter might as well be a 31-positive-bit counter.
> Similarly, the integer counter in FullRunningAverage could be a long
> instead of an int.
>
> For comparison, an empty String is 40 bytes, or 5 8-byte words.
>
> On Fri, Mar 25, 2011 at 2:33 AM, Sean Owen <sr...@gmail.com> wrote:
>> A char is a 2-byte unsigned value in Java, not 1-byte signed -- I
>> think that's the bit you're missing.
>>
>> Ignoring object overhead, there are 2 (char) + 4 (float) = 6 bytes of
>> payload in CompactRunningAverage. There are 4 (int) + 8 (double) = 12
>> bytes of payload in FullRunningAverage.
>>
>> I don't know what you mean that 'count' should be an int.
>>
>> I do agree that it would be more memory-efficient to make a sort of
>> array of these things, in cases where you have a dense sequence of
>> them. However the use case you cite isn't one of those -- those diffs
>> are sparse. You can add into the object the row/col position but then
>> that just eats up the space savings.
>>
>> On Fri, Mar 25, 2011 at 9:19 AM, Lance Norskog <go...@gmail.com> wrote:
>>> The CompactRunningAverage class is
>>> 1) not compact, and
>>> 2) not coded right.
>>>
>>> The code using 'count' seems to think it is a short. The code assumes
>>> that shorts top out at 65535,
>>> but shorts are signed and thus top out at 32767.
>>> 'count' is declared as a 'char' so it will never reach 128, let alone 65536.
>>> The 'count' field uses 4 bytes so it should be an int.
>>> Any object has 16 bytes. The CompactRunningAverage has (at least) 24
>>> bytes. FullRunningAverage has (at least) 32 bytes.
>>>
>>> Something that supplies compact running averages should directly
>>> allocate and manage an array of shorts and an array of floats.
>>> MemoryDiffStorage is the only use of CompactRunningAverage in the code
>>> base. It can use "hundreds of thousands" of these counters.
>>> It also needs 8 bytes per counter for each object reference, v.s. 4
>>> bytes per array index.
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: CompactRunningAverage not quite right

Posted by Ted Dunning <te...@gmail.com>.
Standard practice is to pack these into separate arrays so that you don't
pay this penalty.

On Fri, Mar 25, 2011 at 9:22 PM, Lance Norskog <go...@gmail.com> wrote:

> An 8-byte
> double and a 4-byte int will add 4 useless bytes with alignment,
> giving 16 bytes..
>

Re: CompactRunningAverage not quite right

Posted by Lance Norskog <go...@gmail.com>.
> I think that's the bit you're missing.
8 bits I'm missing, actually. BA-ZING!

64-bit implementations of Java align on 8-byte boundaries. A 2-byte
value plus a 4-byte value will be packed into 8 bytes. An 8-byte
double and a 4-byte int will add 4 useless bytes with alignment,
giving 16 bytes..

Including 16 bytes per object, Compact v.s. Full is 24 bytes v.s. 32
bytes. Since there are 2 unused bytes in the Compact object, the
16-bit 'char' counter might as well be a 31-positive-bit counter.
Similarly, the integer counter in FullRunningAverage could be a long
instead of an int.

For comparison, an empty String is 40 bytes, or 5 8-byte words.

On Fri, Mar 25, 2011 at 2:33 AM, Sean Owen <sr...@gmail.com> wrote:
> A char is a 2-byte unsigned value in Java, not 1-byte signed -- I
> think that's the bit you're missing.
>
> Ignoring object overhead, there are 2 (char) + 4 (float) = 6 bytes of
> payload in CompactRunningAverage. There are 4 (int) + 8 (double) = 12
> bytes of payload in FullRunningAverage.
>
> I don't know what you mean that 'count' should be an int.
>
> I do agree that it would be more memory-efficient to make a sort of
> array of these things, in cases where you have a dense sequence of
> them. However the use case you cite isn't one of those -- those diffs
> are sparse. You can add into the object the row/col position but then
> that just eats up the space savings.
>
> On Fri, Mar 25, 2011 at 9:19 AM, Lance Norskog <go...@gmail.com> wrote:
>> The CompactRunningAverage class is
>> 1) not compact, and
>> 2) not coded right.
>>
>> The code using 'count' seems to think it is a short. The code assumes
>> that shorts top out at 65535,
>> but shorts are signed and thus top out at 32767.
>> 'count' is declared as a 'char' so it will never reach 128, let alone 65536.
>> The 'count' field uses 4 bytes so it should be an int.
>> Any object has 16 bytes. The CompactRunningAverage has (at least) 24
>> bytes. FullRunningAverage has (at least) 32 bytes.
>>
>> Something that supplies compact running averages should directly
>> allocate and manage an array of shorts and an array of floats.
>> MemoryDiffStorage is the only use of CompactRunningAverage in the code
>> base. It can use "hundreds of thousands" of these counters.
>> It also needs 8 bytes per counter for each object reference, v.s. 4
>> bytes per array index.
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: CompactRunningAverage not quite right

Posted by Sean Owen <sr...@gmail.com>.
A char is a 2-byte unsigned value in Java, not 1-byte signed -- I
think that's the bit you're missing.

Ignoring object overhead, there are 2 (char) + 4 (float) = 6 bytes of
payload in CompactRunningAverage. There are 4 (int) + 8 (double) = 12
bytes of payload in FullRunningAverage.

I don't know what you mean that 'count' should be an int.

I do agree that it would be more memory-efficient to make a sort of
array of these things, in cases where you have a dense sequence of
them. However the use case you cite isn't one of those -- those diffs
are sparse. You can add into the object the row/col position but then
that just eats up the space savings.

On Fri, Mar 25, 2011 at 9:19 AM, Lance Norskog <go...@gmail.com> wrote:
> The CompactRunningAverage class is
> 1) not compact, and
> 2) not coded right.
>
> The code using 'count' seems to think it is a short. The code assumes
> that shorts top out at 65535,
> but shorts are signed and thus top out at 32767.
> 'count' is declared as a 'char' so it will never reach 128, let alone 65536.
> The 'count' field uses 4 bytes so it should be an int.
> Any object has 16 bytes. The CompactRunningAverage has (at least) 24
> bytes. FullRunningAverage has (at least) 32 bytes.
>
> Something that supplies compact running averages should directly
> allocate and manage an array of shorts and an array of floats.
> MemoryDiffStorage is the only use of CompactRunningAverage in the code
> base. It can use "hundreds of thousands" of these counters.
> It also needs 8 bytes per counter for each object reference, v.s. 4
> bytes per array index.
>
> --
> Lance Norskog
> goksron@gmail.com
>