Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2010/05/02 18:02:24 UTC

Optimization opportunity: Speed up serialization and deserialization

I am getting more and more ideas as I try to write about scaling Mahout
clustering. I added a serialize/deserialize benchmark for Vectors and
checked the speed of our vectors.

Here is the output with Cardinality=1000, Sparsity=1000 (dense), numVectors=100,
loop=100 (hence writing 10K vectors of (int, double) entries to disk and
reading them back). Note that the Speed and Rate figures are not raw disk
throughput but the number and serialized size of vectors processed per second,
and the filesystem is a ramdisk.

robinanil$ ls -lh /tmp/*vector
-rwxrwxrwx  1 robinanil     77M May  2 21:25 /tmp/ram/dense-vector
-rwxrwxrwx  1 robinanil    115M May  2 21:25 /tmp/ram/randsparse-vector
-rwxrwxrwx  1 robinanil    115M May  2 21:25 /tmp/ram/seqsparse-vector

Benchmarks        DenseVector           RandSparseVector      SeqSparseVector

Deserialize
  nCalls          10000                 10000                 10000
  sum             1.30432s              2.207437s             1.681144s
  min             0.045ms               0.152ms               0.114ms
  max             74.549ms              8.446ms               3.748ms
  mean            0.130432ms            0.220743ms            0.168114ms
  stdDev          0.904858ms            0.206271ms            0.087123ms
  Speed           7666.83 /sec          4530.1406 /sec        5948.33 /sec
  Rate            92.00197 MB/s         54.361687 MB/s        71.37997 MB/s

Serialize
  nCalls          10000                 10000                 10000
  sum             3.391168s             6.300965s             5.304873s
  min             0.068ms               0.135ms               0.12ms
  max             254.635ms             1183.891ms            639.583ms
  mean            0.339116ms            0.630096ms            0.530487ms
  stdDev          5.558922ms            13.460321ms           8.618806ms
  Speed           2948.8364 /sec        1587.0585 /sec        1885.0592 /sec
  Rate            35.38604 MB/s         19.044703 MB/s        22.620712 MB/s
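
For reference, here is a minimal sketch of the kind of timing loop behind these
numbers, assuming Mahout's DenseVector and VectorWritable from
org.apache.mahout.math and a plain DataOutputStream; the real benchmark also
covers the two sparse vector types, tracks per-call statistics, and writes to
the ramdisk paths above.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SerializationBench {
  public static void main(String[] args) throws IOException {
    int cardinality = 1000;
    int numVectors = 100;
    int loop = 100;                      // 100 x 100 = 10K serialize calls
    Random rng = new Random(42);

    // Build dense vectors with random values (sparsity == cardinality here).
    Vector[] vectors = new Vector[numVectors];
    for (int i = 0; i < numVectors; i++) {
      double[] values = new double[cardinality];
      for (int j = 0; j < cardinality; j++) {
        values[j] = rng.nextDouble();
      }
      vectors[i] = new DenseVector(values);
    }

    // Serialize every vector 'loop' times and time the whole run.
    File file = new File("/tmp/ram/dense-vector");
    long start = System.nanoTime();
    DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)));
    VectorWritable writable = new VectorWritable();
    for (int l = 0; l < loop; l++) {
      for (Vector v : vectors) {
        writable.set(v);
        writable.write(out);
      }
    }
    out.close();

    double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("Speed = %.2f /sec  Rate = %.2f MB/s%n",
        loop * numVectors / seconds,
        file.length() / (seconds * 1024.0 * 1024.0));
  }
}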

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Ted Dunning <te...@gmail.com>.
The really major win would be if we handled integer (especially boolean)
matrices specially.  Attacking the 4-byte cost of the index in a sparse
vector helps, but attacking the 8-byte value would be even better.  For sparse
boolean matrices, the value can go away entirely.

All of these efforts will have the effect of making any downstream
compression less valuable, resulting in much less impressive gains.  The
exception is delta encoding of indexes, which will probably make the
downstream compressor more effective.
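
To make the boolean point concrete, here is a hypothetical sketch, not anything
in Mahout today: a sparse boolean vector only needs its sorted non-zero indexes
on disk, so the 8-byte double per element disappears entirely.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical: a sparse *boolean* vector serialized as indexes only.
public class SparseBooleanVectorWritable implements Writable {

  private int[] indexes = new int[0];   // sorted non-zero positions

  public void set(int[] sortedIndexes) {
    this.indexes = sortedIndexes;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(indexes.length);
    for (int index : indexes) {
      out.writeInt(index);              // 4 bytes each; no value bytes at all
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    indexes = new int[in.readInt()];
    for (int i = 0; i < indexes.length; i++) {
      indexes[i] = in.readInt();
    }
  }
}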

On Sun, May 2, 2010 at 12:45 PM, Drew Farris <dr...@gmail.com> wrote:

> Does anyone have any idea whether greater gains are to be found by finely
> tuning the base encoding vs. relying on some form of SequenceFile block
> compression? (Or do both approaches complement each other nicely?)
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Ted Dunning <te...@gmail.com>.
The net effect will be to decrease the effect of the downstream compressor,
but I would still expect the final result to be a bit smaller with upstream
improvements in representation.  Speed will be better with the better
representations if only because the downstream compressor will have to deal
with fewer bytes which means fewer bytes copied around.

On Sun, May 2, 2010 at 1:13 PM, Sean Owen <sr...@gmail.com> wrote:

> My gut says it's not a step too far to implement variable-length
> encoding. It's a transformation that mostly chops out zero bytes,
> which only aids later compression. It's not as if it's a
> transformation that then compresses worse.
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Sean Owen <sr...@gmail.com>.
It's the same approach to variable-length encoding, yes. Zig-zag is a
trick to make negative numbers "compatible" with this encoding.
Because two's-complement negative numbers start with a bunch of 1s,
their representation is terrible under this variable-length encoding
-- always of maximum length. Zig-zag defines a simple two-way mapping
between all integers and the nonnegative integers, so you can encode
even negative values as positive values, and encode those positive
values effectively.
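
To make the mapping concrete, this is the usual zig-zag transform for 32-bit
ints (the same one protobuf and Avro document); an illustrative sketch, not
existing Mahout code.

public final class ZigZag {
  private ZigZag() { }

  // Maps ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ... so small-magnitude
  // negatives become small non-negative ints before variable-length coding.
  public static int encode(int n) {
    return (n << 1) ^ (n >> 31);    // arithmetic shift copies the sign bit
  }

  public static int decode(int n) {
    return (n >>> 1) ^ -(n & 1);    // logical shift, then restore the sign
  }
}

For example, ZigZag.encode(-1) == 1 and ZigZag.encode(1) == 2, so small negative
and positive values interleave and both stay short under the variable-length
encoding.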

I think you're touching on a good point -- that the logical conclusion
of these compression ideas is yet further out. And, the best encoding
probably does about as well as a good general algorithm like LZ{W,O}.
And one can already tell Hadoop to write out compressed sequence
files. So, why not just leverage that?

My gut says it's not a step too far to implement variable-length
encoding. It's a transformation that mostly chops out zero bytes,
which only aids later compression. It's not as if it's a
transformation that then compresses worse.

Once I get done with MAHOUT-302 (going to be another classic doozy of
a patch from me) I am happy to look into the real effects of this.

On Sun, May 2, 2010 at 8:45 PM, Drew Farris <dr...@gmail.com> wrote:
> Is this what is commonly referred to as zig-zag encoding? Avro uses the same
> technique:
> http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding
>
> For sequential sparse vectors we could get an additional win by delta
> encoding the indexes. This would allow the index, stored as the difference
> from the previous index, to be kept to two bytes in many cases.
>
> Regardless, vint encoding will produce a significant space savings and
> Sean's right: it has also been my experience that space savings often trump
> speed simply because of the speed of network or storage.
>
> Does anyone have any idea whether greater gains are to be found by finely
> tuning the base encoding vs. relying on some form of SequenceFile block
> compression? (Or do both approaches complement each other nicely?)

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Robin Anil <ro...@gmail.com>.
LZO is supposedly the best option, but due to GPL restrictions it was
removed. QuickLZ hasn't yet been integrated into the Hadoop codebase.
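
For what it's worth, a hedged sketch of the "downstream" SequenceFile block
compression Drew mentioned, using the bundled DefaultCodec (zlib) since LZO
isn't shipped; the key type and path here are just illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class BlockCompressedVectors {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // BLOCK compression compresses batches of serialized records together,
    // downstream of whatever per-vector encoding Mahout writes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/vectors.seq"),
        IntWritable.class, VectorWritable.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    try {
      writer.append(new IntWritable(0),
          new VectorWritable(new DenseVector(new double[] {1.0, 0.0, 2.5})));
    } finally {
      writer.close();
    }
  }
}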

Robin

On Mon, May 3, 2010 at 1:15 AM, Drew Farris <dr...@gmail.com> wrote:

> Is this what is commonly referred to as zig-zag encoding? Avro uses the
> same
> technique:
> http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding
>
> For sequential sparse vectors we could get an additional win by delta
> encoding the indexes. This would allow the index, stored as the difference
> from the previous index, to be kept to two bytes in many cases.
>
> Regardless, vint encoding will produce a significant space savings and
> Sean's right: it has also been my experience that space savings often trump
> speed simply because of the speed of network or storage.
>
> Does anyone have any idea whether greater gains are to be found by finely
> tuning the base encoding vs. relying on some form of SequenceFile block
> compression? (Or do both approaches complement each other nicely?)
>
> On Sun, May 2, 2010 at 12:33 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > That's the one! I actually didn't know this was how PBs did the
> > variable length encoding, but it makes sense; it's about the most
> > efficient thing I can imagine.
> >
> > Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
> > the 3 bytes or so it would take the other scheme. Could add up over
> > thousands of elements times millions of vectors.
> >
> > Decoding isn't too slow and if one believes this isn't an unusual
> > encoding, it's not so problematic to use it in a format that others
> > outside Mahout may wish to consume.
> >
> > On Sun, May 2, 2010 at 5:23 PM, Robin Anil <ro...@gmail.com> wrote:
> > > You mean this type of encoding instead?
> > >  http://code.google.com/apis/protocolbuffers/docs/encoding.html
> >
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Drew Farris <dr...@gmail.com>.
Is this what is commonly referred to as zig-zag encoding? Avro uses the same
technique:
http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding

For sequential sparse vectors we could get an additional win by delta
encoding the indexes. This would allow the index, stored as the difference
from the previous index, to be kept to two bytes in many cases.
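
A small sketch of that delta idea, assuming the indexes are already sorted as
in a sequential-access sparse vector; WritableUtils.writeVInt stands in for
whichever variable-length encoding ends up being chosen.

import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

public final class DeltaIndexEncoder {
  private DeltaIndexEncoder() { }

  // Write each index as the gap from the previous one; for sorted indexes the
  // gaps are small, so a variable-length int usually fits in one or two bytes.
  public static void writeDeltas(DataOutput out, int[] sortedIndexes)
      throws IOException {
    WritableUtils.writeVInt(out, sortedIndexes.length);
    int previous = 0;
    for (int index : sortedIndexes) {
      WritableUtils.writeVInt(out, index - previous);
      previous = index;
    }
  }
}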

Regardless, vint encoding will produce a significant space savings and
Sean's right: it has also been my experience that space savings often trump
speed simply because of the speed of network or storage.

Does anyone have any idea whether greater gains are to be found by finely
tuning the base encoding vs. relying on some form of SequenceFile block
compression? (Or do both approaches complement each other nicely?)

On Sun, May 2, 2010 at 12:33 PM, Sean Owen <sr...@gmail.com> wrote:

> That's the one! I actually didn't know this was how PBs did the
> variable length encoding, but it makes sense; it's about the most
> efficient thing I can imagine.
>
> Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
> the 3 bytes or so it would take the other scheme. Could add up over
> thousands of elements times millions of vectors.
>
> Decoding isn't too slow and if one believes this isn't an unusual
> encoding, it's not so problematic to use it in a format that others
> outside Mahout may wish to consume.
>
> On Sun, May 2, 2010 at 5:23 PM, Robin Anil <ro...@gmail.com> wrote:
> > You mean this type of encoding instead?
> >  http://code.google.com/apis/protocolbuffers/docs/encoding.html
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Sean Owen <sr...@gmail.com>.
That much is expected, right? Since it stores a 4-byte index along with
each 8-byte double value, the sparse representation is bigger when more
than 8/(4+8) = 66% of the values are non-default / non-zero.

But variable-length encoding of the index trims a byte or more per element,
depending on your assumptions. It'll still be less efficient past a
certain (higher) point, though that's by design. But a byte per element
adds up at huge scale, so I still find this worth entertaining.
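
Spelling that out with the benchmark's numbers (cardinality 1000), a quick
back-of-the-envelope sketch; the byte counts ignore the few bytes of per-vector
header.

public final class SparseVsDenseSize {
  public static void main(String[] args) {
    int cardinality = 1000;
    long denseBytes = 8L * cardinality;           // one 8-byte double per slot
    for (int nonZeros : new int[] {100, 500, 667, 1000}) {
      long sparseBytes = 12L * nonZeros;          // 4-byte index + 8-byte value
      System.out.printf("nonZeros=%4d  dense=%5d bytes  sparse=%5d bytes%n",
          nonZeros, denseBytes, sparseBytes);
    }
    // Crossover at 8/12 = 66% occupancy; a fully occupied sparse vector is
    // 12000 bytes vs 8000 dense, matching the 115M vs 77M files above.
  }
}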

On Sun, May 2, 2010 at 5:49 PM, Robin Anil <ro...@gmail.com> wrote:
> PS: the size of the SparseVector is greater than that of the DenseVector for
> a full vector. I guess something could be done about it.

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Robin Anil <ro...@gmail.com>.
PS: the size of the SparseVector is greater than that of the DenseVector for a
full vector. I guess something could be done about it.

On Sun, May 2, 2010 at 10:03 PM, Sean Owen <sr...@gmail.com> wrote:

> That's the one! I actually didn't know this was how PBs did the
> variable length encoding, but it makes sense; it's about the most
> efficient thing I can imagine.
>
> Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
> the 3 bytes or so it would take the other scheme. Could add up over
> thousands of elements times millions of vectors.
>
> Decoding isn't too slow and if one believes this isn't an unusual
> encoding, it's not so problematic to use it in a format that others
> outside Mahout may wish to consume.
>
> On Sun, May 2, 2010 at 5:23 PM, Robin Anil <ro...@gmail.com> wrote:
> > You mean this type of encoding instead?
> >  http://code.google.com/apis/protocolbuffers/docs/encoding.html
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Sean Owen <sr...@gmail.com>.
That's the one! I actually didn't know this was how PBs did the
variable length encoding, but it makes sense; it's about the most
efficient thing I can imagine.

Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
the 3 bytes or so it would take the other scheme. Could add up over
thousands of elements times millions of vectors.

Decoding isn't too slow and if one believes this isn't an unusual
encoding, it's not so problematic to use it in a format that others
outside Mahout may wish to consume.
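
If it helps to compare, a small sketch of the byte counts: sevenBitSize models
the 7-bits-per-byte scheme discussed here, while Hadoop's own
WritableUtils.getVIntSize reports the size of its length-prefixed encoding.

import org.apache.hadoop.io.WritableUtils;

public final class VarIntSizes {

  // Bytes needed by a 7-bits-per-byte continuation encoding.
  static int sevenBitSize(int value) {
    int bytes = 1;
    while ((value >>>= 7) != 0) {
      bytes++;
    }
    return bytes;
  }

  public static void main(String[] args) {
    for (int value : new int[] {127, 128, 16383, 16384, 1000000}) {
      System.out.printf("%8d: 7-bit varint = %d bytes, WritableUtils vint = %d bytes%n",
          value, sevenBitSize(value), WritableUtils.getVIntSize(value));
    }
    // 16383 fits in 2 bytes with the 7-bit scheme, takes 3 bytes via
    // WritableUtils, and a fixed int always takes 4.
  }
}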

On Sun, May 2, 2010 at 5:23 PM, Robin Anil <ro...@gmail.com> wrote:
> You mean this type of encoding instead?
>  http://code.google.com/apis/protocolbuffers/docs/encoding.html

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Robin Anil <ro...@gmail.com>.
On Sun, May 2, 2010 at 9:40 PM, Sean Owen <sr...@gmail.com> wrote:

> What's the specific improvement idea?
>
> Size and speed improvements would be good. The Hadoop serialization
> mechanism is already pretty low-level, dealing directly in bytes (as
> opposed to fancier stuff like Avro). It's if anything fast and lean
> but quite manual. The latest Writable updates squeezed out most of the
> remaining overhead.
>
> One thing to recall is that in the tradeoff between size and speed, a
> test against a local ramdisk will make the cost of reading/writing
> bytes artificially low. That is to say I'd just err more on the side
> of compactness unless it makes a very big difference in decode time,
> as I imagine the cost of decoding bytes is nothing compared to that of
> storing and transmitting over a network. (Not to mention HDFS's work
> to replicate those bytes, etc.)
>
> I suspect there might be some value in storing vector indices as
> variable length ints, since they're usually not so large. I can also
> imagine more compact variable length encodings than the one in
> WritableUtils -- thinking of the encoding used in MIDI (and elsewhere
> I'd guess), where 7 bits per byte are used and the top bit signals the
> final value. IIRC WritableUtils always spends 8 bits writing the
> length of the encoding.
>
You mean this type of encoding instead?
 http://code.google.com/apis/protocolbuffers/docs/encoding.html

>
> On Sun, May 2, 2010 at 5:02 PM, Robin Anil <ro...@gmail.com> wrote:
> > I am getting more and more ideas as I try to write about scaling Mahout
> > clustering. I added a serialize/deserialize benchmark for Vectors and
> > checked the speed of our vectors.
> >
> > Here is the output with Cardinality=1000, Sparsity=1000 (dense),
> > numVectors=100, loop=100 (hence writing 10K vectors of (int, double)
> > entries to disk and reading them back). Note that the Speed and Rate
> > figures are not raw disk throughput but the number and serialized size of
> > vectors processed per second, and the filesystem is a ramdisk.
>

Re: Optimization opportunity: Speed up serialization and deserialization

Posted by Sean Owen <sr...@gmail.com>.
What's the specific improvement idea?

Size and speed improvements would be good. The Hadoop serialization
mechanism is already pretty low-level, dealing directly in bytes (as
opposed to fancier stuff like Avro). It's if anything fast and lean
but quite manual. The latest Writable updates squeezed out most of the
remaining overhead.
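
For anyone new to the thread, a minimal Writable shows how low-level this layer
is; the class name and fields here are made up purely for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative only: Hadoop serialization is "write your own bytes".
public class PointWritable implements Writable {
  private int id;
  private double value;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(id);         // exactly 4 bytes, no framing or schema
    out.writeDouble(value);   // exactly 8 bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readInt();        // must read back in exactly the same order
    value = in.readDouble();
  }
}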

One thing to recall is that in the tradeoff between size and speed, a
test against a local ramdisk will make the cost of reading/writing
bytes artificially low. That is to say I'd just err more on the side
of compactness unless it makes a very big difference in decode time,
as I imagine the cost of decoding bytes is nothing compared to that of
storing and transmitting over a network. (Not to mention HDFS's work
to replicate those bytes, etc.)

I suspect there might be some value in storing vector indices as
variable length ints, since they're usually not so large. I can also
imagine more compact variable length encodings than the one in
WritableUtils -- thinking of the encoding used in MIDI (and elsewhere
I'd guess), where 7 bits per byte are used and the top bit signals the
final value. IIRC WritableUtils always spends 8 bits writing the
length of the encoding.
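
A sketch of that 7-bits-per-byte encoding against DataOutput/DataInput: each
byte carries 7 bits of payload, a set top bit means another byte follows, and a
clear top bit marks the final byte (this sketch writes the least-significant
group first, as protobuf does). Indexes below 128 then cost one byte and
anything below 16,384 costs two, instead of a fixed four.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public final class VarInt {
  private VarInt() { }

  // Write 7 bits per byte, least-significant group first; the top bit of each
  // byte is a continuation flag, so no separate length byte is ever written.
  public static void writeUnsignedVarInt(int value, DataOutput out)
      throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }

  public static int readUnsignedVarInt(DataInput in) throws IOException {
    int value = 0;
    int shift = 0;
    int b;
    while (((b = in.readByte()) & 0x80) != 0) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value | (b << shift);
  }
}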

On Sun, May 2, 2010 at 5:02 PM, Robin Anil <ro...@gmail.com> wrote:
> I am getting more and more ideas as I try to write about scaling Mahout
> clustering. I added a serialize/deserialize benchmark for Vectors and
> checked the speed of our vectors.
>
> Here is the output with Cardinality=1000, Sparsity=1000 (dense), numVectors=100,
> loop=100 (hence writing 10K vectors of (int, double) entries to disk and
> reading them back). Note that the Speed and Rate figures are not raw disk
> throughput but the number and serialized size of vectors processed per second,
> and the filesystem is a ramdisk.