Posted to user@avro.apache.org by nir_zamir <ni...@gmail.com> on 2013/05/23 09:42:39 UTC

Compressed Avro vs. compressed Sequence - unexpected results?

Hi,

We're examining the storage of our data in Snappy-compressed files. Since we
want the data's structure to be self contained, we checked it with Avro and
with Sequence (both are splittable, which should best utilize our cluster).

We tested performance with a 12 GB CSV data file on a 4-node cluster in a
production environment (very strong machines).

Compression

What we did here (for test simplicity) is create two Hive tables: Avro-based
and Sequence-based. Then we enabled Snappy compression and INSERTed the data
from the RAW table (consisting of the 12GB file).

In terms of compression rate, Avro was better: 72% vs. 57%.
In both cases there were 45 mappers, and CPU/Mem were very far from their
limit on all machines.
Since there was no reduce operator, this created 45 files.

Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for
sequence files.

Decompression

What we did here was this Hive query:
SELECT COUNT(1) FROM table-name;

Here was the real difference: it took Avro about *6 times as long* to
perform this (3 minutes vs. 0.5 minutes).
This was very surprising: on machines this strong we expected I/O to be the
bottleneck, and since the Avro files are smaller, we expected them to be
faster to read and decompress.
The number of mappers in both cases was similar (14 vs. 17) and again,
CPU/Mem didn't seem to be exhausted.
Since read time is what matters most to us, this issue makes it hard for us
to use Avro.

Maybe we're doing something wrong - your input would be much appreciated!

Thanks,
Nir

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: Compressed Avro vs. compressed Sequence - unexpected results?

Posted by Scott Carey <sc...@apache.org>.

For your Avro files, double-check that Snappy is actually being used: use
avro-tools to peek at the metadata in the file, or simply view the start of
the file in a text editor. The compression codec used is recorded in the
file header.
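
As a rough sketch of that header check, the Python below reads the metadata
map at the start of an Avro object container file and reports the
"avro.codec" entry. It parses the header by hand instead of using an Avro
library, and the function names (read_long, read_bytes, read_header_metadata,
codec_of) are made up for this illustration. It relies on the container
layout from the Avro specification: the magic bytes "Obj" followed by 1, then
a map of string keys to byte values, with a missing "avro.codec" key meaning
the "null" (uncompressed) codec.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data in Avro header")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if not (value & 0x80):  # high bit clear: last byte of this number
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode one length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data in Avro header")
    return data


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def codec_of(path):
    """Return the codec name stored in the file at `path`."""
    with open(path, "rb") as f:
        meta = read_header_metadata(f)
    return meta.get("avro.codec", b"null").decode("utf-8")
```

Calling codec_of() on one of the files Hive wrote for the Avro table (copied
out of HDFS first) should return "snappy" if compression was really applied;
"null" or "deflate" would mean the Snappy setting did not take effect for
that table.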

Snappy is very fast, so the time to read is most likely dominated by
deserialization.  Avro will be slower than a trivial deserializer (but
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro SerDe works; it is possible there is a
performance issue there.  If you could get a handful of stack traces
(kill -3 or jstack) from the mapper tasks, or profiler output, that
would be very insightful.
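
One hedged way to read a handful of such thread dumps is to tally which
method sits at the top of each thread's stack across all the dumps; frames
that keep showing up are the likely hot spots. The Python sketch below does
that. The function names top_frames and tally are made up here, and it
assumes the usual HotSpot dump layout in which each thread starts with a
line beginning with a double quote and each stack frame line begins with
"at ".

```python
from collections import Counter


def top_frames(dump_text):
    """Return the topmost stack frame of each thread in one thread dump."""
    frames = []
    in_thread = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # Thread header line, e.g. "main" #1 prio=5 ...
            in_thread = True
        elif in_thread and stripped.startswith("at "):
            # First frame after the header is the top of that stack.
            frames.append(stripped[3:])
            in_thread = False
    return frames


def tally(dumps):
    """Count top-of-stack frames across several thread dumps."""
    counts = Counter()
    for dump_text in dumps:
        counts.update(top_frames(dump_text))
    return counts
```

Given the text of several dumps taken a few seconds apart from the same
mapper, tally(dumps).most_common(10) lists the frames seen most often.
Threads that are merely parked or waiting will appear too, so the list
still needs to be read with some judgment.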

