Posted to dev@parquet.apache.org by Robert Synnott <rs...@gmail.com> on 2015/05/07 22:06:54 UTC

Performance issues using Avro

Hi,
I just started trying out Parquet and ran into a performance issue. I
was using the Avro support to work with a test schema, using the
'standalone' approach from here:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/

I took an existing Avro schema, consisting of a few columns each
containing a map, and wrote, then read back, about 40MB of data using
both Avro's own serialisation and Parquet's. Parquet's ended up being
about five times slower. This ratio was maintained when I moved to
~1GB of data. I'd expect it to be a little slower, as I was reading
back all columns, but five times seems high. Is there anything simple
I might be missing?
Thanks
Rob
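
For concreteness, a rough sketch of what the two read paths in a
comparison like this might look like; the file names are made up, and
the parquet-avro package is an assumption (parquet-mr releases before
1.8 shipped these classes under parquet.avro rather than
org.apache.parquet.avro):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import parquet.avro.AvroParquetReader;

    public class ReadComparison {
      public static void main(String[] args) throws Exception {
        // Read everything back through Avro's own container-file reader.
        long avroCount = 0;
        DataFileReader<GenericRecord> avroReader =
            new DataFileReader<GenericRecord>(
                new File("test.avro"), new GenericDatumReader<GenericRecord>());
        while (avroReader.hasNext()) {
          avroReader.next();
          avroCount++;
        }
        avroReader.close();

        // Read everything back through parquet-avro. (Newer releases prefer
        // AvroParquetReader.builder(path).build() over this constructor.)
        long parquetCount = 0;
        AvroParquetReader<GenericRecord> parquetReader =
            new AvroParquetReader<GenericRecord>(new Path("test.parquet"));
        GenericRecord record;
        while ((record = parquetReader.read()) != null) {
          parquetCount++;
        }
        parquetReader.close();

        System.out.println("avro=" + avroCount + ", parquet=" + parquetCount);
      }
    }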

Re: Performance issues using Avro

Posted by Ryan Blue <bl...@cloudera.com>.
Robert,

Thanks for taking the time to track this down and to let us know what 
the problem actually was. We're working on a new version of the Avro 
support that more closely matches what Avro does internally. I'll make 
sure the String/UTF8 problem that's slowing things down here is solved 
at the same time. Thanks!

By the way, that should also enable reflect support. I'm not sure if 
you're interested in it, but I think it should be a good addition. :)
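
As background (not something spelled out in this thread), the
String-versus-Utf8 choice on the Avro side is controlled per string
schema by the "avro.java.string" property. A minimal sketch, using a
made-up map-of-string field rather than the schema from the original
post:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class StringTypeExample {
      public static void main(String[] args) {
        // With "avro.java.string": "String" on the value schema, Avro's
        // GenericDatumReader materialises java.lang.String values instead
        // of the lazy org.apache.avro.util.Utf8 wrapper.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Example\", \"fields\": ["
            + "{\"name\": \"attrs\", \"type\": {\"type\": \"map\","
            + " \"values\": {\"type\": \"string\", \"avro.java.string\": \"String\"}}}"
            + "]}");

        // Records read through this reader now carry String map values.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>(schema);
        System.out.println(schema.toString(true));
        // GenericData.setStringType(stringSchema, GenericData.StringType.String)
        // sets the same property programmatically.
      }
    }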

rb

On 05/11/2015 11:15 AM, Robert Synnott wrote:
> I found out what was going on here in the end. It turns out that
> Avro's own decoder and Parquet's Avro support don't behave the same
> way. Parquet decodes strings into Java Strings up front, while Avro
> just wraps the raw bytes in this wrapper, deferring the cost of
> decoding until someone actually wants the string value:
> https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/util/Utf8.java
>
> Since my code doesn't need to actually read most of the map values on
> each run, the Avro decoder approach worked a lot faster for me. I can
> get around this by using 'bytes' rather than 'string' in the schema
> and doing the decoding myself where necessary, so that's fine.
>
>
> On 7 May 2015 at 22:27, Alex Levenson <al...@twitter.com.invalid> wrote:
>> Are you comparing the read speed on a Hadoop cluster, or locally on a
>> single machine? In a microbenchmark like this, using Hadoop local mode
>> for Parquet, but not for Avro, could introduce a lot of overhead. Just
>> curious how you're doing the comparison.
>>
>> On Thu, May 7, 2015 at 1:06 PM, Robert Synnott <rs...@gmail.com> wrote:
>>
>>> Hi,
>>> I just started trying out Parquet and ran into a performance issue. I
>>> was using the Avro support to work with a test schema, using the
>>> 'standalone' approach from here:
>>>
>>> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>>>
>>> I took an existing Avro schema, consisting of a few columns each
>>> containing a map, and wrote, then read back, about 40MB of data using
>>> both Avro's own serialisation and Parquet's. Parquet's ended up being
>>> about five times slower. This ratio was maintained when I moved to
>>> ~1GB of data. I'd expect it to be a little slower, as I was reading
>>> back all columns, but five times seems high. Is there anything simple
>>> I might be missing?
>>> Thanks
>>> Rob
>>>
>>
>>
>>
>> --
>> Alex Levenson
>> @THISWILLWORK
>
>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Performance issues using Avro

Posted by Robert Synnott <rs...@gmail.com>.
I found out what was going on here in the end. It turns out that
Avro's own decoder and Parquet's Avro support don't behave the same
way. Parquet decodes strings into Java Strings up front, while Avro
just wraps the raw bytes in this wrapper, deferring the cost of
decoding until someone actually wants the string value:
https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/util/Utf8.java

Since my code doesn't need to actually read most of the map values on
each run, the Avro decoder approach worked a lot faster for me. I can
get around this by using 'bytes' rather than 'string' in the schema
and doing the decoding myself where necessary, so that's fine.
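
For illustration, a minimal sketch of the 'bytes' workaround described
above; the field name "attrs", the wantKey() filter, and the
surrounding read loop are hypothetical rather than taken from the
actual code:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.avro.generic.GenericRecord;

    public class LazyMapValues {
      // Assumes a field declared as
      //   {"name": "attrs", "type": {"type": "map", "values": "bytes"}}
      // so the readers hand values back as ByteBuffers and leave UTF-8
      // decoding to the caller.
      static void handle(GenericRecord record) {
        @SuppressWarnings("unchecked")
        Map<CharSequence, ByteBuffer> attrs =
            (Map<CharSequence, ByteBuffer>) record.get("attrs");
        for (Map.Entry<CharSequence, ByteBuffer> entry : attrs.entrySet()) {
          // Map keys are still Avro strings, so they may arrive as String or
          // Utf8 depending on the reader; CharSequence covers both.
          if (!wantKey(entry.getKey())) {
            continue;
          }
          // Decode only the values actually needed on this run; duplicate()
          // leaves the buffer's position untouched for other users.
          String value =
              StandardCharsets.UTF_8.decode(entry.getValue().duplicate()).toString();
          System.out.println(entry.getKey() + " = " + value);
        }
      }

      // Hypothetical stand-in for "most values aren't needed on each run".
      static boolean wantKey(CharSequence key) {
        return key.toString().startsWith("important.");
      }
    }

The trade-off is that the schema no longer says the values are UTF-8
text, so every consumer of the data has to know to decode them itself.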


On 7 May 2015 at 22:27, Alex Levenson <al...@twitter.com.invalid> wrote:
> Are you comparing the read speed on a Hadoop cluster, or locally on a
> single machine? In a microbenchmark like this, using Hadoop local mode
> for Parquet, but not for Avro, could introduce a lot of overhead. Just
> curious how you're doing the comparison.
>
> On Thu, May 7, 2015 at 1:06 PM, Robert Synnott <rs...@gmail.com> wrote:
>
>> Hi,
>> I just started trying out Parquet and ran into a performance issue. I
>> was using the Avro support to work with a test schema, using the
>> 'standalone' approach from here:
>>
>> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>>
>> I took an existing Avro schema, consisting of a few columns each
>> containing a map, and wrote, then read back, about 40MB of data using
>> both Avro's own serialisation and Parquet's. Parquet's ended up being
>> about five times slower. This ratio was maintained when I moved to
>> ~1GB of data. I'd expect it to be a little slower, as I was reading
>> back all columns, but five times seems high. Is there anything simple
>> I might be missing?
>> Thanks
>> Rob
>>
>
>
>
> --
> Alex Levenson
> @THISWILLWORK



-- 
Robert Synnott
http://myblog.rsynnott.com
MSN: rsynnott@gmail.com
Jabber: rsynnott@gmail.com

Re: Performance issues using Avro

Posted by Alex Levenson <al...@twitter.com.INVALID>.
Are you comparing the read speed on a Hadoop cluster, or locally on a
single machine? In a microbenchmark like this, using Hadoop local mode
for Parquet, but not for Avro, could introduce a lot of overhead. Just
curious how you're doing the comparison.

On Thu, May 7, 2015 at 1:06 PM, Robert Synnott <rs...@gmail.com> wrote:

> Hi,
> I just started trying out Parquet and ran into a performance issue. I
> was using the Avro support to work with a test schema, using the
> 'standalone' approach from here:
>
> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>
> I took an existing Avro schema, consisting of a few columns each
> containing a map, and wrote, then read back, about 40MB of data using
> both Avro's own serialisation and Parquet's. Parquet's ended up being
> about five times slower. This ratio was maintained when I moved to
> ~1GB of data. I'd expect it to be a little slower, as I was reading
> back all columns, but five times seems high. Is there anything simple
> I might be missing?
> Thanks
> Rob
>



-- 
Alex Levenson
@THISWILLWORK