Posted to user@avro.apache.org by Dan Filimon <da...@gmail.com> on 2013/07/03 11:12:00 UTC

Mixed Avro/Hadoop Writable pipeline

Hi!

I'm working on integrating Avro into our data processing pipeline.
We're using quite a few standard Hadoop and Mahout writables (IntWritable,
VectorWritable).

I'm first going to replace our custom Writables with Avro, but for the
standard ones, how important would you say it is to use, for example,
AvroKey<Integer> instead of IntWritable?

The changes will happen gradually, but are they even worth it?

Thanks!

Re: Mixed Avro/Hadoop Writable pipeline

Posted by Martin Kleppmann <ma...@rapportive.com>.
Hadoop's writables use Java's java.io.Data{Input,Output}Stream by default
(see org.apache.hadoop.io.serializer.WritableSerialization). This uses a
fixed-length encoding: 4 bytes for an int, 8 bytes for a long.
http://docs.oracle.com/javase/6/docs/api/java/io/DataOutputStream.html#writeInt(int)

Avro-encoded ints and longs are always variable-length (zig-zag varint
encoding); if you want fixed-length, use a 'fixed' type in the schema.
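
As a rough illustration, here is a minimal sketch of the size difference
(the value 7 and the class name are arbitrary; it assumes the stock
Hadoop Writable API and the Avro 1.7 encoder API):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.io.IntWritable;

    public class EncodingSizes {
      public static void main(String[] args) throws Exception {
        // Writable path: IntWritable.write() calls DataOutput.writeInt(),
        // which always emits exactly 4 bytes.
        ByteArrayOutputStream writableOut = new ByteArrayOutputStream();
        new IntWritable(7).write(new DataOutputStream(writableOut));
        System.out.println("IntWritable(7): " + writableOut.size() + " bytes"); // 4

        // Avro path: writeInt() zig-zag varint-encodes, so small values
        // fit in a single byte.
        ByteArrayOutputStream avroOut = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(avroOut, null);
        encoder.writeInt(7);
        encoder.flush();
        System.out.println("Avro int 7: " + avroOut.size() + " bytes"); // 1
      }
    }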

Martin


Re: Mixed Avro/Hadoop Writable pipeline

Posted by Dan Filimon <da...@gmail.com>.
The documentation for IntWritable doesn't explicitly say whether it's
fixed-length [1]. But given that there's also a VIntWritable [2], I
think IntWritable is always 4 bytes.

[1]
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/IntWritable.html
[2]
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/VIntWritable.html
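
For what it's worth, here's a minimal sketch that bears this out (the
value 7 and the class name are arbitrary; it assumes the standard
org.apache.hadoop.io classes):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.VIntWritable;

    public class WritableSizes {
      public static void main(String[] args) throws Exception {
        // IntWritable: always 4 bytes, regardless of the value.
        ByteArrayOutputStream fixed = new ByteArrayOutputStream();
        new IntWritable(7).write(new DataOutputStream(fixed));
        System.out.println("IntWritable: " + fixed.size() + " bytes");     // 4

        // VIntWritable: variable-length, shrinks for small values.
        ByteArrayOutputStream variable = new ByteArrayOutputStream();
        new VIntWritable(7).write(new DataOutputStream(variable));
        System.out.println("VIntWritable: " + variable.size() + " bytes"); // 1
      }
    }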


Re: Mixed Avro/Hadoop Writable pipeline

Posted by Pradeep Gollakota <pr...@gmail.com>.
I'm not sure whether AvroKey<Integer> is 4 bytes or not, but IntWritable
is variable-length: if the number can be represented in fewer than 4
bytes, it will be.

Re: Mixed Avro/Hadoop Writable pipeline

Posted by Dan Filimon <da...@gmail.com>.
Well, I got it working eventually. :)

First of all, I'll mention that I'm using the new MapReduce API, so no
AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
AvroValue<> wrappers; once I set the right properties using AvroJob's
static methods (AvroJob.setMapOutputValueSchema(), for example) and set
the input format to AvroKeyInputFormat, everything worked out fine.
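
For reference, here is a minimal driver sketch of that setup (the
schemas, input path, and job name are illustrative, not from our actual
job; it assumes the new-API classes in org.apache.avro.mapreduce):

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class AvroJobSetup {
      public static Job configureJob(Configuration conf) throws IOException {
        Job job = new Job(conf, "avro-pipeline");
        job.setJarByClass(AvroJobSetup.class);
        // Read AvroKey<T> records from Avro data files.
        job.setInputFormatClass(AvroKeyInputFormat.class);
        // Declare the map output schemas so the shuffle serializes and
        // compares the AvroKey/AvroValue wrappers correctly.
        AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.INT));
        AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.DOUBLE));
        FileInputFormat.addInputPath(job, new Path("/input/path"));
        return job;
      }
    }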

About the writables, I'm interested to know whether it'd be better to
use the Avro equivalents: AvroKey<Integer> rather than IntWritable, for
example. I assume the speed/size of the two should be the same 4 bytes?


Re: Mixed Avro/Hadoop Writable pipeline

Posted by Martin Kleppmann <ma...@rapportive.com>.
Hi Dan,

You're stepping off the documented path here, but I think that although it
might be a bit of work, it should be possible.

Things to watch out for: you might not be able to use
AvroMapper/AvroReducer so easily, and you may have to mess around with the
job conf a bit (Avro-configured jobs use their own shuffle config with
AvroKeyComparator, which may not be what you want if you're also trying to
use writables). I'd suggest simply reading the code in
org.apache.avro.mapred[uce] -- it's not too complicated.

Whether Avro files or writables (i.e. Hadoop sequence files) are better for
you depends mostly on which format you'd rather have your data in. If you
want to read the data files with something other than Hadoop, Avro is
definitely a good option. Also, Avro data files are self-describing (due to
their embedded schema) which makes them pleasant to use with tools like Pig
and Hive.
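
For example, here's a minimal sketch of what self-describing buys you --
the reader recovers the schema from the file header, with no external
schema supplied (the file name is illustrative):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ShowSchema {
      public static void main(String[] args) throws Exception {
        // No reader schema passed in: GenericDatumReader picks up the
        // writer's schema embedded in the file itself.
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            new File("part-00000.avro"), new GenericDatumReader<GenericRecord>());
        System.out.println(reader.getSchema().toString(true));
        reader.close();
      }
    }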

Martin

