Posted to user@pig.apache.org by Zach Bailey <zn...@gmail.com> on 2010/09/27 22:23:55 UTC

BytesWritable support in Piggybank SequenceFileLoader?

Hey folks,

Not sure if this has been discussed already or if this is due to some 
limitation in pig, hadoop, or java - but is there a particular reason 
the PiggyBank SequenceFileLoader doesn't support the BytesWritable type 
for sequence file keys/values?

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/BytesWritable.html

Looking at the code, I see that it maps the Pig-specific DataByteArray class 
to the Pig type "bytearray" - I don't understand this choice. Why use a 
Pig-specific class here (which is not very friendly for a mixed 
Pig/non-Pig Hadoop ecosystem)?

In fact, if you look at the SequenceFileLoader code you will see 
something that looks very strange:

protected Object translateWritableToPigDataType(Writable w, byte dataType) {
     switch(dataType) {
       case DataType.CHARARRAY: return ((Text) w).toString();
       case DataType.BYTEARRAY: return ((DataByteArray) w).get();
       case DataType.INTEGER: return ((IntWritable) w).get();
       case DataType.LONG: return ((LongWritable) w).get();
       case DataType.FLOAT: return ((FloatWritable) w).get();
       case DataType.DOUBLE: return ((DoubleWritable) w).get();
       case DataType.BYTE: return ((ByteWritable) w).get();
     }

     return null;
   }

This code smells - the method takes a Writable, which makes sense, but 
then for the BYTEARRAY type it casts it to a DataByteArray, which 
doesn't implement Writable! WTF, mate?

I'm going to try my hand at switching this to use BytesWritable instead 
and see what explodes.
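
For reference, here is roughly the shape of the change I have in mind - a 
minimal, untested sketch. It assumes the field really is a BytesWritable; note 
that getBytes() can return a buffer longer than the valid data, so it has to 
be trimmed to getLength():

import org.apache.hadoop.io.BytesWritable;
import org.apache.pig.data.DataByteArray;

// Sketch only: convert a BytesWritable into the DataByteArray that Pig
// expects for a "bytearray" field. getBytes() returns the backing buffer,
// which may be longer than the valid data, so copy just getLength() bytes.
private static DataByteArray bytesWritableToPig(BytesWritable bw) {
    byte[] trimmed = new byte[bw.getLength()];
    System.arraycopy(bw.getBytes(), 0, trimmed, 0, bw.getLength());
    return new DataByteArray(trimmed);
}

// and the switch case would become:
// case DataType.BYTEARRAY: return bytesWritableToPig((BytesWritable) w);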

Cheers,
-Zach

Re: BytesWritable support in Piggybank SequenceFileLoader?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think the DataByteArray thing you pointed out is likely a real bug; I tested it
on an incomplete subset of primitive types (just longs or something).

The main thing that keeps me from recommending it as-is is that if you are
reading a SequenceFile whose composition you already know, you should work
with it directly (manually translating into Pig tuples). IIRC, the loader as
written assumes that the key and value are both primitives -- what if your
value is a Pair of primitives? The SequenceFile loader won't support it.
So, "some assembly required."

As far as making sure it's working correctly -- you are going to want to
test that it handles records that span split boundaries correctly, neither
reading them twice nor dropping them. Actually, this is probably taken care of
by the new API, so you should be fine if you're on 0.7+. It might be buggy in 0.6.
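
Concretely, the kind of sanity check I mean (just a sketch -- it assumes you've
already written a SequenceFile with a known record count to 'seqfile', sized so
it spans several splits, and that the piggybank jar is on hand):

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class SplitBoundaryCheck {
    public static void main(String[] args) throws Exception {
        // Load through the loader under test and count what comes back;
        // a mismatch with the known record count means records are being
        // dropped or double-read at split boundaries.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerJar("piggybank.jar");  // path to your piggybank jar
        pig.registerQuery("a = LOAD 'seqfile' USING "
            + "org.apache.pig.piggybank.storage.SequenceFileLoader() AS (key, val);");
        pig.registerQuery("b = GROUP a ALL;");
        pig.registerQuery("c = FOREACH b GENERATE COUNT(a);");
        Iterator<Tuple> it = pig.openIterator("c");
        System.out.println("records seen: " + it.next().get(0));
    }
}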

-D

Re: BytesWritable support in Piggybank SequenceFileLoader?

Posted by Zach Bailey <za...@dataclip.com>.
Oh, gosh, well that makes me uneasy, since I was intending to actually use 
this in production.

Is there something in particular about this class that makes it not 
intended for real-world use? Performance? The way it's written (e.g. it 
still depends on old APIs)?

Is there a loader you suggest I look at using instead that has been more 
battle-tested?

-Zach

Re: BytesWritable support in Piggybank SequenceFileLoader?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Zach,
Perhaps I should've documented that better.
That class is *not intended for real use*. As far as I know, it's never been
used by anyone for anything in production.
It's a demo of how one would go about writing a real SequenceFileLoader for
whatever internal stuff you are using. Feel free to replace anything that
makes sense for you in your implementation.

-D
