Posted to common-user@hadoop.apache.org by bzheng <bi...@gmail.com> on 2009/04/09 01:59:32 UTC

BytesWritable get() returns more bytes than what's stored

I tried to store a protocol buffer as a BytesWritable in a sequence file <Text,
BytesWritable>.  It's written with SequenceFile.Writer.append(new Text(key), new
BytesWritable(protobuf.convertToBytes())).  When reading the values back from
the key/value pairs using value.get(), it returns more bytes than what was
stored.  However, value.getSize() returns the correct number.  This means that
in order to convert the byte[] back to a protocol buffer, I have to do
Arrays.copyOf(value.get(), value.getSize()).  This happens on both versions
0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
few entries in the sequence file are below.  The extra bytes in value.get()
all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
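
For reference, here is a minimal sketch of the read loop that produced the
numbers above.  SomeProto is a stand-in for my actual generated protocol
buffer class:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSizes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
      // get() returns the whole internal buffer, which can be longer than
      // the data stored for this record; getSize() is the real length.
      System.out.println("value.getSize(): " + value.getSize()
          + "\tvalue.get().length: " + value.get().length);
      // Workaround: copy out only the valid bytes before parsing.
      SomeProto proto =
          SomeProto.parseFrom(Arrays.copyOf(value.get(), value.getSize()));
    }
    reader.close();
  }
}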



Re: BytesWritable get() returns more bytes than what's stored

Posted by Gaurav Chandalia <gc...@adknowledge.com>.
Arrays.copyOf isn't required; protocol buffers have a method to merge from a
slice of a byte array.  You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names
in earlier versions of Hadoop may be slightly different.
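
For instance, assuming SomeProto is the generated message class and value is
the BytesWritable read from the sequence file, a copy-free parse might look
like:

// Parse directly from the valid region of the buffer; no array copy needed.
SomeProto proto = SomeProto.newBuilder()
    .mergeFrom(value.getBytes(), 0, value.getLength())
    .build();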

--
gaurav



Re: BytesWritable get() returns more bytes than what's stored

Posted by Alex Loddengaard <al...@cloudera.com>.
FYI: this (open) JIRA might be interesting to you:

<http://issues.apache.org/jira/browse/HADOOP-3788>

Alex


Re: BytesWritable get() returns more bytes than what's stored

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Apr 8, 2009 at 7:14 PM, bzheng <bi...@gmail.com> wrote:

>
> Thanks for the clarification.  Though I still find it strange that the
> get() method doesn't return only what's actually stored, regardless of
> buffer size.  Is there any reason why you'd want to use or examine what's
> in the buffer beyond that?
>

Because doing so would require an array copy.  It's important for Hadoop
performance to avoid needless copies of data.  Most APIs that take byte[]
arrays have a variant that also accepts an offset and a length, so the copy
is usually avoidable.
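
For example, a sketch with made-up names (value is a BytesWritable; the
getBytes()/getLength() names are from 0.19, the earlier equivalents being
get()/getSize()):

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

import org.apache.hadoop.io.BytesWritable;

public class ValidRegionExamples {
  static void consume(BytesWritable value) throws Exception {
    // Write only the valid bytes to a stream; no intermediate copy.
    OutputStream out = new FileOutputStream("/tmp/record.bin");
    out.write(value.getBytes(), 0, value.getLength());
    out.close();

    // Hash only the valid bytes, again without copying.
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    md5.update(value.getBytes(), 0, value.getLength());
  }
}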

-Todd



Re: BytesWritable get() returns more bytes than what's stored

Posted by bzheng <bi...@gmail.com>.
Thanks for the clarification.  Though I still find it strange that the
get() method doesn't return only what's actually stored, regardless of
buffer size.  Is there any reason why you'd want to use or examine what's
in the buffer beyond that?


Re: BytesWritable get() returns more bytes than what's stored

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown
but never shrunk.  The reason is that Writables in general are single
instances that are reused across multiple input records.  If you look at the
internals of the input reader, you'll see that a single BytesWritable is
instantiated, and then each record is read into that same instance.  The
purpose is to avoid an allocation for every record.
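
In pseudocode, the reuse pattern looks roughly like this (a simplified
sketch, not the actual Hadoop internals; process() is a hypothetical
consumer):

// One BytesWritable is allocated up front and reused for every record.
BytesWritable value = new BytesWritable();
Text key = new Text();
while (reader.next(key, value)) {
  // next() reads each record into value's existing buffer, growing it only
  // when the record is larger than the buffer.  The buffer never shrinks,
  // so get() can return an array longer than the current record.
  process(value.get(), 0, value.getSize());
}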

The end result is, as you've seen, that getBytes() returns an array which
may be larger than the actual amount of data.  In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents; they are not
guaranteed to be zero.

Unfortunately, if the protobuffer API doesn't let you deserialize from a
smaller portion of a byte array, you're out of luck and will have to do the
copy as you've described.  I imagine, though, that there's some way around
this in the protobuffer API - perhaps you can use a ByteArrayInputStream
here to your advantage.
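
Something like this, for example (a sketch; SomeProto stands in for your
generated message class):

import java.io.ByteArrayInputStream;

// Wrap just the valid region of the buffer; no array copy required.
ByteArrayInputStream in = new ByteArrayInputStream(
    value.getBytes(), 0, value.getLength());
SomeProto proto = SomeProto.parseFrom(in);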

Hope that helps
-Todd
