Posted to user@hbase.apache.org by Rob Verkuylen <ro...@verkuylen.net> on 2013/06/04 21:58:36 UTC

Re: Explosion in datasize using HBase as a MR sink

Finally fixed this, my code was at fault.

Protobufs require a builder object, which in our case was a (non-static) protected field in an abstract class that all parsers extend. The mapper calls a parser factory depending on the input record. Because we designed the parser instances as singletons, the builder object in the abstract class got reused and all data got appended to the same builder. Doh! This only shows up in a full job, not in single tests. Ah well, I've learned a lot :)
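For anyone hitting the same thing, a minimal sketch of the failure mode (the class and method names below are made up for illustration, not our actual parser code): a protobuf Builder kept as shared state in the abstract base class keeps accumulating repeated fields across map() calls when the parsers are singletons, so every serialized message, and therefore every Put, contains everything parsed so far.

    // Hypothetical sketch of the bug; "Record" stands in for a generated protobuf class.
    abstract class AbstractParser {
      // Shared mutable state: with singleton parsers this one builder lives for the
      // whole map task, so every parse() appends on top of the previous records.
      protected final Record.Builder builder = Record.newBuilder();

      byte[] parseBuggy(String input) {
        builder.addEntry(input);               // keeps growing across calls
        return builder.build().toByteArray();  // each Put gets bigger than the last
      }

      // Fix: a fresh builder per record (or builder.clear() at the top of parse()).
      byte[] parseFixed(String input) {
        return Record.newBuilder().addEntry(input).build().toByteArray();
      }
    }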

@Asaf we will be moving to LoadIncrementalHFiles asap. I had the code ready, but obviously it showed the same size problems before the fix.
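For reference, the bulk-load route Asaf describes below looks roughly like this on the 0.9x-era API (a sketch only: the table name, paths and MyParserMapper are placeholders rather than the actual job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "parse-and-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(MyParserMapper.class);              // emits (ImmutableBytesWritable, Put)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFiles land here

        HTable table = new HTable(conf, "T2.1");
        // Wires in the reducer, partitioner and output format needed to write HFiles
        // that line up with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
          // Move the finished HFiles into the regions; no memstore, flush or WAL involved.
          new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
        }
      }
    }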

Thnx for the thoughts!

On May 31, 2013, at 22:02, Asaf Mesika <as...@gmail.com> wrote:

> At your data set size, I would go with HFileOutputFormat and then bulk load into HBase. Why go through the Put flow anyway (memstore, flush, WAL), especially if you have the input ready at your disposal for a re-try if something fails?
> Sounds faster to me anyway.
> 
> On May 30, 2013, at 10:52 PM, Rob Verkuylen <ro...@verkuylen.net> wrote:
> 
>> 
>> On May 30, 2013, at 4:51, Stack <st...@duboce.net> wrote:
>> 
>>> Triggering a major compaction does not alter the overall 217.5GB size?
>> 
>> A major compaction reduces the size from the original 219GB to 217.5GB, so barely a reduction. 
>> 80% of the region sizes are 1.4GB before and after. I haven't merged the smaller regions,
>> but that still would not bring the size down to the 2.5-5GB or so I would expect given T2's size.
>> 
>>> You have speculative execution turned on in your MR job so its possible you
>>> write many versions?
>> 
>> I've turned off speculative execution (through conf.set) just for the mappers; since we're not using reducers, should we? 
>> I will triple-check the actual job settings in the job tracker, since I need to make the settings at the job level.
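(For reference, on the pre-YARN mapred API that was current at the time, disabling map-side speculation for a map-only job is typically just the following; the property name is an assumption about the Hadoop version in use.)

    // Map-only job: no reducers, and no speculative map attempts double-writing Puts.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    job.setNumReduceTasks(0);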
>> 
>>> Does your MR job fail many tasks (and though a task fails, until it fails it
>>> will have written some subset of its data, hence bloating your versions)?
>> 
>> We've had problems with failing mappers because of ZooKeeper timeouts on large inserts;
>> we increased the ZooKeeper session timeout and blockingStoreFiles to accommodate. Now we don't
>> get failures. This job writes to a cleanly made table with versions set to 1, so there shouldn't be
>> extra versions, I assume(?).
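(The two settings mentioned above are region-server-side properties in hbase-site.xml; the values below are illustrative only, not the ones used on this cluster.)

    <!-- hbase-site.xml on the region servers; example values only -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value> <!-- ms; survive longer pauses under heavy write load -->
    </property>
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>30</value> <!-- allow more store files before writes block -->
    </property>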
>> 
>>> You are putting everything into protobufs?  Could that be bloating your
>>> data?  Can you take a smaller subset and dump to the log a string version
>>> of the pb.  Use TextFormat
>>> https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder)
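(Along those lines, a one-line dump for a small sample of records could look like this, assuming the parser produced a protobuf message msg and the mapper has a LOG instance:)

    // Compact, single-line textual form of the protobuf, written to the task log.
    LOG.info("parsed pb: " + com.google.protobuf.TextFormat.shortDebugString(msg));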
>> 
>> The protobufs reduce the size to roughly 40% of the original XML data in T1. 
>> The MR parser is a port of the Python parse code we use going from T1 to T2.
>> I've done manual comparisons on 20-30 records from T2.1 and T2 and they are essentially identical, 
>> with only minute differences because of slightly different parsing. I've done these in the hbase shell;
>> I will try log-dumping them too.
>> 
>>> It can be informative looking at hfile content.  It could give you a clue
>>> as to the bloat.  See http://hbase.apache.org/book.html#hfile_tool
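(The HFile tool runs from the command line; the path below is a placeholder for one of the table's store files in HDFS.)

    # -m prints the HFile metadata, -p the key/values, -v is verbose
    hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -p \
      -f hdfs:///hbase/T2.1/<region>/<column-family>/<hfile>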
>> 
>> I will give this a go and report back. Any other debugging suggestions are more than welcome :)
>> 
>> Thnx, Rob
>> 
> 


Re: Explosion in datasize using HBase as a MR sink

Posted by Stack <st...@duboce.net>.
On Tue, Jun 4, 2013 at 9:58 PM, Rob Verkuylen <ro...@verkuylen.net> wrote:

> Finally fixed this, my code was at fault.
>
> Protobufs require a builder object which was a (non static) protected
> object in an abstract class all parsers extend. The mapper calls a parser
> factory depending on the input record. Because we designed the parser
> instances as singletons, the builder object in the abstract class got
> reused and all data got appended to the same builder. Doh! This only shows
> up in a job, not in single tests. Ah well, I've learned a lot  :)
>
>
Thanks for updating the list Rob.

Yours is a classic, except it is the first time I've heard of someone
protobufing it...  Usually it is a reuse of a Hadoop Writable instance
accumulating...

St.Ack
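
(For readers who have not met the "classic" referred to above: the usual shape is storing references to a Writable that Hadoop reuses across iterations instead of copying it, e.g. inside a Reducer<Text, Text, ...> as in this hypothetical sketch.)

    // Hadoop hands the values iterator the SAME Text instance on every step,
    // so keeping references without copying leaves every entry pointing at the
    // last value seen; the Writable cousin of the shared protobuf builder above.
    List<Text> buggy = new ArrayList<Text>();
    List<Text> fixed = new ArrayList<Text>();
    for (Text v : values) {
      buggy.add(v);            // all elements end up identical
      fixed.add(new Text(v));  // defensive copy per value
    }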