Posted to user@hbase.apache.org by Stanislav Barton <st...@internetmemory.net> on 2012/01/11 15:47:18 UTC

bulk loading and RegionObservers

Hello,

I tried to find this in the documentation, but it is still not clear to
me. I do a lot of bulk loading with a MapReduce job whose output is
HFiles that are then loaded into HBase automatically, and I was
wondering whether loading data this way bypasses the RegionObserver
mechanism (my guess is that it does), meaning that coprocessors defined
this way won't be fired when the new data is loaded into HBase. Is my
assumption correct?

Stan

Re: bulk loading and RegionObservers

Posted by Stanislav Barton <st...@internetmemory.net>.

Andrew Purtell <ap...@...> writes:

 
> CPs hook compaction by allowing one to wrap the scanner that is iterating
> over the store files. So the wrapper gets a chance to examine the KeyValues
> being processed and also has an opportunity to modify or drop them.
>
> Similarly for incoming HFiles for bulk load, the CP could be given a scanner
> iterating over those files, if you had a RegionObserver installed. You would
> be given the option in effect to rewrite the incoming HFiles before they are
> handed over to the RegionServer for addition to the region.
>
> This is the right approach to interface design here, IMO, because the fact
> you are given a scanner highlights the bulk nature of the input.
>
> Is this something you could use?
>
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


Yes, I think that could come in handy, at least to let the trigger code
know that new data is coming.

stan

Re: bulk loading and RegionObservers

Posted by Andrew Purtell <ap...@apache.org>.

> I think that people asking for this kind of access would like to be able to
> trigger the action at the row level (so, again, whenever a Put with new
> values arrives). But I think this would not scale: wouldn't it take a long
> time to scan the new region and fire prePut() calls on the RO for the new
> region?

CPs hook compaction by allowing one to wrap the scanner that is iterating over the store files. The wrapper thus gets a chance to examine the KeyValues being processed, and also an opportunity to modify or drop them.

Similarly, for incoming HFiles in a bulk load, the CP could be given a scanner iterating over those files, if you had a RegionObserver installed. In effect, you would be given the option to rewrite the incoming HFiles before they are handed over to the RegionServer for addition to the region.

This is the right approach to interface design here, IMO, because the fact that you are given a scanner highlights the bulk nature of the input.
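
To make the scanner-wrapping idea concrete, here is a minimal sketch in plain Java. It does not use the HBase API: `Cell`, `WrappingScanner`, `drain`, and `dropEmptyAndUpperCase` are hypothetical names made up for illustration, and a store file is modelled as a simple in-memory list. It only shows the shape of the idea, namely that the wrapper sees every entry and may pass it through, modify it, or drop it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class ScannerWrapSketch {

    /** Hypothetical stand-in for an HBase KeyValue (row key plus value only). */
    static final class Cell {
        final String row;
        final String value;

        Cell(String row, String value) {
            this.row = row;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + "=" + value;
        }
    }

    /**
     * Wraps an underlying scanner. The rewrite function sees every cell and
     * returns either a (possibly modified) cell to keep, or null to drop it.
     */
    static final class WrappingScanner implements Iterator<Cell> {
        private final Iterator<Cell> delegate;
        private final Function<Cell, Cell> rewrite;
        private Cell pending; // next cell to hand out, or null when exhausted

        WrappingScanner(Iterator<Cell> delegate, Function<Cell, Cell> rewrite) {
            this.delegate = delegate;
            this.rewrite = rewrite;
            advance();
        }

        /** Pulls from the delegate until a cell survives the rewrite. */
        private void advance() {
            pending = null;
            while (pending == null && delegate.hasNext()) {
                pending = rewrite.apply(delegate.next());
            }
        }

        @Override
        public boolean hasNext() {
            return pending != null;
        }

        @Override
        public Cell next() {
            if (pending == null) {
                throw new NoSuchElementException();
            }
            Cell result = pending;
            advance();
            return result;
        }
    }

    /** Drains a scanner into a list of "row=value" strings. */
    static List<String> drain(Iterator<Cell> scanner) {
        List<String> out = new ArrayList<>();
        while (scanner.hasNext()) {
            out.add(scanner.next().toString());
        }
        return out;
    }

    /** Example rewrite: drop cells with empty values, upper-case the rest. */
    static Cell dropEmptyAndUpperCase(Cell c) {
        return c.value.isEmpty() ? null : new Cell(c.row, c.value.toUpperCase());
    }

    public static void main(String[] args) {
        List<Cell> storeFile = Arrays.asList(
                new Cell("r1", "a"), new Cell("r2", ""), new Cell("r3", "c"));
        Iterator<Cell> wrapped = new WrappingScanner(
                storeFile.iterator(), ScannerWrapSketch::dropEmptyAndUpperCase);
        System.out.println(drain(wrapped)); // prints [r1=A, r3=C]
    }
}
```

Because the wrapper is itself a scanner, whatever consumes it (compaction, or a bulk load in the proposal above) would not need to know that a coprocessor is involved.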

Is this something you could use?


Best regards,


  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


----- Original Message -----
> From: Stanislav Barton <st...@internetmemory.net>
> To: user@hbase.apache.org
> Cc: 
> Sent: Thursday, January 12, 2012 3:03 AM
> Subject: Re: bulk loading and RegionObservers
> 
> Andrew Purtell <ap...@...> writes:
> 
>> 
>>  Yes this is correct.
>>
>>  Coprocessors / RegionObservers and bulk loading have been developing
>>  separately in parallel.
>>
>>  Now that bulk loading changes are settling down, I've been considering
>>  adding CP hooks into the bulk load process, at the HRegion level, without
>>  complicating atomicity. A simple and straightforward course of action is
>>  to give the CP the option of rewriting the submitted store file(s) before
>>  the regionserver attempts to validate and move them into the store. This
>>  is similar to how CPs are hooked into compaction. Would this be
>>  sufficient for what you want to do?
>>
>>  Best regards,
>>
>>         - Andy
>>
>>  Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>  (via Tom White)
>>
> 
> 
> I think that people asking for this kind of access would like to be able to
> trigger the action at the row level (so, again, whenever a Put with new
> values arrives). But I think this would not scale: wouldn't it take a long
> time to scan the new region and fire prePut() calls on the RO for the new
> region? I have experience doing 30 GB bulk load steps into a pre-split table
> in order to maintain the highest throughput and keep overhead as low as
> possible (on a fairly small cluster (~10) of small machines).
> 
> --
> 
> Stan
>

Re: bulk loading and RegionObservers

Posted by Stanislav Barton <st...@internetmemory.net>.

Andrew Purtell <ap...@...> writes:

> 
> Yes this is correct.
>
> Coprocessors / RegionObservers and bulk loading have been developing
> separately in parallel.
>
> Now that bulk loading changes are settling down, I've been considering
> adding CP hooks into the bulk load process, at the HRegion level, without
> complicating atomicity. A simple and straightforward course of action is to
> give the CP the option of rewriting the submitted store file(s) before the
> regionserver attempts to validate and move them into the store. This is
> similar to how CPs are hooked into compaction. Would this be sufficient for
> what you want to do?
>
> Best regards,
>
>        - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
> 


I think that people asking for this kind of access would like to be able to
trigger the action at the row level (so, again, whenever a Put with new
values arrives). But I think this would not scale: wouldn't it take a long
time to scan the new region and fire prePut() calls on the RO for the new
region? I have experience doing 30 GB bulk load steps into a pre-split table
in order to maintain the highest throughput and keep overhead as low as
possible (on a fairly small cluster (~10) of small machines).

--

Stan

Re: bulk loading and RegionObservers

Posted by Andrew Purtell <ap...@apache.org>.

Yes, this is correct.

Coprocessors / RegionObservers and bulk loading have been developed separately, in parallel.

Now that the bulk loading changes are settling down, I've been considering adding CP hooks into the bulk load process, at the HRegion level, without complicating atomicity. A simple and straightforward course of action is to give the CP the option of rewriting the submitted store file(s) before the regionserver attempts to validate them and move them into the store. This is similar to how CPs are hooked into compaction. Would this be sufficient for what you want to do?
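
As a rough illustration of where such a hook could sit, here is a minimal sketch in plain Java. It is not HBase code and not the eventual API: `Region`, `bulkLoad`, and the `observer` callback are hypothetical names, and a store file is modelled as a sorted list of row keys. The point is only the ordering: the observer may rewrite each submitted file first, the region then validates the result, and the files are added all together or not at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class BulkLoadHookSketch {

    /** Hypothetical region; a "store file" is modelled as a sorted list of row keys. */
    static final class Region {
        private final String startKey; // inclusive
        private final String endKey;   // exclusive
        private final UnaryOperator<List<String>> observer; // hypothetical hook, may be null
        private final List<List<String>> storeFiles = new ArrayList<>();

        Region(String startKey, String endKey, UnaryOperator<List<String>> observer) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.observer = observer;
        }

        /**
         * Gives the observer a chance to rewrite each submitted file, validates
         * the results, and adds them only if every file passes validation.
         */
        boolean bulkLoad(List<List<String>> submitted) {
            List<List<String>> staged = new ArrayList<>();
            for (List<String> file : submitted) {
                List<String> candidate = (observer == null) ? file : observer.apply(file);
                if (!isValid(candidate)) {
                    return false; // nothing has been added to the store yet
                }
                staged.add(candidate);
            }
            storeFiles.addAll(staged); // all-or-nothing hand-over
            return true;
        }

        /** A file is valid if its keys are strictly ascending and within the region's range. */
        private boolean isValid(List<String> file) {
            String previous = null;
            for (String key : file) {
                boolean inRange = key.compareTo(startKey) >= 0 && key.compareTo(endKey) < 0;
                boolean sorted = previous == null || previous.compareTo(key) < 0;
                if (!inRange || !sorted) {
                    return false;
                }
                previous = key;
            }
            return true;
        }

        List<List<String>> storeFiles() {
            return storeFiles;
        }
    }

    public static void main(String[] args) {
        // Example observer: drop temporary rows from every incoming file.
        UnaryOperator<List<String>> dropTmp = file -> file.stream()
                .filter(key -> !key.startsWith("b-tmp"))
                .collect(Collectors.toList());

        Region region = new Region("a", "m", dropTmp);
        boolean loaded = region.bulkLoad(Arrays.asList(
                Arrays.asList("a1", "b-tmp", "c3"), Arrays.asList("d4")));
        System.out.println(loaded + " " + region.storeFiles());

        // "z9" lies outside [a, m), so the whole batch is rejected.
        boolean rejected = region.bulkLoad(Arrays.asList(
                Arrays.asList("e5"), Arrays.asList("z9")));
        System.out.println(rejected + " " + region.storeFiles());
    }
}
```

Running the rewrite before validation means the region never has to trust the observer's output: a file that the observer leaves unsorted or out of range is rejected the same way a bad submitted file would be.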
 
Best regards,


       - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

