Posted to user@hbase.apache.org by Bryan Keller <br...@gmail.com> on 2011/09/10 22:22:11 UTC

Speculative execution and TableOutputFormat

I believe there is a problem with Hadoop's speculative execution (which is on by default) and HBase's TableOutputFormat. If I understand correctly, speculative execution can launch the same task on multiple nodes but only "commit" the one that finishes first. The attempts that didn't finish are killed.

I encountered some strange behavior with speculative execution and TableOutputFormat. It looks like context.write() will submit rows to HBase whenever the write buffer fills, but there is no "rollback" if the task that submitted the rows did not finish first and is later killed. The rows remain submitted.

My particular job uses a partitioner, so one node processes all records that match the partition. The reducer selects among those records and persists them to HBase. With speculative execution turned on, the reducer for the partition actually runs on two nodes, and both end up inserting into HBase, even though the second reducer is eventually killed. The results were not what I wanted.

Turning off speculative execution resolved my issue. I think it should be off by default when using TableOutputFormat, unless there is a better way to handle this.
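
For anyone who hits this, a minimal sketch of the workaround (the property names are the pre-YARN mapred.* keys that were current at the time; the class and job names here are just illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoSpeculationJob {
      public static Job createJob() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Make sure only one attempt of each task writes to HBase;
        // TableOutputFormat cannot roll back rows flushed by a killed attempt.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return new Job(conf, "hbase-sink-job");
      }
    }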

Re: Speculative execution and TableOutputFormat

Posted by Leif Wickland <le...@gmail.com>.
I'm also quite interested in whether anyone has feedback on Ryan's reasoning.

On Mon, Sep 12, 2011 at 1:18 PM, Brush,Ryan <RB...@cerner.com> wrote:

> If you'll forgive the slight topic shift, it seems like the pattern of
> writing directly to HFiles rather than the TableOutputFormat would be
> better for several cases. For instance, TableOutputFormat results in
> everything being written to the WAL, and later compacted into HFiles.
> When practical, why not skip that interim state and produce the HFile
> directly, then do a bulk load?
>
>
> Of course not all jobs that use the TableOutputFormat can easily write to
> HFiles; those files require a strict ordering of row keys being output,
> and bulk loads are optimal only if the HFiles align with existing regions.
> But if such requirements are met, it seems like moving away from
> TableOutputFormat could help IO-bound jobs significantly.
>
> Is my reasoning sound?

Re: Speculative execution and TableOutputFormat

Posted by "Brush,Ryan" <RB...@CERNER.COM>.
If you'll forgive the slight topic shift, it seems like the pattern of
writing directly to HFiles rather than the TableOutputFormat would be
better for several cases. For instance, TableOutputFormat results in
everything being written to the WAL, and later compacted into HFiles.
When practical, why not skip that interim state and produce the HFile
directly, then do a bulk load?


Of course not all jobs that use the TableOutputFormat can easily write to
HFiles; those files require a strict ordering of row keys being output,
and bulk loads are optimal only if the HFiles align with existing regions.
But if such requirements are met, it seems like moving away from
TableOutputFormat could help IO-bound jobs significantly.

Is my reasoning sound?
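
A rough sketch of the job setup for this pattern, using the HFileOutputFormat.configureIncrementalLoad() helper from HBase 0.90-era releases (the table name and output path are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HFileJobSetup {
      // The job's map output should be ImmutableBytesWritable row keys with
      // Put (or KeyValue) values.
      public static void configureForBulkLoad(Job job) throws IOException {
        HTable table = new HTable(job.getConfiguration(), "mytable");
        // Wires in a TotalOrderPartitioner built from the table's current
        // region boundaries, plus a sorting reducer, which is what satisfies
        // the row-key ordering and region-alignment requirements above.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-output"));
      }
    }

Nothing touches the WAL or the memstore; the regions see the data only when the HFiles are bulk loaded afterward.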

On 9/12/11 12:40 PM, "Leif Wickland" <le...@gmail.com> wrote:

>Thanks, Bryan.  I'd love to hear any lessons you learn.  I've used that
>technique successfully at a prototype level, but haven't yet moved it to
>production.
>
>Leif


Re: Speculative execution and TableOutputFormat

Posted by Leif Wickland <le...@gmail.com>.
Thanks, Bryan.  I'd love to hear any lessons you learn.  I've used that
technique successfully at a prototype level, but haven't yet moved it to
production.

Leif

On Mon, Sep 12, 2011 at 10:51 AM, Bryan Keller <br...@gmail.com> wrote:

> Ah, that is a very interesting solution, Leif; this seems optimal to me. I
> am going to try this and I'll report back.

Re: Speculative execution and TableOutputFormat

Posted by Bryan Keller <br...@gmail.com>.
Ah, that is a very interesting solution, Leif; this seems optimal to me. I am going to try this and I'll report back.

On Sep 12, 2011, at 9:09 AM, Leif Wickland wrote:

> 
> Bryan,
> 
> Have you considered writing your MR output to the HFile format and then
> asking the regions to adopt the result? That would allow you to avoid
> committing any changes to HBase until you knew that the MR job ran
> successfully.
> 
> Leif


Re: Speculative execution and TableOutputFormat

Posted by Leif Wickland <le...@gmail.com>.
>
> I am thinking of writing the results to a file first, then reading and
> persisting to HBase from the file, to avoid this. The failover would work
> because Hadoop throws out output from task attempts that were not
> committed. This does put a lot of extra IO on the cluster, though.


Bryan,

Have you considered writing your MR output to the HFile format and then
asking the regions to adopt the result? That would allow you to avoid
committing any changes to HBase until you knew that the MR job ran
successfully.

Leif
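
A sketch of the "adopt the result" step, assuming the LoadIncrementalHFiles class from HBase 0.90-era releases (the path and table name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class AdoptHFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        // Run this only after the MR job that wrote the HFiles has
        // succeeded; nothing is visible to HBase until this point.
        new LoadIncrementalHFiles(conf).doBulkLoad(
            new Path("/tmp/bulk-output"), table);
      }
    }

Because the regions adopt completed files in one step, a failed or speculatively killed task attempt never leaves partial rows behind.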

Re: Speculative execution and TableOutputFormat

Posted by Bryan Keller <br...@gmail.com>.
Thanks for pointing that out in the book. I suppose a similar problem could occur even with speculative execution turned off. If a task is using HBase as a sink and it runs halfway before the node fails, the task is rerun on another node, but the rows from the failed attempt are not rolled back. So it seems you could have extra rows inserted if the second run produces a different result set (perhaps due to time-sensitive data or randomization in the result).

I am thinking of writing the results to a file first, then reading and persisting to HBase from the file, to avoid this. The failover would work because Hadoop throws out output from task attempts that were not committed. This does put a lot of extra IO on the cluster, though.
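
A sketch of what the replay half of that idea could look like, assuming the first job stages its output as (ImmutableBytesWritable, Put) pairs in a SequenceFile (Put was Writable in HBase releases of that era; the class name is illustrative):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only second stage: replay the staged Puts into HBase through
    // TableOutputFormat. The staging file freezes the result set, so a
    // rerun of a failed attempt replays identical rows.
    public class ReplayMapper extends
        Mapper<ImmutableBytesWritable, Put, ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Put put, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(row, put);
      }
    }

The replay job would read the staging file with SequenceFileInputFormat, set TableOutputFormat as its output format, and name the target table via the TableOutputFormat.OUTPUT_TABLE property.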


On Sep 10, 2011, at 2:46 PM, Doug Meil wrote:

> 
> Hi Bryan, yep, that same advice is in the HBase book.
> 
> http://hbase.apache.org/book.html#mapreduce.specex
> 
> That's a good suggestion, and perhaps moving that config to
> TableMapReduceUtil would be beneficial.


Re: Speculative execution and TableOutputFormat

Posted by Doug Meil <do...@explorysmedical.com>.
Hi Bryan, yep, that same advice is in the HBase book.

http://hbase.apache.org/book.html#mapreduce.specex

That's a good suggestion, and perhaps moving that config to
TableMapReduceUtil would be beneficial.
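
A hypothetical sketch of what such a convenience could look like (this helper does not exist in TableMapReduceUtil; the method name is invented here):

    import java.io.IOException;

    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.mapreduce.Job;

    public class SafeTableJobs {
      // Invented convenience method: same as initTableReducerJob, but it
      // also turns speculative execution off for the HBase-writing phase.
      public static void initTableReducerJobSafely(String table,
          Class<? extends TableReducer> reducer, Job job) throws IOException {
        TableMapReduceUtil.initTableReducerJob(table, reducer, job);
        job.getConfiguration().setBoolean(
            "mapred.reduce.tasks.speculative.execution", false);
      }
    }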




On 9/10/11 4:22 PM, "Bryan Keller" <br...@gmail.com> wrote:

>I believe there is a problem with Hadoop's speculative execution (which
>is on by default) and HBase's TableOutputFormat. If I understand
>correctly, speculative execution can launch the same task on multiple
>nodes but only "commit" the one that finishes first. The attempts that
>didn't finish are killed.
>
>I encountered some strange behavior with speculative execution and
>TableOutputFormat. It looks like context.write() will submit rows to
>HBase whenever the write buffer fills, but there is no "rollback" if the
>task that submitted the rows did not finish first and is later killed.
>The rows remain submitted.
>
>My particular job uses a partitioner, so one node processes all records
>that match the partition. The reducer selects among those records and
>persists them to HBase. With speculative execution turned on, the
>reducer for the partition actually runs on two nodes, and both end up
>inserting into HBase, even though the second reducer is eventually
>killed. The results were not what I wanted.
>
>Turning off speculative execution resolved my issue. I think it should
>be off by default when using TableOutputFormat, unless there is a better
>way to handle this.