Posted to user@hbase.apache.org by Jacques <wh...@gmail.com> on 2011/08/03 00:44:09 UTC

Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Given the positive reviews and the timing, we recently shifted from 90.3
(apache) to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).

We had a network switch go down last night which caused an apparent network
partition between two of our region servers and one or more zk nodes.
 (We're still piecing together the situation).  Anyway, things *seemed* to
recover fine.  However, this morning we realized that we lost some data that
was generated just before the problems occurred.

It looks like h002 went down almost immediately, at around 8pm, while h001
didn't go down until around 8:10pm (we're somewhat confused by this).  We're
thinking that this may have contributed to the problem.  The particular
table that had data issues is a very small table with a single region that
was running on h002 when it went down.

We know the corruption/missing edits affected two tables.  The damage
extended across a number of rows and appears to reach back to data inserted
roughly 6 hours earlier (our estimate).  The two tables we can verify errors
on each hold at most 10-20k rows of under 1k each.  In some places newly
added rows are completely missing, and in others only some cell edits are
missing.  As an aside, I thought there was a time-based memstore flush in
addition to the size-based one, but reviewing the hbase default configuration
I don't see any mention of it.  Is the flush purely size based?

We don't have the tools in place to verify exactly what other data or tables
may have been impacted.

The log files are at the pastebin links below.  The whole cluster is 8
nodes + master, with 3 zk nodes running on separate machines.  We run with
mostly standard settings but have changed the following:
heap: 12gb
regionsize: 4gb (due to lots of cold data and not enough servers, avg 300
regions/server)
mslab: 4m/512k (due to somewhat frequent updates to larger objects in the
200-500k size range)

We've been using hbase for about a year now and have been nothing but happy
with it.  The failure state we hit last night (where only some region
servers could not talk to some zk servers) seems like a strange one.

Any thoughts? (beyond chiding us for switching to an rc)  Any opinions on
whether we should roll back to 90.3 (or 90.3+cloudera)?

Thanks for any help,
Jacques

master: http://pastebin.com/aG8fm2KZ
h001: http://pastebin.com/nLLk06EC
h002: http://pastebin.com/0wPFuZDx
h003: http://pastebin.com/3ZMV01mA
h004: http://pastebin.com/0YVefuqS
h005: http://pastebin.com/N90LDjvs
h006: http://pastebin.com/gM8umekW
h007: http://pastebin.com/0TVvX68d
h008: http://pastebin.com/mV968Cem

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by "M. C. Srivas" <mc...@gmail.com>.
On Fri, Aug 5, 2011 at 11:28 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Fri, Aug 5, 2011 at 10:21 AM, M. C. Srivas <mc...@gmail.com> wrote:
> > On Fri, Aug 5, 2011 at 9:42 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com>
> wrote:
> >> > The normal behavior would be for the HMaster to make the hlog
> read-only
> >> > before processing it.... very simple fencing and works on all Posix or
> >> > close-to-Posix systems.  Does that not work on HDFS?
> >>
> >> I'm sure you know the answer to that :)
> >>
> >
> > Actually, I really don't know HDFS details ... so does chmod work for
> all?
> >
> > We're thinking for submitting a patch to Hbase that does the following:
> >
> > recoverLease() {
> >   if (fs is DistributedFileSystem) {
> >     call the NN's recover lease API
> >   } else if (fs supports setPermissions) {
> >     call fs.setPermissions
>
> In most filesystems, permissions are only checked when a file is
> opened, not on every write. That's true of HDFS as well as Linux's
> behavior with all local filesystems I'm aware of. As far as I know
> it's the case with most NFS filers as well. Hence the existence of
> explicit fencing commands accessible via remote administrative
> interfaces on NetApps, for example.
>

Well, there's no "open" or "close" in the NFS protocol.  A client does
lookup/getattr and may cache the results forever ... and a week later come
back and issue a write.

Looks like NFS does something else altogether. From the RFC
http://www.armware.dk/RFC/rfc/rfc1813.html

   "On a local file system, a user can open
   a file and then change the permissions so that no one is
   allowed to touch it, but will still be able to write to the
   file because it is open. On a remote file system, by
   contrast, the write would fail. To get around this problem,
   the server's permission checking algorithm should allow the
   owner of a file to access it regardless of the permission
   setting. This is needed in a practical NFS version 3 protocol
   server implementation, but it does depart from correct local
   file system semantics."




>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Aug 5, 2011 at 10:21 AM, M. C. Srivas <mc...@gmail.com> wrote:
> On Fri, Aug 5, 2011 at 9:42 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com> wrote:
>> > The normal behavior would be for the HMaster to make the hlog read-only
>> > before processing it.... very simple fencing and works on all Posix or
>> > close-to-Posix systems.  Does that not work on HDFS?
>>
>> I'm sure you know the answer to that :)
>>
>
> Actually, I really don't know HDFS details ... so does chmod work for all?
>
> We're thinking for submitting a patch to Hbase that does the following:
>
> recoverLease() {
>   if (fs is DistributedFileSystem) {
>     call the NN's recover lease API
>   } else if (fs supports setPermissions) {
>     call fs.setPermissions

In most filesystems, permissions are only checked when a file is
opened, not on every write. That's true of HDFS as well as Linux's
behavior with all local filesystems I'm aware of. As far as I know
it's the case with most NFS filers as well. Hence the existence of
explicit fencing commands accessible via remote administrative
interfaces on NetApps, for example.
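
To make this concrete, here's a minimal stand-alone Java sketch (illustrative
only, nothing to do with the HBase code) showing the open-time check on a
local POSIX filesystem: stripping every permission bit after the file is open
does not stop further writes through the already-open stream.

import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Collections;

public class OpenTimePermCheck {
  public static void main(String[] args) throws Exception {
    Path p = Files.createTempFile("fence-demo", ".log");
    FileOutputStream out = new FileOutputStream(p.toFile());  // permission check happens here
    out.write("first edit\n".getBytes());

    // "Fence" the file by removing all permission bits (chmod 000).
    Files.setPosixFilePermissions(p, Collections.<PosixFilePermission>emptySet());

    out.write("second edit\n".getBytes());                    // still succeeds: fd is already open
    out.close();
    System.out.println("both writes went through despite chmod 000");
  }
}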

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by "M. C. Srivas" <mc...@gmail.com>.
On Fri, Aug 5, 2011 at 9:42 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com> wrote:
> > The normal behavior would be for the HMaster to make the hlog read-only
> > before processing it.... very simple fencing and works on all Posix or
> > close-to-Posix systems.  Does that not work on HDFS?
>
> I'm sure you know the answer to that :)
>

Actually, I really don't know HDFS details ... so does chmod work for all?

We're thinking of submitting a patch to HBase that does the following:

recoverLease() {
   if (fs is DistributedFileSystem) {
     call the NN's recover lease API
   } else if (fs supports setPermissions) {
     call fs.setPermissions
   } else {
     log WARN ....
   }
}

Does that look ok?
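
Rendered as a rough Java sketch against the Hadoop FileSystem API (just a
sketch, not the actual FSUtils code; the recoverLease() call assumes an HDFS
build that exposes it, and the setPermission fallback only fences writers if
the filesystem re-checks permissions on write):

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
  private static final Log LOG = LogFactory.getLog(LeaseRecoverySketch.class);

  // Sketch of the proposed fallback, not the real FSUtils method.
  public static void recoverFileLease(FileSystem fs, Path hlog) throws IOException {
    if (fs instanceof DistributedFileSystem) {
      // Ask the NameNode to revoke the old writer's lease.
      ((DistributedFileSystem) fs).recoverLease(hlog);
    } else {
      try {
        // Fallback: mark the log read-only (0444) as a best-effort fence.
        fs.setPermission(hlog, new FsPermission((short) 0444));
      } catch (IOException e) {
        LOG.warn("Could not fence " + hlog + " via setPermission", e);
      }
    }
  }
}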



>
> I agree it would be a better solution as it won't rely on the
> "particular" semantics of a single dfs.
>
> J-D
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
HDFS-1520 was forward ported to trunk by Stack:

https://issues.apache.org/jira/browse/HDFS-1948

J-D

On Fri, Aug 5, 2011 at 9:45 AM, Ryan Rawson <ry...@gmail.com> wrote:
> The IO fencing was an accidental byproduct of how HDFS-200 was
> implemented, so in fact, HBase won't run correctly on HDFS-265 which
> does NOT have that IO fencing, right?
>
>
>
> On Fri, Aug 5, 2011 at 9:42 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com> wrote:
>>> The normal behavior would be for the HMaster to make the hlog read-only
>>> before processing it.... very simple fencing and works on all Posix or
>>> close-to-Posix systems.  Does that not work on HDFS?
>>
>> I'm sure you know the answer to that :)
>>
>> I agree it would be a better solution as it won't rely on the
>> "particular" semantics of a single dfs.
>>
>> J-D
>>
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Ryan Rawson <ry...@gmail.com>.
The IO fencing was an accidental byproduct of how HDFS-200 was
implemented, so in fact, HBase won't run correctly on HDFS-265 which
does NOT have that IO fencing, right?



On Fri, Aug 5, 2011 at 9:42 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com> wrote:
>> The normal behavior would be for the HMaster to make the hlog read-only
>> before processing it.... very simple fencing and works on all Posix or
>> close-to-Posix systems.  Does that not work on HDFS?
>
> I'm sure you know the answer to that :)
>
> I agree it would be a better solution as it won't rely on the
> "particular" semantics of a single dfs.
>
> J-D
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Fri, Aug 5, 2011 at 8:52 AM, M. C. Srivas <mc...@gmail.com> wrote:
> The normal behavior would be for the HMaster to make the hlog read-only
> before processing it.... very simple fencing and works on all Posix or
> close-to-Posix systems.  Does that not work on HDFS?

I'm sure you know the answer to that :)

I agree it would be a better solution as it won't rely on the
"particular" semantics of a single dfs.

J-D

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by "M. C. Srivas" <mc...@gmail.com>.
The normal behavior would be for the HMaster to make the hlog read-only
before processing it.... very simple fencing and works on all Posix or
close-to-Posix systems.  Does that not work on HDFS?


On Fri, Aug 5, 2011 at 7:07 AM, M. C. Srivas <mc...@gmail.com> wrote:

>
>
> On Thu, Aug 4, 2011 at 9:01 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Thu, Aug 4, 2011 at 8:36 PM, lohit <lo...@gmail.com> wrote:
>> > 2011/8/4 Ryan Rawson <ry...@gmail.com>
>> >
>> >> Yes, that is what JD is referring to, the so-called IO fence.
>> >>
>> >> It works like so:
>> >> - regionserver is appending to an HLog, continues to do so, hasnt
>> >> gotten the ZK "kill yourself signal" yet
>> >> - hmaster splits the logs
>> >> - the hmaster yanks the writer from under the regionserver, and the RS
>> >> then starts to kill itself
>> >>
>> > Can you tell more about how this is done with HDFS. If RS has the lease,
>> how
>> > did master get hold of that lease. Or is it removing file?
>>
>> In older versions, it would call append() which recovered the lease,
>> so long as the soft lease timeout had expired. More recently, it calls
>> an HDFS "recoverLease" API that provides fencing.
>>
>
> Looks like we need a patch in both HBase and MapR ... even if MapR had
> leases, this piece of code in FSUtils.java prevents it being called.
>
>     if (!(fs instanceof DistributedFileSystem)) {
>       return;
>     }
>
> Someone will be issuing a patch for both MapR and HBase to fix this in a
> couple of days. (I am on vacation).
>
>
>
>
>>
>> >>
>> >>
>> >> This can happen because ZK can deliver the session lost message late,
>> >> and there is a race.
>> >>
>> >> -ryan
>> >>
>> >> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mc...@gmail.com>
>> wrote:
>> >> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <
>> jdcryans@apache.org
>> >> >wrote:
>> >> >
>> >> >> > Thanks for the feedback.  So you're inclined to think it would be
>> at
>> >> the
>> >> >> dfs
>> >> >> > layer?
>> >> >>
>> >> >> That's where the evidence seems to point.
>> >> >>
>> >> >> >
>> >> >> > Is it accurate to say the most likely places where the data could
>> have
>> >> >> been
>> >> >> > lost were:
>> >> >> > 1. wal writes didn't actually get written to disk (no log entries
>> to
>> >> >> suggest
>> >> >> > any issues)
>> >> >>
>> >> >> Most likely.
>> >> >>
>> >> >> > 2. wal corrupted (no log entries suggest any trouble reading the
>> log)
>> >> >>
>> >> >> In that case the logs would scream (and I didn't see that in the
>> logs
>> >> >> I looked at).
>> >> >>
>> >> >> > 3. not all split logs were read by regionservers  (?? is there any
>> way
>> >> to
>> >> >> > ensure this either way... should I look at the filesystem some
>> place?)
>> >> >>
>> >> >> Some regions would have recovered edits files, but that seems highly
>> >> >> unlikely. With DEBUG enabled we could have seen which files were
>> split
>> >> >> by the master and which ones were created for the regions, and then
>> >> >> which were read by the region servers.
>> >> >>
>> >> >> >
>> >> >> > Do you think the type of network partition I'm talking about is
>> >> >> adequately
>> >> >> > covered in existing tests? (Specifically running an external zk
>> >> cluster?)
>> >> >>
>> >> >> The IO fencing was only tested with HDFS, I don't know what happens
>> in
>> >> >> that case with MapR. What I mean is that when the master splits the
>> >> >> logs, it takes ownership of the HDFS writer lease (only one per
>> file)
>> >> >> so that it can safely close the log file. Then after that it checks
>> if
>> >> >> there are any new log files that were created (the region server
>> could
>> >> >> have rolled a log while the master was splitting them) and will
>> >> >> restart if that situation happens until it's able to own all files
>> and
>> >> >> split them.
>> >> >>
>> >> >
>> >> > JD,   I didn't think the master explicitly dealt with writer leases.
>> >> >
>> >> > Does HBase rely on single-writer semantics on the log file? That is,
>> if
>> >> the
>> >> > master and a RS both decide to mucky-muck with a log file, you expect
>> the
>> >> FS
>> >> > to lock out one of the writers?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > Have you heard if anyone else is been having problems with the
>> second
>> >> >> 90.4
>> >> >> > rc?
>> >> >>
>> >> >> Nope, we run it here on our dev cluster and didn't encounter any
>> issue
>> >> >> (with the code or node failure).
>> >> >>
>> >> >> >
>> >> >> > Thanks again for your help.  I'm following up with the MapR guys
>> as
>> >> well.
>> >> >>
>> >> >> Good idea!
>> >> >>
>> >> >> J-D
>> >> >>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Have a Nice Day!
>> > Lohit
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by "M. C. Srivas" <mc...@gmail.com>.
On Thu, Aug 4, 2011 at 9:01 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Aug 4, 2011 at 8:36 PM, lohit <lo...@gmail.com> wrote:
> > 2011/8/4 Ryan Rawson <ry...@gmail.com>
> >
> >> Yes, that is what JD is referring to, the so-called IO fence.
> >>
> >> It works like so:
> >> - regionserver is appending to an HLog, continues to do so, hasnt
> >> gotten the ZK "kill yourself signal" yet
> >> - hmaster splits the logs
> >> - the hmaster yanks the writer from under the regionserver, and the RS
> >> then starts to kill itself
> >>
> > Can you tell more about how this is done with HDFS. If RS has the lease,
> how
> > did master get hold of that lease. Or is it removing file?
>
> In older versions, it would call append() which recovered the lease,
> so long as the soft lease timeout had expired. More recently, it calls
> an HDFS "recoverLease" API that provides fencing.
>

Looks like we need a patch in both HBase and MapR ... even if MapR had
leases, this piece of code in FSUtils.java prevents it from being called.

    if (!(fs instanceof DistributedFileSystem)) {
      return;
    }

Someone will be issuing a patch for both MapR and HBase to fix this in a
couple of days. (I am on vacation).




>
> >>
> >>
> >> This can happen because ZK can deliver the session lost message late,
> >> and there is a race.
> >>
> >> -ryan
> >>
> >> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mc...@gmail.com>
> wrote:
> >> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> >wrote:
> >> >
> >> >> > Thanks for the feedback.  So you're inclined to think it would be
> at
> >> the
> >> >> dfs
> >> >> > layer?
> >> >>
> >> >> That's where the evidence seems to point.
> >> >>
> >> >> >
> >> >> > Is it accurate to say the most likely places where the data could
> have
> >> >> been
> >> >> > lost were:
> >> >> > 1. wal writes didn't actually get written to disk (no log entries
> to
> >> >> suggest
> >> >> > any issues)
> >> >>
> >> >> Most likely.
> >> >>
> >> >> > 2. wal corrupted (no log entries suggest any trouble reading the
> log)
> >> >>
> >> >> In that case the logs would scream (and I didn't see that in the logs
> >> >> I looked at).
> >> >>
> >> >> > 3. not all split logs were read by regionservers  (?? is there any
> way
> >> to
> >> >> > ensure this either way... should I look at the filesystem some
> place?)
> >> >>
> >> >> Some regions would have recovered edits files, but that seems highly
> >> >> unlikely. With DEBUG enabled we could have seen which files were
> split
> >> >> by the master and which ones were created for the regions, and then
> >> >> which were read by the region servers.
> >> >>
> >> >> >
> >> >> > Do you think the type of network partition I'm talking about is
> >> >> adequately
> >> >> > covered in existing tests? (Specifically running an external zk
> >> cluster?)
> >> >>
> >> >> The IO fencing was only tested with HDFS, I don't know what happens
> in
> >> >> that case with MapR. What I mean is that when the master splits the
> >> >> logs, it takes ownership of the HDFS writer lease (only one per file)
> >> >> so that it can safely close the log file. Then after that it checks
> if
> >> >> there are any new log files that were created (the region server
> could
> >> >> have rolled a log while the master was splitting them) and will
> >> >> restart if that situation happens until it's able to own all files
> and
> >> >> split them.
> >> >>
> >> >
> >> > JD,   I didn't think the master explicitly dealt with writer leases.
> >> >
> >> > Does HBase rely on single-writer semantics on the log file? That is,
> if
> >> the
> >> > master and a RS both decide to mucky-muck with a log file, you expect
> the
> >> FS
> >> > to lock out one of the writers?
> >> >
> >> >
> >> >
> >> >
> >> >>
> >> >> >
> >> >> > Have you heard if anyone else is been having problems with the
> second
> >> >> 90.4
> >> >> > rc?
> >> >>
> >> >> Nope, we run it here on our dev cluster and didn't encounter any
> issue
> >> >> (with the code or node failure).
> >> >>
> >> >> >
> >> >> > Thanks again for your help.  I'm following up with the MapR guys as
> >> well.
> >> >>
> >> >> Good idea!
> >> >>
> >> >> J-D
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> > Have a Nice Day!
> > Lohit
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 4, 2011 at 8:36 PM, lohit <lo...@gmail.com> wrote:
> 2011/8/4 Ryan Rawson <ry...@gmail.com>
>
>> Yes, that is what JD is referring to, the so-called IO fence.
>>
>> It works like so:
>> - regionserver is appending to an HLog, continues to do so, hasnt
>> gotten the ZK "kill yourself signal" yet
>> - hmaster splits the logs
>> - the hmaster yanks the writer from under the regionserver, and the RS
>> then starts to kill itself
>>
> Can you tell more about how this is done with HDFS. If RS has the lease, how
> did master get hold of that lease. Or is it removing file?

In older versions, it would call append() which recovered the lease,
so long as the soft lease timeout had expired. More recently, it calls
an HDFS "recoverLease" API that provides fencing.

>>
>>
>> This can happen because ZK can deliver the session lost message late,
>> and there is a race.
>>
>> -ryan
>>
>> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mc...@gmail.com> wrote:
>> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>> >
>> >> > Thanks for the feedback.  So you're inclined to think it would be at
>> the
>> >> dfs
>> >> > layer?
>> >>
>> >> That's where the evidence seems to point.
>> >>
>> >> >
>> >> > Is it accurate to say the most likely places where the data could have
>> >> been
>> >> > lost were:
>> >> > 1. wal writes didn't actually get written to disk (no log entries to
>> >> suggest
>> >> > any issues)
>> >>
>> >> Most likely.
>> >>
>> >> > 2. wal corrupted (no log entries suggest any trouble reading the log)
>> >>
>> >> In that case the logs would scream (and I didn't see that in the logs
>> >> I looked at).
>> >>
>> >> > 3. not all split logs were read by regionservers  (?? is there any way
>> to
>> >> > ensure this either way... should I look at the filesystem some place?)
>> >>
>> >> Some regions would have recovered edits files, but that seems highly
>> >> unlikely. With DEBUG enabled we could have seen which files were split
>> >> by the master and which ones were created for the regions, and then
>> >> which were read by the region servers.
>> >>
>> >> >
>> >> > Do you think the type of network partition I'm talking about is
>> >> adequately
>> >> > covered in existing tests? (Specifically running an external zk
>> cluster?)
>> >>
>> >> The IO fencing was only tested with HDFS, I don't know what happens in
>> >> that case with MapR. What I mean is that when the master splits the
>> >> logs, it takes ownership of the HDFS writer lease (only one per file)
>> >> so that it can safely close the log file. Then after that it checks if
>> >> there are any new log files that were created (the region server could
>> >> have rolled a log while the master was splitting them) and will
>> >> restart if that situation happens until it's able to own all files and
>> >> split them.
>> >>
>> >
>> > JD,   I didn't think the master explicitly dealt with writer leases.
>> >
>> > Does HBase rely on single-writer semantics on the log file? That is, if
>> the
>> > master and a RS both decide to mucky-muck with a log file, you expect the
>> FS
>> > to lock out one of the writers?
>> >
>> >
>> >
>> >
>> >>
>> >> >
>> >> > Have you heard if anyone else is been having problems with the second
>> >> 90.4
>> >> > rc?
>> >>
>> >> Nope, we run it here on our dev cluster and didn't encounter any issue
>> >> (with the code or node failure).
>> >>
>> >> >
>> >> > Thanks again for your help.  I'm following up with the MapR guys as
>> well.
>> >>
>> >> Good idea!
>> >>
>> >> J-D
>> >>
>> >
>>
>
>
>
> --
> Have a Nice Day!
> Lohit
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by lohit <lo...@gmail.com>.
2011/8/4 Ryan Rawson <ry...@gmail.com>

> Yes, that is what JD is referring to, the so-called IO fence.
>
> It works like so:
> - regionserver is appending to an HLog, continues to do so, hasnt
> gotten the ZK "kill yourself signal" yet
> - hmaster splits the logs
> - the hmaster yanks the writer from under the regionserver, and the RS
> then starts to kill itself
>
Can you tell us more about how this is done with HDFS? If the RS has the lease,
how did the master get hold of that lease? Or is it removing the file?

>
>
> This can happen because ZK can deliver the session lost message late,
> and there is a race.
>
> -ryan
>
> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mc...@gmail.com> wrote:
> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> > Thanks for the feedback.  So you're inclined to think it would be at
> the
> >> dfs
> >> > layer?
> >>
> >> That's where the evidence seems to point.
> >>
> >> >
> >> > Is it accurate to say the most likely places where the data could have
> >> been
> >> > lost were:
> >> > 1. wal writes didn't actually get written to disk (no log entries to
> >> suggest
> >> > any issues)
> >>
> >> Most likely.
> >>
> >> > 2. wal corrupted (no log entries suggest any trouble reading the log)
> >>
> >> In that case the logs would scream (and I didn't see that in the logs
> >> I looked at).
> >>
> >> > 3. not all split logs were read by regionservers  (?? is there any way
> to
> >> > ensure this either way... should I look at the filesystem some place?)
> >>
> >> Some regions would have recovered edits files, but that seems highly
> >> unlikely. With DEBUG enabled we could have seen which files were split
> >> by the master and which ones were created for the regions, and then
> >> which were read by the region servers.
> >>
> >> >
> >> > Do you think the type of network partition I'm talking about is
> >> adequately
> >> > covered in existing tests? (Specifically running an external zk
> cluster?)
> >>
> >> The IO fencing was only tested with HDFS, I don't know what happens in
> >> that case with MapR. What I mean is that when the master splits the
> >> logs, it takes ownership of the HDFS writer lease (only one per file)
> >> so that it can safely close the log file. Then after that it checks if
> >> there are any new log files that were created (the region server could
> >> have rolled a log while the master was splitting them) and will
> >> restart if that situation happens until it's able to own all files and
> >> split them.
> >>
> >
> > JD,   I didn't think the master explicitly dealt with writer leases.
> >
> > Does HBase rely on single-writer semantics on the log file? That is, if
> the
> > master and a RS both decide to mucky-muck with a log file, you expect the
> FS
> > to lock out one of the writers?
> >
> >
> >
> >
> >>
> >> >
> >> > Have you heard if anyone else is been having problems with the second
> >> 90.4
> >> > rc?
> >>
> >> Nope, we run it here on our dev cluster and didn't encounter any issue
> >> (with the code or node failure).
> >>
> >> >
> >> > Thanks again for your help.  I'm following up with the MapR guys as
> well.
> >>
> >> Good idea!
> >>
> >> J-D
> >>
> >
>



-- 
Have a Nice Day!
Lohit

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Ryan Rawson <ry...@gmail.com>.
Yes, that is what JD is referring to, the so-called IO fence.

It works like so:
- regionserver is appending to an HLog, continues to do so, hasn't
gotten the ZK "kill yourself signal" yet
- hmaster splits the logs
- the hmaster yanks the writer from under the regionserver, and the RS
then starts to kill itself


This can happen because ZK can deliver the session lost message late,
and there is a race.

-ryan

On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mc...@gmail.com> wrote:
> On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> > Thanks for the feedback.  So you're inclined to think it would be at the
>> dfs
>> > layer?
>>
>> That's where the evidence seems to point.
>>
>> >
>> > Is it accurate to say the most likely places where the data could have
>> been
>> > lost were:
>> > 1. wal writes didn't actually get written to disk (no log entries to
>> suggest
>> > any issues)
>>
>> Most likely.
>>
>> > 2. wal corrupted (no log entries suggest any trouble reading the log)
>>
>> In that case the logs would scream (and I didn't see that in the logs
>> I looked at).
>>
>> > 3. not all split logs were read by regionservers  (?? is there any way to
>> > ensure this either way... should I look at the filesystem some place?)
>>
>> Some regions would have recovered edits files, but that seems highly
>> unlikely. With DEBUG enabled we could have seen which files were split
>> by the master and which ones were created for the regions, and then
>> which were read by the region servers.
>>
>> >
>> > Do you think the type of network partition I'm talking about is
>> adequately
>> > covered in existing tests? (Specifically running an external zk cluster?)
>>
>> The IO fencing was only tested with HDFS, I don't know what happens in
>> that case with MapR. What I mean is that when the master splits the
>> logs, it takes ownership of the HDFS writer lease (only one per file)
>> so that it can safely close the log file. Then after that it checks if
>> there are any new log files that were created (the region server could
>> have rolled a log while the master was splitting them) and will
>> restart if that situation happens until it's able to own all files and
>> split them.
>>
>
> JD,   I didn't think the master explicitly dealt with writer leases.
>
> Does HBase rely on single-writer semantics on the log file? That is, if the
> master and a RS both decide to mucky-muck with a log file, you expect the FS
> to lock out one of the writers?
>
>
>
>
>>
>> >
>> > Have you heard if anyone else is been having problems with the second
>> 90.4
>> > rc?
>>
>> Nope, we run it here on our dev cluster and didn't encounter any issue
>> (with the code or node failure).
>>
>> >
>> > Thanks again for your help.  I'm following up with the MapR guys as well.
>>
>> Good idea!
>>
>> J-D
>>
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by "M. C. Srivas" <mc...@gmail.com>.
On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> > Thanks for the feedback.  So you're inclined to think it would be at the
> dfs
> > layer?
>
> That's where the evidence seems to point.
>
> >
> > Is it accurate to say the most likely places where the data could have
> been
> > lost were:
> > 1. wal writes didn't actually get written to disk (no log entries to
> suggest
> > any issues)
>
> Most likely.
>
> > 2. wal corrupted (no log entries suggest any trouble reading the log)
>
> In that case the logs would scream (and I didn't see that in the logs
> I looked at).
>
> > 3. not all split logs were read by regionservers  (?? is there any way to
> > ensure this either way... should I look at the filesystem some place?)
>
> Some regions would have recovered edits files, but that seems highly
> unlikely. With DEBUG enabled we could have seen which files were split
> by the master and which ones were created for the regions, and then
> which were read by the region servers.
>
> >
> > Do you think the type of network partition I'm talking about is
> adequately
> > covered in existing tests? (Specifically running an external zk cluster?)
>
> The IO fencing was only tested with HDFS, I don't know what happens in
> that case with MapR. What I mean is that when the master splits the
> logs, it takes ownership of the HDFS writer lease (only one per file)
> so that it can safely close the log file. Then after that it checks if
> there are any new log files that were created (the region server could
> have rolled a log while the master was splitting them) and will
> restart if that situation happens until it's able to own all files and
> split them.
>

JD,   I didn't think the master explicitly dealt with writer leases.

Does HBase rely on single-writer semantics on the log file? That is, if the
master and an RS both decide to mucky-muck with a log file, do you expect the
FS to lock out one of the writers?




>
> >
> > Have you heard if anyone else is been having problems with the second
> 90.4
> > rc?
>
> Nope, we run it here on our dev cluster and didn't encounter any issue
> (with the code or node failure).
>
> >
> > Thanks again for your help.  I'm following up with the MapR guys as well.
>
> Good idea!
>
> J-D
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Thanks for the feedback.  So you're inclined to think it would be at the dfs
> layer?

That's where the evidence seems to point.

>
> Is it accurate to say the most likely places where the data could have been
> lost were:
> 1. wal writes didn't actually get written to disk (no log entries to suggest
> any issues)

Most likely.

> 2. wal corrupted (no log entries suggest any trouble reading the log)

In that case the logs would scream (and I didn't see that in the logs
I looked at).

> 3. not all split logs were read by regionservers  (?? is there any way to
> ensure this either way... should I look at the filesystem some place?)

Some regions would have recovered edits files, but that seems highly
unlikely. With DEBUG enabled we could have seen which files were split
by the master and which ones were created for the regions, and then
which were read by the region servers.

>
> Do you think the type of network partition I'm talking about is adequately
> covered in existing tests? (Specifically running an external zk cluster?)

The IO fencing was only tested with HDFS, I don't know what happens in
that case with MapR. What I mean is that when the master splits the
logs, it takes ownership of the HDFS writer lease (only one per file)
so that it can safely close the log file. Then after that it checks if
there are any new log files that were created (the region server could
have rolled a log while the master was splitting them) and will
restart if that situation happens until it's able to own all files and
split them.
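
Schematically, the behaviour described above looks something like this (a
paraphrase, not the actual master/HLog-splitting code; recoverLease() and
splitLog() here stand in for the real steps):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitLoopSketch {
  static void splitDeadServerLogs(FileSystem fs, Path logDir) throws IOException {
    while (true) {
      FileStatus[] before = fs.listStatus(logDir);   // HLogs visible right now
      for (FileStatus log : before) {
        recoverLease(fs, log.getPath());             // take over the writer lease, close the file
        splitLog(fs, log.getPath());                 // hand its edits out per region
      }
      // The region server may have rolled a fresh log while we were splitting;
      // if anything new showed up, go around again until every file is owned
      // and split by the master.
      if (fs.listStatus(logDir).length == before.length) {
        return;
      }
    }
  }

  static void recoverLease(FileSystem fs, Path p) throws IOException { /* see lease discussion */ }
  static void splitLog(FileSystem fs, Path p) throws IOException { /* per-region split */ }
}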

>
> Have you heard if anyone else is been having problems with the second 90.4
> rc?

Nope, we run it here on our dev cluster and didn't encounter any issue
(with the code or node failure).

>
> Thanks again for your help.  I'm following up with the MapR guys as well.

Good idea!

J-D

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jacques <wh...@gmail.com>.
I will take a look and see what I can figure out.

Thanks for your help.

Jacques

On Thu, Aug 4, 2011 at 9:52 AM, Ryan Rawson <ry...@gmail.com> wrote:

> The regionserver logs that talk about the hlog replay might shed some
> light, it should tell you what entries were skipped, etc.  Having a
> look at the hfile structure of the regions, see if there are holes,
> the HFile.main tool can come in handy here, you can run it as:
> hbase org.apache.hadoop.hbase.io.hfile.HFile
>
> it will give you usage.
>
> Mapr might be able to give you audit logs of the time in question,
> that could be useful as well.
>
>
>
> On Thu, Aug 4, 2011 at 9:40 AM, Jacques <wh...@gmail.com> wrote:
> > Do you have any suggestions of things I should look at to confirm/deny
> these
> > possibilities?
> >
> > The tables are very small and inactive (probably only 50-100 rows
> changing
> > per day).
> >
> > Thanks,
> > Jacques
> >
> > On Thu, Aug 4, 2011 at 9:09 AM, Ryan Rawson <ry...@gmail.com> wrote:
> >
> >> Another possibility is the logs were not replayed correctly during the
> >> region startup.  We put in a lot of tests to cover this case, so it
> >> should not be so.
> >>
> >> Essentially the WAL replay looks at the current HFiles state, then
> >> decides which log entries to replay or skip. This is because a log
> >> might have more data than what is strictly missing from the HFiles.
> >>
> >> If the data that is missing is over 6 hours old, that is a very weird
> >> bug, it suggests to me that either an hfile is missing for some
> >> reason, or the WAL replay didnt include some for some reason.
> >>
> >> -ryan
> >>
> >> On Thu, Aug 4, 2011 at 8:38 AM, Jacques <wh...@gmail.com> wrote:
> >> > Thanks for the feedback.  So you're inclined to think it would be at
> the
> >> dfs
> >> > layer?
> >> >
> >> > Is it accurate to say the most likely places where the data could have
> >> been
> >> > lost were:
> >> > 1. wal writes didn't actually get written to disk (no log entries to
> >> suggest
> >> > any issues)
> >> > 2. wal corrupted (no log entries suggest any trouble reading the log)
> >> > 3. not all split logs were read by regionservers  (?? is there any way
> to
> >> > ensure this either way... should I look at the filesystem some place?)
> >> >
> >> > Do you think the type of network partition I'm talking about is
> >> adequately
> >> > covered in existing tests? (Specifically running an external zk
> cluster?)
> >> >
> >> > Have you heard if anyone else is been having problems with the second
> >> 90.4
> >> > rc?
> >> >
> >> > Thanks again for your help.  I'm following up with the MapR guys as
> well.
> >> >
> >> > Jacques
> >> >
> >> > On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> >wrote:
> >> >
> >> >> Hi Jacques,
> >> >>
> >> >> Sorry to hear about that.
> >> >>
> >> >> Regarding MapR, I personally don't have hands-on experience so it's a
> >> >> little bit hard for me to help you. You might want to ping them and
> >> >> ask their opinion (and I know they are watching, Ted? Srivas?)
> >> >>
> >> >> What I can do is telling you if things look normal from the HBase
> >> >> point of view, but I see you're not running with DEBUG so I might
> miss
> >> >> some information.
> >> >>
> >> >> Looking at the master log, it tells us that it was able to split the
> >> >> logs correctly.
> >> >>
> >> >> Looking at a few regionserver logs, it doesn't seem to say that it
> had
> >> >> issues replaying the logs so that's good too.
> >> >>
> >> >> About the memstore questions, it's almost purely size-based (64MB). I
> >> >> say almost because we limit the number of WALs a regionserver can
> >> >> carry so that when it reaches that limit it force flushes the
> >> >> memstores with older edits. There's also a thread that rolls the
> >> >> latest log if it's more than an hour old, so in the extreme case it
> >> >> could take 32 hours for an edit in the memstore to make it to a
> >> >> StoreFile. It used to be that without appends rolling those files
> >> >> often would prevent losses older than 1 hour, but I haven't seen
> those
> >> >> issues since we started using appends. But you're not using HDFS, and
> >> >> I don't have MapR experience, so I can't really go any further...
> >> >>
> >> >> J-D
> >> >>
> >> >> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
> >> >> > Given the hardy reviews and timing, we recently shifted from 90.3
> >> >> (apache)
> >> >> > to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4,
> r1150278).
> >> >> >
> >> >> > We had a network switch go down last night which caused an apparent
> >> >> network
> >> >> > partition between two of our region servers and one or more zk
> nodes.
> >> >> >  (We're still piecing together the situation).  Anyway, things
> >> *seemed*
> >> >> to
> >> >> > recover fine.  However, this morning we realized that we lost some
> >> data
> >> >> that
> >> >> > was generated just before the problems occurred.
> >> >> >
> >> >> > It looks like h002 went down nearly immediately at around 8pm while
> >> h001
> >> >> > didn't go down until around 8:10pm (somewhat confused by this).
>  We're
> >> >> > thinking that this may have contributed to the problem.  The
> >> particular
> >> >> > table that had data issues is a very small table with a single
> region
> >> >> that
> >> >> > was running on h002 when it went down.
> >> >> >
> >> >> > We know the corruption/lack of edits affected two tables.  It
> extended
> >> >> > across a number of rows and actually appears to reach back up to
> data
> >> >> > inserted 6 hours earlier (estimate).  The two tables we can verify
> >> errors
> >> >> on
> >> >> > are each probably at most 10-20k <1k rows.  Some places rows that
> were
> >> >> added
> >> >> > are completely missing and some just had missing cell edits.  As an
> >> >> aside, I
> >> >> > was thinking there was a time based memstore flush in addition to a
> >> size
> >> >> > one.  But upon reviewing the hbase default configuration, I don't
> see
> >> >> > mention of it.  Is this purely size based?
> >> >> >
> >> >> > We don't have the tools in place to verify exactly what other data
> or
> >> >> tables
> >> >> > may have been impacted.
> >> >> >
> >> >> > The log files are at the paste bin links below.  The whole cluster
> is
> >> 8
> >> >> > nodes + master, 3 zk nodes running on separate machines.  We run
> with
> >> >> mostly
> >> >> > standard settings but do have the following settings:
> >> >> > heap: 12gb
> >> >> > regionsize 4gb, (due to lots of cold data and not enough servers,
> avg
> >> 300
> >> >> > regions/server)
> >> >> > mslab: 4m/512k (due to somewhat frequent updates to larger objects
> in
> >> the
> >> >> > 200-500k size range)
> >> >> >
> >> >> > We've been using hbase for about a year now and have been nothing
> but
> >> >> happy
> >> >> > with it.  The failure state that we had last night (where only some
> >> >> region
> >> >> > servers cannot talk to some zk servers) seems like a strange one.
> >> >> >
> >> >> > Any thoughts? (beyond chiding for switching to a rc)    Any
> opinions
> >> >> whether
> >> >> > we should we roll back to 90.3 (or 90.3+cloudera)?
> >> >> >
> >> >> > Thanks for any help,
> >> >> > Jacques
> >> >> >
> >> >> > master: http://pastebin.com/aG8fm2KZ
> >> >> > h001: http://pastebin.com/nLLk06EC
> >> >> > h002: http://pastebin.com/0wPFuZDx
> >> >> > h003: http://pastebin.com/3ZMV01mA
> >> >> > h004: http://pastebin.com/0YVefuqS
> >> >> > h005: http://pastebin.com/N90LDjvs
> >> >> > h006: http://pastebin.com/gM8umekW
> >> >> > h007: http://pastebin.com/0TVvX68d
> >> >> > h008: http://pastebin.com/mV968Cem
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Ryan Rawson <ry...@gmail.com>.
The regionserver logs that talk about the hlog replay might shed some
light; they should tell you which entries were skipped, etc.  It's also
worth having a look at the hfile structure of the regions to see if there
are holes; the HFile.main tool can come in handy here.  You can run it as:
hbase org.apache.hadoop.hbase.io.hfile.HFile

and it will print its usage.

MapR might be able to give you audit logs for the time in question;
those could be useful as well.
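
For the HFile tool, a typical invocation looks something like the line below
(the flag names are from memory of the tool's usage output and the path is
just a placeholder; run the bare command above to confirm the options in
0.90.4):

hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -p -f /hbase/MyTable/<region>/<family>/<hfile>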



On Thu, Aug 4, 2011 at 9:40 AM, Jacques <wh...@gmail.com> wrote:
> Do you have any suggestions of things I should look at to confirm/deny these
> possibilities?
>
> The tables are very small and inactive (probably only 50-100 rows changing
> per day).
>
> Thanks,
> Jacques
>
> On Thu, Aug 4, 2011 at 9:09 AM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Another possibility is the logs were not replayed correctly during the
>> region startup.  We put in a lot of tests to cover this case, so it
>> should not be so.
>>
>> Essentially the WAL replay looks at the current HFiles state, then
>> decides which log entries to replay or skip. This is because a log
>> might have more data than what is strictly missing from the HFiles.
>>
>> If the data that is missing is over 6 hours old, that is a very weird
>> bug, it suggests to me that either an hfile is missing for some
>> reason, or the WAL replay didnt include some for some reason.
>>
>> -ryan
>>
>> On Thu, Aug 4, 2011 at 8:38 AM, Jacques <wh...@gmail.com> wrote:
>> > Thanks for the feedback.  So you're inclined to think it would be at the
>> dfs
>> > layer?
>> >
>> > Is it accurate to say the most likely places where the data could have
>> been
>> > lost were:
>> > 1. wal writes didn't actually get written to disk (no log entries to
>> suggest
>> > any issues)
>> > 2. wal corrupted (no log entries suggest any trouble reading the log)
>> > 3. not all split logs were read by regionservers  (?? is there any way to
>> > ensure this either way... should I look at the filesystem some place?)
>> >
>> > Do you think the type of network partition I'm talking about is
>> adequately
>> > covered in existing tests? (Specifically running an external zk cluster?)
>> >
>> > Have you heard if anyone else is been having problems with the second
>> 90.4
>> > rc?
>> >
>> > Thanks again for your help.  I'm following up with the MapR guys as well.
>> >
>> > Jacques
>> >
>> > On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>> >
>> >> Hi Jacques,
>> >>
>> >> Sorry to hear about that.
>> >>
>> >> Regarding MapR, I personally don't have hands-on experience so it's a
>> >> little bit hard for me to help you. You might want to ping them and
>> >> ask their opinion (and I know they are watching, Ted? Srivas?)
>> >>
>> >> What I can do is telling you if things look normal from the HBase
>> >> point of view, but I see you're not running with DEBUG so I might miss
>> >> some information.
>> >>
>> >> Looking at the master log, it tells us that it was able to split the
>> >> logs correctly.
>> >>
>> >> Looking at a few regionserver logs, it doesn't seem to say that it had
>> >> issues replaying the logs so that's good too.
>> >>
>> >> About the memstore questions, it's almost purely size-based (64MB). I
>> >> say almost because we limit the number of WALs a regionserver can
>> >> carry so that when it reaches that limit it force flushes the
>> >> memstores with older edits. There's also a thread that rolls the
>> >> latest log if it's more than an hour old, so in the extreme case it
>> >> could take 32 hours for an edit in the memstore to make it to a
>> >> StoreFile. It used to be that without appends rolling those files
>> >> often would prevent losses older than 1 hour, but I haven't seen those
>> >> issues since we started using appends. But you're not using HDFS, and
>> >> I don't have MapR experience, so I can't really go any further...
>> >>
>> >> J-D
>> >>
>> >> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
>> >> > Given the hardy reviews and timing, we recently shifted from 90.3
>> >> (apache)
>> >> > to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).
>> >> >
>> >> > We had a network switch go down last night which caused an apparent
>> >> network
>> >> > partition between two of our region servers and one or more zk nodes.
>> >> >  (We're still piecing together the situation).  Anyway, things
>> *seemed*
>> >> to
>> >> > recover fine.  However, this morning we realized that we lost some
>> data
>> >> that
>> >> > was generated just before the problems occurred.
>> >> >
>> >> > It looks like h002 went down nearly immediately at around 8pm while
>> h001
>> >> > didn't go down until around 8:10pm (somewhat confused by this).  We're
>> >> > thinking that this may have contributed to the problem.  The
>> particular
>> >> > table that had data issues is a very small table with a single region
>> >> that
>> >> > was running on h002 when it went down.
>> >> >
>> >> > We know the corruption/lack of edits affected two tables.  It extended
>> >> > across a number of rows and actually appears to reach back up to data
>> >> > inserted 6 hours earlier (estimate).  The two tables we can verify
>> errors
>> >> on
>> >> > are each probably at most 10-20k <1k rows.  Some places rows that were
>> >> added
>> >> > are completely missing and some just had missing cell edits.  As an
>> >> aside, I
>> >> > was thinking there was a time based memstore flush in addition to a
>> size
>> >> > one.  But upon reviewing the hbase default configuration, I don't see
>> >> > mention of it.  Is this purely size based?
>> >> >
>> >> > We don't have the tools in place to verify exactly what other data or
>> >> tables
>> >> > may have been impacted.
>> >> >
>> >> > The log files are at the paste bin links below.  The whole cluster is
>> 8
>> >> > nodes + master, 3 zk nodes running on separate machines.  We run with
>> >> mostly
>> >> > standard settings but do have the following settings:
>> >> > heap: 12gb
>> >> > regionsize 4gb, (due to lots of cold data and not enough servers, avg
>> 300
>> >> > regions/server)
>> >> > mslab: 4m/512k (due to somewhat frequent updates to larger objects in
>> the
>> >> > 200-500k size range)
>> >> >
>> >> > We've been using hbase for about a year now and have been nothing but
>> >> happy
>> >> > with it.  The failure state that we had last night (where only some
>> >> region
>> >> > servers cannot talk to some zk servers) seems like a strange one.
>> >> >
>> >> > Any thoughts? (beyond chiding for switching to a rc)    Any opinions
>> >> whether
>> >> > we should we roll back to 90.3 (or 90.3+cloudera)?
>> >> >
>> >> > Thanks for any help,
>> >> > Jacques
>> >> >
>> >> > master: http://pastebin.com/aG8fm2KZ
>> >> > h001: http://pastebin.com/nLLk06EC
>> >> > h002: http://pastebin.com/0wPFuZDx
>> >> > h003: http://pastebin.com/3ZMV01mA
>> >> > h004: http://pastebin.com/0YVefuqS
>> >> > h005: http://pastebin.com/N90LDjvs
>> >> > h006: http://pastebin.com/gM8umekW
>> >> > h007: http://pastebin.com/0TVvX68d
>> >> > h008: http://pastebin.com/mV968Cem
>> >> >
>> >>
>> >
>>
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jacques <wh...@gmail.com>.
Do you have any suggestions of things I should look at to confirm/deny these
possibilities?

The tables are very small and inactive (probably only 50-100 rows changing
per day).

Thanks,
Jacques

On Thu, Aug 4, 2011 at 9:09 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Another possibility is the logs were not replayed correctly during the
> region startup.  We put in a lot of tests to cover this case, so it
> should not be so.
>
> Essentially the WAL replay looks at the current HFiles state, then
> decides which log entries to replay or skip. This is because a log
> might have more data than what is strictly missing from the HFiles.
>
> If the data that is missing is over 6 hours old, that is a very weird
> bug, it suggests to me that either an hfile is missing for some
> reason, or the WAL replay didnt include some for some reason.
>
> -ryan
>
> On Thu, Aug 4, 2011 at 8:38 AM, Jacques <wh...@gmail.com> wrote:
> > Thanks for the feedback.  So you're inclined to think it would be at the
> dfs
> > layer?
> >
> > Is it accurate to say the most likely places where the data could have
> been
> > lost were:
> > 1. wal writes didn't actually get written to disk (no log entries to
> suggest
> > any issues)
> > 2. wal corrupted (no log entries suggest any trouble reading the log)
> > 3. not all split logs were read by regionservers  (?? is there any way to
> > ensure this either way... should I look at the filesystem some place?)
> >
> > Do you think the type of network partition I'm talking about is
> adequately
> > covered in existing tests? (Specifically running an external zk cluster?)
> >
> > Have you heard if anyone else is been having problems with the second
> 90.4
> > rc?
> >
> > Thanks again for your help.  I'm following up with the MapR guys as well.
> >
> > Jacques
> >
> > On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> Hi Jacques,
> >>
> >> Sorry to hear about that.
> >>
> >> Regarding MapR, I personally don't have hands-on experience so it's a
> >> little bit hard for me to help you. You might want to ping them and
> >> ask their opinion (and I know they are watching, Ted? Srivas?)
> >>
> >> What I can do is telling you if things look normal from the HBase
> >> point of view, but I see you're not running with DEBUG so I might miss
> >> some information.
> >>
> >> Looking at the master log, it tells us that it was able to split the
> >> logs correctly.
> >>
> >> Looking at a few regionserver logs, it doesn't seem to say that it had
> >> issues replaying the logs so that's good too.
> >>
> >> About the memstore questions, it's almost purely size-based (64MB). I
> >> say almost because we limit the number of WALs a regionserver can
> >> carry so that when it reaches that limit it force flushes the
> >> memstores with older edits. There's also a thread that rolls the
> >> latest log if it's more than an hour old, so in the extreme case it
> >> could take 32 hours for an edit in the memstore to make it to a
> >> StoreFile. It used to be that without appends rolling those files
> >> often would prevent losses older than 1 hour, but I haven't seen those
> >> issues since we started using appends. But you're not using HDFS, and
> >> I don't have MapR experience, so I can't really go any further...
> >>
> >> J-D
> >>
> >> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
> >> > Given the hardy reviews and timing, we recently shifted from 90.3
> >> (apache)
> >> > to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).
> >> >
> >> > We had a network switch go down last night which caused an apparent
> >> network
> >> > partition between two of our region servers and one or more zk nodes.
> >> >  (We're still piecing together the situation).  Anyway, things
> *seemed*
> >> to
> >> > recover fine.  However, this morning we realized that we lost some
> data
> >> that
> >> > was generated just before the problems occurred.
> >> >
> >> > It looks like h002 went down nearly immediately at around 8pm while
> h001
> >> > didn't go down until around 8:10pm (somewhat confused by this).  We're
> >> > thinking that this may have contributed to the problem.  The
> particular
> >> > table that had data issues is a very small table with a single region
> >> that
> >> > was running on h002 when it went down.
> >> >
> >> > We know the corruption/lack of edits affected two tables.  It extended
> >> > across a number of rows and actually appears to reach back up to data
> >> > inserted 6 hours earlier (estimate).  The two tables we can verify
> errors
> >> on
> >> > are each probably at most 10-20k <1k rows.  Some places rows that were
> >> added
> >> > are completely missing and some just had missing cell edits.  As an
> >> aside, I
> >> > was thinking there was a time based memstore flush in addition to a
> size
> >> > one.  But upon reviewing the hbase default configuration, I don't see
> >> > mention of it.  Is this purely size based?
> >> >
> >> > We don't have the tools in place to verify exactly what other data or
> >> tables
> >> > may have been impacted.
> >> >
> >> > The log files are at the paste bin links below.  The whole cluster is
> 8
> >> > nodes + master, 3 zk nodes running on separate machines.  We run with
> >> mostly
> >> > standard settings but do have the following settings:
> >> > heap: 12gb
> >> > regionsize 4gb, (due to lots of cold data and not enough servers, avg
> 300
> >> > regions/server)
> >> > mslab: 4m/512k (due to somewhat frequent updates to larger objects in
> the
> >> > 200-500k size range)
> >> >
> >> > We've been using hbase for about a year now and have been nothing but
> >> happy
> >> > with it.  The failure state that we had last night (where only some
> >> region
> >> > servers cannot talk to some zk servers) seems like a strange one.
> >> >
> >> > Any thoughts? (beyond chiding for switching to a rc)    Any opinions
> >> whether
> >> > we should we roll back to 90.3 (or 90.3+cloudera)?
> >> >
> >> > Thanks for any help,
> >> > Jacques
> >> >
> >> > master: http://pastebin.com/aG8fm2KZ
> >> > h001: http://pastebin.com/nLLk06EC
> >> > h002: http://pastebin.com/0wPFuZDx
> >> > h003: http://pastebin.com/3ZMV01mA
> >> > h004: http://pastebin.com/0YVefuqS
> >> > h005: http://pastebin.com/N90LDjvs
> >> > h006: http://pastebin.com/gM8umekW
> >> > h007: http://pastebin.com/0TVvX68d
> >> > h008: http://pastebin.com/mV968Cem
> >> >
> >>
> >
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Ryan Rawson <ry...@gmail.com>.
Another possibility is that the logs were not replayed correctly during
region startup.  We put in a lot of tests to cover this case, so it
should not be so.

Essentially the WAL replay looks at the current HFiles state, then
decides which log entries to replay or skip. This is because a log
might have more data than what is strictly missing from the HFiles.
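
The replay/skip decision boils down to a sequence-id comparison, roughly as
in the schematic below (all names here are placeholders, not the actual HBase
classes): edits whose sequence id is at or below the highest id already
persisted in the region's HFiles are skipped, everything newer is replayed.

// Schematic only -- placeholder names, not real HBase types.
long maxFlushedSeqId = readMaxSeqIdFromStoreFiles(region);  // highest id already in HFiles
for (LoggedEdit edit : recoveredEdits) {
  if (edit.sequenceId() <= maxFlushedSeqId) {
    continue;                       // already durable in an HFile -> skip
  }
  applyToMemstore(region, edit);    // missing from HFiles -> replay
}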

If the data that is missing is over 6 hours old, that is a very weird
bug; it suggests to me that either an hfile is missing for some
reason, or the WAL replay didn't include some edits for some reason.

-ryan

On Thu, Aug 4, 2011 at 8:38 AM, Jacques <wh...@gmail.com> wrote:
> Thanks for the feedback.  So you're inclined to think it would be at the dfs
> layer?
>
> Is it accurate to say the most likely places where the data could have been
> lost were:
> 1. wal writes didn't actually get written to disk (no log entries to suggest
> any issues)
> 2. wal corrupted (no log entries suggest any trouble reading the log)
> 3. not all split logs were read by regionservers  (?? is there any way to
> ensure this either way... should I look at the filesystem some place?)
>
> Do you think the type of network partition I'm talking about is adequately
> covered in existing tests? (Specifically running an external zk cluster?)
>
> Have you heard if anyone else is been having problems with the second 90.4
> rc?
>
> Thanks again for your help.  I'm following up with the MapR guys as well.
>
> Jacques
>
> On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Hi Jacques,
>>
>> Sorry to hear about that.
>>
>> Regarding MapR, I personally don't have hands-on experience so it's a
>> little bit hard for me to help you. You might want to ping them and
>> ask their opinion (and I know they are watching, Ted? Srivas?)
>>
>> What I can do is telling you if things look normal from the HBase
>> point of view, but I see you're not running with DEBUG so I might miss
>> some information.
>>
>> Looking at the master log, it tells us that it was able to split the
>> logs correctly.
>>
>> Looking at a few regionserver logs, it doesn't seem to say that it had
>> issues replaying the logs so that's good too.
>>
>> About the memstore questions, it's almost purely size-based (64MB). I
>> say almost because we limit the number of WALs a regionserver can
>> carry so that when it reaches that limit it force flushes the
>> memstores with older edits. There's also a thread that rolls the
>> latest log if it's more than an hour old, so in the extreme case it
>> could take 32 hours for an edit in the memstore to make it to a
>> StoreFile. It used to be that without appends rolling those files
>> often would prevent losses older than 1 hour, but I haven't seen those
>> issues since we started using appends. But you're not using HDFS, and
>> I don't have MapR experience, so I can't really go any further...
>>
>> J-D
>>
>> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
>> > Given the hardy reviews and timing, we recently shifted from 90.3
>> (apache)
>> > to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).
>> >
>> > We had a network switch go down last night which caused an apparent
>> network
>> > partition between two of our region servers and one or more zk nodes.
>> >  (We're still piecing together the situation).  Anyway, things *seemed*
>> to
>> > recover fine.  However, this morning we realized that we lost some data
>> that
>> > was generated just before the problems occurred.
>> >
>> > It looks like h002 went down nearly immediately at around 8pm while h001
>> > didn't go down until around 8:10pm (somewhat confused by this).  We're
>> > thinking that this may have contributed to the problem.  The particular
>> > table that had data issues is a very small table with a single region
>> that
>> > was running on h002 when it went down.
>> >
>> > We know the corruption/lack of edits affected two tables.  It extended
>> > across a number of rows and actually appears to reach back up to data
>> > inserted 6 hours earlier (estimate).  The two tables we can verify errors
>> on
>> > are each probably at most 10-20k <1k rows.  Some places rows that were
>> added
>> > are completely missing and some just had missing cell edits.  As an
>> aside, I
>> > was thinking there was a time based memstore flush in addition to a size
>> > one.  But upon reviewing the hbase default configuration, I don't see
>> > mention of it.  Is this purely size based?
>> >
>> > We don't have the tools in place to verify exactly what other data or
>> tables
>> > may have been impacted.
>> >
>> > The log files are at the paste bin links below.  The whole cluster is 8
>> > nodes + master, 3 zk nodes running on separate machines.  We run with
>> mostly
>> > standard settings but do have the following settings:
>> > heap: 12gb
>> > regionsize 4gb, (due to lots of cold data and not enough servers, avg 300
>> > regions/server)
>> > mslab: 4m/512k (due to somewhat frequent updates to larger objects in the
>> > 200-500k size range)
>> >
>> > We've been using hbase for about a year now and have been nothing but
>> happy
>> > with it.  The failure state that we had last night (where only some
>> region
>> > servers cannot talk to some zk servers) seems like a strange one.
>> >
>> > Any thoughts? (beyond chiding for switching to a rc)    Any opinions
>> whether
>> > we should we roll back to 90.3 (or 90.3+cloudera)?
>> >
>> > Thanks for any help,
>> > Jacques
>> >
>> > master: http://pastebin.com/aG8fm2KZ
>> > h001: http://pastebin.com/nLLk06EC
>> > h002: http://pastebin.com/0wPFuZDx
>> > h003: http://pastebin.com/3ZMV01mA
>> > h004: http://pastebin.com/0YVefuqS
>> > h005: http://pastebin.com/N90LDjvs
>> > h006: http://pastebin.com/gM8umekW
>> > h007: http://pastebin.com/0TVvX68d
>> > h008: http://pastebin.com/mV968Cem
>> >
>>
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jacques <wh...@gmail.com>.
Thanks for the feedback.  So you're inclined to think it would be at the dfs
layer?

Is it accurate to say the most likely places where the data could have been
lost were:
1. WAL writes didn't actually get written to disk (no log entries suggest
any issues)
2. the WAL was corrupted (no log entries suggest any trouble reading the log)
3. not all split logs were read by regionservers (is there any way to verify
this either way? should I look somewhere on the filesystem? a rough check is
sketched below)
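
For point 3, one rough filesystem-side check would be to list any leftover
split-log output under the region directories. This is only a sketch: the
/hbase root and the <table>/<region>/recovered.edits layout are assumptions
about how a 0.90-style layout looks, so adjust the glob for your
hbase.rootdir and version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough check for split-log output that was never replayed: after a clean
// recovery the per-region recovered.edits directories should be empty or
// gone. The path pattern below is an assumption, not a guaranteed layout.
public class FindLeftoverSplitLogs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] leftovers =
            fs.globStatus(new Path("/hbase/*/*/recovered.edits/*"));
        if (leftovers == null || leftovers.length == 0) {
            System.out.println("No unreplayed recovered.edits files found.");
            return;
        }
        for (FileStatus f : leftovers) {
            System.out.println("Unreplayed edits file: " + f.getPath());
        }
    }
}

If that turned up files, it would suggest some edits were split out but never
replayed when the regions reopened.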

Do you think the type of network partition I'm talking about is adequately
covered in existing tests? (Specifically running an external zk cluster?)

Have you heard whether anyone else has been having problems with the second
90.4 rc?

Thanks again for your help.  I'm following up with the MapR guys as well.

Jacques

On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Hi Jacques,
>
> Sorry to hear about that.
>
> Regarding MapR, I personally don't have hands-on experience so it's a
> little bit hard for me to help you. You might want to ping them and
> ask their opinion (and I know they are watching, Ted? Srivas?)
>
> What I can do is telling you if things look normal from the HBase
> point of view, but I see you're not running with DEBUG so I might miss
> some information.
>
> Looking at the master log, it tells us that it was able to split the
> logs correctly.
>
> Looking at a few regionserver logs, it doesn't seem to say that it had
> issues replaying the logs so that's good too.
>
> About the memstore questions, it's almost purely size-based (64MB). I
> say almost because we limit the number of WALs a regionserver can
> carry so that when it reaches that limit it force flushes the
> memstores with older edits. There's also a thread that rolls the
> latest log if it's more than an hour old, so in the extreme case it
> could take 32 hours for an edit in the memstore to make it to a
> StoreFile. It used to be that without appends rolling those files
> often would prevent losses older than 1 hour, but I haven't seen those
> issues since we started using appends. But you're not using HDFS, and
> I don't have MapR experience, so I can't really go any further...
>
> J-D
>
> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
> > Given the hardy reviews and timing, we recently shifted from 90.3
> (apache)
> > to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).
> >
> > We had a network switch go down last night which caused an apparent
> network
> > partition between two of our region servers and one or more zk nodes.
> >  (We're still piecing together the situation).  Anyway, things *seemed*
> to
> > recover fine.  However, this morning we realized that we lost some data
> that
> > was generated just before the problems occurred.
> >
> > It looks like h002 went down nearly immediately at around 8pm while h001
> > didn't go down until around 8:10pm (somewhat confused by this).  We're
> > thinking that this may have contributed to the problem.  The particular
> > table that had data issues is a very small table with a single region
> that
> > was running on h002 when it went down.
> >
> > We know the corruption/lack of edits affected two tables.  It extended
> > across a number of rows and actually appears to reach back up to data
> > inserted 6 hours earlier (estimate).  The two tables we can verify errors
> on
> > are each probably at most 10-20k <1k rows.  Some places rows that were
> added
> > are completely missing and some just had missing cell edits.  As an
> aside, I
> > was thinking there was a time based memstore flush in addition to a size
> > one.  But upon reviewing the hbase default configuration, I don't see
> > mention of it.  Is this purely size based?
> >
> > We don't have the tools in place to verify exactly what other data or
> tables
> > may have been impacted.
> >
> > The log files are at the paste bin links below.  The whole cluster is 8
> > nodes + master, 3 zk nodes running on separate machines.  We run with
> mostly
> > standard settings but do have the following settings:
> > heap: 12gb
> > regionsize 4gb, (due to lots of cold data and not enough servers, avg 300
> > regions/server)
> > mslab: 4m/512k (due to somewhat frequent updates to larger objects in the
> > 200-500k size range)
> >
> > We've been using hbase for about a year now and have been nothing but
> happy
> > with it.  The failure state that we had last night (where only some
> region
> > servers cannot talk to some zk servers) seems like a strange one.
> >
> > Any thoughts? (beyond chiding for switching to a rc)    Any opinions
> whether
> > we should we roll back to 90.3 (or 90.3+cloudera)?
> >
> > Thanks for any help,
> > Jacques
> >
> > master: http://pastebin.com/aG8fm2KZ
> > h001: http://pastebin.com/nLLk06EC
> > h002: http://pastebin.com/0wPFuZDx
> > h003: http://pastebin.com/3ZMV01mA
> > h004: http://pastebin.com/0YVefuqS
> > h005: http://pastebin.com/N90LDjvs
> > h006: http://pastebin.com/gM8umekW
> > h007: http://pastebin.com/0TVvX68d
> > h008: http://pastebin.com/mV968Cem
> >
>

Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hi Jacques,

Sorry to hear about that.

Regarding MapR, I personally don't have hands-on experience so it's a
little bit hard for me to help you. You might want to ping them and
ask their opinion (and I know they are watching, Ted? Srivas?)

What I can do is tell you whether things look normal from the HBase point
of view, but I see you're not running with DEBUG logging so I might miss
some information.

Looking at the master log, it shows that the master was able to split the
logs correctly.

Looking at a few regionserver logs, none of them seem to report issues
replaying the logs, so that's good too.

About the memstore questions, it's almost purely size-based (64MB). I say
almost because we limit the number of WALs a regionserver can carry, so
that when it reaches that limit it force-flushes the memstores with the
oldest edits. There's also a thread that rolls the latest log if it's more
than an hour old, so in the extreme case (the default limit is 32 logs,
rolled hourly) it could take 32 hours for an edit in the memstore to make
it to a StoreFile. It used to be that, without appends, rolling those files
often would prevent losses older than 1 hour, but I haven't seen those
issues since we started using appends. But you're not using HDFS, and I
don't have MapR experience, so I can't really go any further...
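
If it helps, here is a small sketch that prints the settings behind that
behaviour, using the property names from hbase-default.xml; the fallback
values below are what I believe the 0.90 defaults to be, so whatever is in
your hbase-site.xml wins:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Prints the flush-related settings described above. The literal fallback
// values are assumed 0.90-era defaults; the loaded configuration overrides them.
public class FlushTriggerSettings {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 67108864L); // 64MB size trigger
        int maxLogs = conf.getInt("hbase.regionserver.maxlogs", 32);                   // WAL-count trigger
        long rollMs = conf.getLong("hbase.regionserver.logroll.period", 3600000L);     // hourly log roll
        System.out.println("Memstore flushes at ~" + flushSize + " bytes");
        System.out.println("Oldest memstores force-flushed after " + maxLogs + " WALs");
        System.out.println("Worst case for an edit to reach a StoreFile: ~"
            + (maxLogs * rollMs / 3600000L) + " hours");
    }
}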

J-D

On Tue, Aug 2, 2011 at 3:44 PM, Jacques <wh...@gmail.com> wrote:
> Given the hardy reviews and timing, we recently shifted from 90.3 (apache)
> to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4, r1150278).
>
> We had a network switch go down last night which caused an apparent network
> partition between two of our region servers and one or more zk nodes.
>  (We're still piecing together the situation).  Anyway, things *seemed* to
> recover fine.  However, this morning we realized that we lost some data that
> was generated just before the problems occurred.
>
> It looks like h002 went down nearly immediately at around 8pm while h001
> didn't go down until around 8:10pm (somewhat confused by this).  We're
> thinking that this may have contributed to the problem.  The particular
> table that had data issues is a very small table with a single region that
> was running on h002 when it went down.
>
> We know the corruption/lack of edits affected two tables.  It extended
> across a number of rows and actually appears to reach back up to data
> inserted 6 hours earlier (estimate).  The two tables we can verify errors on
> are each probably at most 10-20k <1k rows.  Some places rows that were added
> are completely missing and some just had missing cell edits.  As an aside, I
> was thinking there was a time based memstore flush in addition to a size
> one.  But upon reviewing the hbase default configuration, I don't see
> mention of it.  Is this purely size based?
>
> We don't have the tools in place to verify exactly what other data or tables
> may have been impacted.
>
> The log files are at the paste bin links below.  The whole cluster is 8
> nodes + master, 3 zk nodes running on separate machines.  We run with mostly
> standard settings but do have the following settings:
> heap: 12gb
> regionsize 4gb, (due to lots of cold data and not enough servers, avg 300
> regions/server)
> mslab: 4m/512k (due to somewhat frequent updates to larger objects in the
> 200-500k size range)
>
> We've been using hbase for about a year now and have been nothing but happy
> with it.  The failure state that we had last night (where only some region
> servers cannot talk to some zk servers) seems like a strange one.
>
> Any thoughts? (beyond chiding for switching to a rc)    Any opinions whether
> we should we roll back to 90.3 (or 90.3+cloudera)?
>
> Thanks for any help,
> Jacques
>
> master: http://pastebin.com/aG8fm2KZ
> h001: http://pastebin.com/nLLk06EC
> h002: http://pastebin.com/0wPFuZDx
> h003: http://pastebin.com/3ZMV01mA
> h004: http://pastebin.com/0YVefuqS
> h005: http://pastebin.com/N90LDjvs
> h006: http://pastebin.com/gM8umekW
> h007: http://pastebin.com/0TVvX68d
> h008: http://pastebin.com/mV968Cem
>