Posted to issues@hbase.apache.org by "Sean Busbey (JIRA)" <ji...@apache.org> on 2018/06/13 19:02:00 UTC

[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

    [ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511552#comment-16511552 ] 

Sean Busbey edited comment on HBASE-20723 at 6/13/18 7:01 PM:
--------------------------------------------------------------

{quote}
The one thing that one of my colleagues figured out recently is that edits aren't actually persisted to the WAL until they either reach a certain size or a time limit has elapsed that triggers the hsync() or hflush(). Since the VM didn't exit correctly, I'm assuming this is what happened. Can you try loading more data in (still under the flush size/interval), but enough to cause a hsync to the WAL file and see if you have the same issue?
{quote}

This isn't supposed to be the case, though. We're not supposed to return an "OK" to the client doing the write until we've completed an hflush. That won't help if the underlying HDFS nodes all fail. Is the WAL HDFS instance set up to have 3 replicas per block?

(Edit to replace hsync with hflush)


was (Author: busbey):
{quote}
The one thing that one of my colleagues figured out recently is that edits aren't actually persisted to the WAL until they either reach a certain size or a time limit has elapsed that triggers the hsync() or hflush(). Since the VM didn't exit correctly, I'm assuming this is what happened. Can you try loading more data in (still under the flush size/interval), but enough to cause a hsync to the WAL file and see if you have the same issue?
{quote}

This isn't supposed to be the case, though. We're not supposed to return an "OK" to the client doing the write until we've completed an hsync. That won't help if the underlying HDFS nodes all fail. Is the WAL HDFS instance set up to have 3 replicas per block?

> WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-20723
>                 URL: https://issues.apache.org/jira/browse/HBASE-20723
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>    Affects Versions: 1.1.2
>            Reporter: Rohan Pednekar
>            Priority: Major
>         Attachments: logs.zip
>
>
> This is an Azure HDInsight HBase cluster with HDP 2.6. and HBase 1.1.2.2.6.3.2-14 
> By default the underlying data is going to wasb://xxxxx@yyyyy/hbase 
> I tried to move the WAL folders to HDFS, which is backed by the SSD mounted on each VM at /mnt.
> hbase.wal.dir= hdfs://mycluster/walontest
> hbase.wal.dir.perms=700
> hbase.rootdir.perms=700
> hbase.rootdir= wasb://XYZ@hbaseperf.core.net/hbase
> Procedure to reproduce this issue:
> 1. create a table in hbase shell
> 2. insert a row in hbase shell
> 3. reboot the VM which hosts that region
> 4. scan the table in hbase shell and it is empty
> Looking at the region server logs:
> {code:java}
> 2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] wal.WALSplitter: This region's directory doesn't exist: hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. It is very likely that it was already split so it's safe to discard those edits.
> {code}
> The log split/replay ignored the actual WAL because WALSplitter looks for the region directory under the hbase.wal.dir we specified rather than under hbase.rootdir.
> Looking at the source code,
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java it uses the rootDir, which is walDir, as the tableDir root path.
> So if we use HBASE-17437 and the WAL dir and the HBase root dir are on different paths, or even on different filesystems, then #5 using walDir as tableDir is apparently wrong.
> CC: [~zyork], [~yuzhihong@gmail.com] Attached the logs for quick review.
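
To make the path mix-up in the description concrete, here is a minimal sketch in plain Java. It is not the WALSplitter code: the class RegionDirSketch and its helpers regionDir and sameFileSystem are hypothetical names introduced only for illustration, and the only assumption about HBase is the <base>/data/<namespace>/<table>/<region> layout visible in the log line above. The directory values are the ones given in the report.

```java
import java.net.URI;
import java.util.Objects;

public class RegionDirSketch {

    // Hypothetical helper: builds <base>/data/<namespace>/<table>/<encodedRegion>,
    // the layout that appears in the WALSplitter log line from the report.
    public static String regionDir(String baseDir, String namespace,
                                   String table, String encodedRegion) {
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/data/" + namespace + "/" + table + "/" + encodedRegion;
    }

    // Hypothetical helper: two locations share a filesystem here if their
    // URI scheme and authority both match.
    public static boolean sameFileSystem(String a, String b) {
        URI ua = URI.create(a);
        URI ub = URI.create(b);
        return Objects.equals(ua.getScheme(), ub.getScheme())
                && Objects.equals(ua.getAuthority(), ub.getAuthority());
    }

    public static void main(String[] args) {
        String walDir = "hdfs://mycluster/walontest";   // hbase.wal.dir from the report
        String rootDir = "wasb://xxxxx@yyyyy/hbase";    // hbase.rootdir from the report
        String region = "b7fd7db5694eb71190955292b3ff7648";

        // What the log shows being checked: the region dir resolved under walDir.
        System.out.println("checked:  " + regionDir(walDir, "default", "tb1", region));
        // Where the region data lives according to the report: under rootDir.
        System.out.println("expected: " + regionDir(rootDir, "default", "tb1", region));
        System.out.println("same filesystem: " + sameFileSystem(walDir, rootDir));
    }
}
```

When the two settings point at the same location, both calls produce the same path, which would explain why the problem only shows up once hbase.wal.dir is moved away from hbase.rootdir.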


