You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2012/08/28 23:17:07 UTC

[jira] [Created] (HBASE-6679) RegionServer aborts due to race between compaction and split

Devaraj Das created HBASE-6679:
----------------------------------

             Summary: RegionServer aborts due to race between compaction and split
                 Key: HBASE-6679
                 URL: https://issues.apache.org/jira/browse/HBASE-6679
             Project: HBase
          Issue Type: Bug
            Reporter: Devaraj Das
            Assignee: Devaraj Das
             Fix For: 0.92.2
         Attachments: rs-crash-parallel-compact-split.log

In our nightlies, we have seen RS aborts due to compaction and split racing. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443568#comment-13443568 ] 

Ted Yu commented on HBASE-6679:
-------------------------------

Looks like we need some locking mechanism to make compacting and splitting the same region mutually exclusive.
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.2
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443805#comment-13443805 ] 

Andrew Purtell commented on HBASE-6679:
---------------------------------------

Then why not move split handling back to the compaction threads, do away with the separate split handler threads. Can someone articulate a compelling benefit to the contrary?
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.2
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462352#comment-13462352 ] 

Devaraj Das commented on HBASE-6679:
------------------------------------

Am looking, Ram. I will post some update soon.
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463494#comment-13463494 ] 

Devaraj Das commented on HBASE-6679:
------------------------------------

Okay, did some digging into the logs (that was attached in the jira earlier) and the code. Doesn't seem like a race between compaction and split (apologies for the confusion I might have created). The two are sequential (at the end of a compaction, split is requested for). But I'll note that the split happens in a separate thread.

The problem is that the daughter tries to open a reader to a file that doesn't exist. 
{noformat}
java.io.IOException: Failed ip-10-4-197-133.ec2.internal,60020,1346119706203-daughterOpener=4efb1c92918bbf3c54d0ead3345bb735
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:368)
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:456)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hbase/data/TestLoadAndVerify_1346120615716/5689a8785bbc9a8aa8e526cd7ef1542a/f1/5a55df83829f401993d95ecf2e539ba1
{noformat}

The method SplitTransaction.createDaughters creates the reference files (via a call to the method SplitTransaction.splitStoreFiles) that the daughter then tries to open. The list of files to create references to is the set of entries in the storeFiles field in Store.java (obtained via the call to this.parent.close in createDaughters). The storeFiles is last updated (in the thread doing the compaction) in the method Store.completeCompaction.

My suspicion is that the problem is due to the fact that accesses to storeFiles is not synchronized, and it not volatile either. This leads to inconsistencies in the compaction-thread and split-thread and the split thread doesn't see the last updated value of the field.

If the above theory is right (and I have this theory only), then the solution could be to make the storeFiles field volatile.

Thoughts?
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464299#comment-13464299 ] 

Lars Hofhansl commented on HBASE-6679:
--------------------------------------

I assume 0.94 and trunk have the issue, right?
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463548#comment-13463548 ] 

Devaraj Das commented on HBASE-6679:
------------------------------------

bq. For sure the regions was not doubly-assigned? Split happened of the region on one server but on another server, the same region was being compacted? You'd need the master logs to figure it a dbl-assign

Unfortunately, didn't save the master logs when the failure happened.. 

bq. Can you figure a place where we'd be running compactions on a region concurrent w/ our splitting it? Compacting we take out write lock. Doesnt look like any locks while SplitTransaction is running (closing parent, it'll need write lock... thats after daughters open though).

I can't figure out a place where this could happen in the natural execution of the regionserver.

bq. Storefiles are an ImmutableList.

Yes.. but that still could be exposed to the problems of memory inconsistencies when multiple threads are accessing the object in unsynchronized/non-volatile ways, no?

bq. @Deva

After a long time, someone addressed me by that name :-)

bq. So before this itself the region got closed. I feel the store file list should have been updated by the time. No ?

Can't say Ram for sure. There is no guarantee unless the access (read/write) are synchronized or the field is declared volatile..

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6679:
-------------------------------

    Attachment: 6679-1.patch

Patch as per the previous comment.
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6679:
--------------------------

    Fix Version/s:     (was: 0.92.2)
                   0.92.3
    
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-6679.
--------------------------

       Resolution: Fixed
    Fix Version/s: 0.96.0
                   0.94.3
     Hadoop Flags: Reviewed

Committed to 0.92, 0.94 and trunk.  Thanks for the patch Devaraj.
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.3, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461594#comment-13461594 ] 

ramkrishna.s.vasudevan commented on HBASE-6679:
-----------------------------------------------

Any updates on this issue? 

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-6679:
---------------------------------

    Fix Version/s:     (was: 0.94.3)
                   0.94.2
    
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464107#comment-13464107 ] 

stack commented on HBASE-6679:
------------------------------

bq. Yes.. but that still could be exposed to the problems of memory inconsistencies when multiple threads are accessing the object in unsynchronized/non-volatile ways, no?

We swap in the new list when compaction completes.  I suppose we could make the reference volatile so all threads catch the update.

But you can't see how the two threads can run concurrently? (not to say it not possible)

I'd say close this issue if you don't have the logs to disprove it a case of dbl-assign, a more plausible explanation to my mind?

Good on you Deva.
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467512#comment-13467512 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-0.94-security #58 (See [https://builds.apache.org/job/HBase-0.94-security/58/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390783)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.3, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469984#comment-13469984 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-0.92-security #143 (See [https://builds.apache.org/job/HBase-0.92-security/143/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390787)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464379#comment-13464379 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-0.94 #491 (See [https://builds.apache.org/job/HBase-0.94/491/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390783)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.3, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469884#comment-13469884 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-0.94-security-on-Hadoop-23 #8 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/8/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390783)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464280#comment-13464280 ] 

Devaraj Das commented on HBASE-6679:
------------------------------------

bq. I suppose we could make the reference volatile so all threads catch the update.

Yeah, [~stack], I have the same opinion - we close the issue with a fix that makes the reference volatile (and that'd justify my hours of debugging <smile>).

bq. But you can't see how the two threads can run concurrently? (not to say it not possible)

At least from the regionserver logs it is evident that this didn't happen. From the code, the compactions and splits happen in executors, where the split happens in an executor with a thread pool of at most one thread. Once a compaction completes, the executor fires off a request for split (that may or may not happen based on checks done within the request handler). The compaction executor doesn't wait for the split to complete, and so technically, it's possible that split & compaction could be running in parallel. But at a finer granularity, there are locks being taken at different points in split/compact (and the important places are protected with HRegion.lock). There are also checks for things like HRegion.writeState that are checked/set at places in compaction/split.

So IMHO things are wired together okay (but yeah, usual disclaimer - may have missed something :-) )

bq. Good on you Deva.

You too :-)
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6679:
-------------------------------

    Attachment: rs-crash-parallel-compact-split.log

RS logs
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.2
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464381#comment-13464381 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-TRUNK #3382 (See [https://builds.apache.org/job/HBase-TRUNK/3382/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390781)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.3, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445672#comment-13445672 ] 

stack commented on HBASE-6679:
------------------------------

Did this happen w/ a non-standard config?  Or is it just consequence of split/compaction refactoring and we should do as Andrew suggests?
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6679:
-------------------------------

    Description: In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.  (was: In our nightlies, we have seen RS aborts due to compaction and split racing. Will attach a snippet of the relevant RS logs.)
    
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.2
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6679:
-------------------------

    Attachment: 6679-1.094.patch

0.94 version
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463535#comment-13463535 ] 

ramkrishna.s.vasudevan commented on HBASE-6679:
-----------------------------------------------

@Deva
Am not able to tell clearly what is the problem.  I too went thro those logs and found that the region 5689a8785bbc9a8aa8e526cd7ef1542a has completed the compaction.

{code}
2012-08-28 06:15:34,107 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=TestLoadAndVerify_1346120615716,\xD8\x0D\x03\x00\x00\x00\x00\x00/000007_0,1346125261573.5689a8785bbc9a8aa8e526cd7ef1542a., storeName=f1, fileCount=3, fileSize=27.3m, priority=3, time=14360293782301; duration=4sec

{code}
and later the split has started for the region (after 2 ms)
{code}
2012-08-28 06:15:34,109 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region TestLoadAndVerify_1346120615716,\xD8\x0D\x03\x00\x00\x00\x00\x00/000007_0,1346125261573.5689a8785bbc9a8aa8e526cd7ef1542a.
{code}
The offlining of the region is done here

{code}
2012-08-28 06:15:34,788 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region TestLoadAndVerify_1346120615716,\xD8\x0D\x03\x00\x00\x00\x00\x00/000007_0,1346125261573.5689a8785bbc9a8aa8e526cd7ef1542a. in META
{code}
So before this itself the region got closed. I feel the store file list should have been updated by the time. No ?

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463523#comment-13463523 ] 

stack commented on HBASE-6679:
------------------------------

For sure the regions was not doubly-assigned? Split happened of the region on one server but on another server, the same region was being compacted?  You'd need the master logs to figure it a dbl-assign.

Storefiles are an ImmutableList.

Can you figure a place where we'd be running compactions on a region concurrent w/ our splitting it?  Compacting we take out write lock.  Doesnt look like any locks while SplitTransaction is running (closing parent, it'll need write lock... thats after daughters open though).
                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6679) RegionServer aborts due to race between compaction and split

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464631#comment-13464631 ] 

Hudson commented on HBASE-6679:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #195 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/195/])
    HBASE-6679 RegionServer aborts due to race between compaction and split (Revision 1390781)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java

                
> RegionServer aborts due to race between compaction and split
> ------------------------------------------------------------
>
>                 Key: HBASE-6679
>                 URL: https://issues.apache.org/jira/browse/HBASE-6679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3, 0.94.3, 0.96.0
>
>         Attachments: 6679-1.094.patch, 6679-1.patch, rs-crash-parallel-compact-split.log
>
>
> In our nightlies, we have seen RS aborts due to compaction and split racing. Original parent file gets deleted after the compaction, and hence, the daughters don't find the parent data file. The RS kills itself when this happens. Will attach a snippet of the relevant RS logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira