You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2013/04/06 05:25:15 UTC
[jira] [Resolved] (HBASE-8276) Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"

     [ https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl resolved HBASE-8276.
----------------------------------

    Resolution: Fixed
    
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8276
>                 URL: https://issues.apache.org/jira/browse/HBASE-8276
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.94.7
>
>         Attachments: hbase-8276.patch, hbase-8276-v1.patch
>
>
> In recent tests, we found situations that when some data nodes are down and file operations are slow depending on underlying hdfs timeout(normally 30 secs and socket connection timeout maybe around 1 min). While split log task heart beat time out is only 25 secs, a split log task will be preempted by SplitLogManager and assign to someone else after the 25 secs. On a small cluster, you'll see the same task is keeping bounced back & force for a while. I pasted a snippet of related logs below. You can search "preempted from" to see a task is preempted.
> {code}
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159, length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp, length=0
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp, length=0
> 2013-04-01 21:22:09,630 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp, length=0
> 2013-04-01 21:22:09,666 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp, length=0
> 2013-04-01 21:22:09,722 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp, length=0
> 2013-04-01 21:22:09,734 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp, length=0
> 2013-04-01 21:22:09,770 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp, length=0
> 2013-04-01 21:22:34,985 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task /hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159 preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current task state and owner=unassigned ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk so here comes the backport. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira