Posted to reviews@iotdb.apache.org by GitBox <gi...@apache.org> on 2021/02/20 07:22:03 UTC

[GitHub] [iotdb] wangchao316 opened a new pull request #2702: IOTDB-1136 Improved reliability in flush error

wangchao316 opened a new pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702


   Flushing can fail because of OOM, insufficient disk space, and similar errors. When that happens, the cluster enters read-only mode.
   We need a mechanism to recover from read-only mode.
   If the fault is caused by a common error:
   1. If the flush fails, it is retried. By default, the flush is retried three times at an interval of 1 s.
   2. After read-only mode is entered, the system attempts to recover from read-only mode and pull the data every 10 minutes.
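
   The retry behavior in item 1 can be sketched as follows. This is a minimal illustration, not the PR's implementation: `FlushRetrySketch` and `callWithRetry` are hypothetical names, and the sketch assumes a fixed interval between attempts. It only catches `Exception`, so an `OutOfMemoryError` (which is an `Error`) would not be retried here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of the PR: retries a task a fixed number of times. */
final class FlushRetrySketch {
  private FlushRetrySketch() {}

  /**
   * Runs the task up to maxAttempts times, sleeping intervalMs after each failed attempt
   * except the last one. Rethrows the last failure if every attempt fails.
   */
  static <T> T callWithRetry(Callable<T> task, int maxAttempts, long intervalMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (InterruptedException e) {
        // Do not retry when the thread was interrupted; restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          TimeUnit.MILLISECONDS.sleep(intervalMs);
        }
      }
    }
    throw last;
  }
}
```

   With the defaults described above, a call might look like `FlushRetrySketch.callWithRetry(task, 3, 1000L)`, where `task` stands for whatever performs the flush.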


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [iotdb] wangchao316 commented on a change in pull request #2702: [IOTDB-1136] Improved reliability in flush error

Posted by GitBox <gi...@apache.org>.
wangchao316 commented on a change in pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702#discussion_r579624872



##########
File path: server/src/main/java/org/apache/iotdb/db/writelog/manager/ReadOnlyRecoverManager.java
##########
@@ -0,0 +1,139 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iotdb.db.writelog.manager;
+
+import org.apache.iotdb.db.common.RetryCounter;
+import org.apache.iotdb.db.common.RetryCounterFactory;
+import org.apache.iotdb.db.conf.IoTDBDescriptor;
+import org.apache.iotdb.db.engine.StorageEngine;
+import org.apache.iotdb.db.exception.RecoverReadOnlyException;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/** read only mode recover default: 10 min once, retry 3 times */
+public class ReadOnlyRecoverManager {
+  private static final Logger logger = LoggerFactory.getLogger(ReadOnlyRecoverManager.class);
+
+  private static final int READ_ONLY_RECOVER_DEFAULT_RETRY_ATTEMPTS = 3;
+
+  private static final long READ_ONLY_RECOVER_DEFAULT_RETRY_SLEEP_IINTERVAL = 10 * 60 * 1000L;

Review comment:
       Thanks @HTHou, I will make these configurable in the config.







[GitHub] [iotdb] neuyilan commented on a change in pull request #2702: [IOTDB-1136] Improved reliability in flush error

Posted by GitBox <gi...@apache.org>.
neuyilan commented on a change in pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702#discussion_r579639847



##########
File path: server/src/main/java/org/apache/iotdb/db/common/RetryCounter.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iotdb.db.common;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+
+public class RetryCounter {
+
+  private static final Logger logger = LoggerFactory.getLogger(RetryCounter.class);
+
+  private RetryConfig retryConfig;
+  private int attempts;
+
+  public RetryCounter(RetryConfig retryConfig) {
+    this.attempts = 1;
+    this.retryConfig = retryConfig;
+  }
+
+  public void sleepToNextRetry() throws InterruptedException {
+    int attempts = getAttempts();
+    long sleepTime = getBackoffTime();
+    logger.trace("Sleeping {} ms before retry {}", sleepTime, attempts);
+    retryConfig.getTimeUnit().sleep(sleepTime);
+    useRetry();
+  }
+
+  public boolean shouldRetry() {
+    return attempts < retryConfig.getMaxAttempts();
+  }
+
+  public RetryConfig getRetryConfig() {
+    return retryConfig;
+  }
+
+  public int getAttempts() {
+    return attempts;
+  }
+
+  public long getBackoffTime() {
+    return this.retryConfig.backoffPolicy.getBackoffTime(this.retryConfig, getAttempts());
+  }
+
+  private void useRetry() {
+    attempts++;
+  }
+
+  public static class RetryConfig {
+    private int maxAttempts;
+    private long sleepInterval;
+    private TimeUnit timeUnit;
+    private BackoffPolicy backoffPolicy;
+    private float jitter;
+
+    private static final BackoffPolicy DEFAULT_BACKOFF_POLICY = new ExponentialBackoffPolicy();
+
+    public RetryConfig() {
+      maxAttempts = 1;
+      sleepInterval = 100;
+      timeUnit = TimeUnit.MILLISECONDS;
+      backoffPolicy = DEFAULT_BACKOFF_POLICY;
+      jitter = 0.0f;
+    }
+
+    public RetryConfig setMaxAttempts(int maxAttempts) {
+      this.maxAttempts = maxAttempts;
+      return this;
+    }
+
+    public RetryConfig setSleepInterval(long sleepInterval) {
+      this.sleepInterval = sleepInterval;
+      return this;
+    }
+
+    public int getMaxAttempts() {
+      return maxAttempts;
+    }
+
+    public long getSleepInterval() {
+      return sleepInterval;
+    }
+
+    public TimeUnit getTimeUnit() {
+      return timeUnit;
+    }
+
+    public float getJitter() {
+      return jitter;
+    }
+  }
+
+  private static long addJitter(long interval, float jitter) {
+    long jitterInterval = (long) (interval * ThreadLocalRandom.current().nextFloat() * jitter);
+    return interval + jitterInterval;
+  }
+
+  public static class BackoffPolicy {
+    public long getBackoffTime(RetryConfig config, int attempts) {
+      return addJitter(config.getSleepInterval(), config.getJitter());
+    }
+  }
+
+  public static class ExponentialBackoffPolicy extends BackoffPolicy {
+    @Override
+    public long getBackoffTime(RetryConfig config, int attempts) {
+      long backoffTime = (long) (config.getSleepInterval() * Math.pow(2, attempts));

Review comment:
       `long backoffTime = (long) (config.getSleepInterval() * Math.pow(2, attempts));`
   
   It seems the line above is never used?
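
   The quoted hunk ends at this line, so the rest of the method is not visible here. As a minimal sketch of how an exponential policy would normally consume such a value, assuming the computed time is passed to the jitter helper and returned (the class and method names below are hypothetical, and this is not the PR's code):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical stand-alone sketch; not the PR's RetryCounter. */
final class ExponentialBackoffSketch {
  private ExponentialBackoffSketch() {}

  /** Adds a random extra delay of at most interval * jitter. */
  static long addJitter(long interval, float jitter) {
    long jitterInterval = (long) (interval * ThreadLocalRandom.current().nextFloat() * jitter);
    return interval + jitterInterval;
  }

  /** Doubles the base interval for each attempt, then applies jitter. */
  static long backoffTime(long sleepInterval, float jitter, int attempts) {
    long backoffTime = (long) (sleepInterval * Math.pow(2, attempts));
    // The computed value is used here: it is the input to addJitter and so determines the result.
    return addJitter(backoffTime, jitter);
  }
}
```

   If the method instead returned `super.getBackoffTime(config, attempts)`, the local variable would indeed be dead code and every retry would wait the same base interval.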







[GitHub] [iotdb] wangchao316 commented on a change in pull request #2702: [IOTDB-1136] Improved reliability in flush error

Posted by GitBox <gi...@apache.org>.
wangchao316 commented on a change in pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702#discussion_r579910869



##########
File path: server/src/main/java/org/apache/iotdb/db/common/RetryCounter.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iotdb.db.common;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+
+public class RetryCounter {
+
+  private static final Logger logger = LoggerFactory.getLogger(RetryCounter.class);
+
+  private RetryConfig retryConfig;
+  private int attempts;
+
+  public RetryCounter(RetryConfig retryConfig) {
+    this.attempts = 1;
+    this.retryConfig = retryConfig;
+  }
+
+  public void sleepToNextRetry() throws InterruptedException {
+    int attempts = getAttempts();
+    long sleepTime = getBackoffTime();
+    logger.trace("Sleeping {} ms before retry {}", sleepTime, attempts);
+    retryConfig.getTimeUnit().sleep(sleepTime);
+    useRetry();
+  }
+
+  public boolean shouldRetry() {
+    return attempts < retryConfig.getMaxAttempts();
+  }
+
+  public RetryConfig getRetryConfig() {
+    return retryConfig;
+  }
+
+  public int getAttempts() {
+    return attempts;
+  }
+
+  public long getBackoffTime() {
+    return this.retryConfig.backoffPolicy.getBackoffTime(this.retryConfig, getAttempts());
+  }
+
+  private void useRetry() {
+    attempts++;
+  }
+
+  public static class RetryConfig {
+    private int maxAttempts;
+    private long sleepInterval;
+    private TimeUnit timeUnit;
+    private BackoffPolicy backoffPolicy;
+    private float jitter;
+
+    private static final BackoffPolicy DEFAULT_BACKOFF_POLICY = new ExponentialBackoffPolicy();
+
+    public RetryConfig() {
+      maxAttempts = 1;
+      sleepInterval = 100;
+      timeUnit = TimeUnit.MILLISECONDS;
+      backoffPolicy = DEFAULT_BACKOFF_POLICY;
+      jitter = 0.0f;
+    }
+
+    public RetryConfig setMaxAttempts(int maxAttempts) {
+      this.maxAttempts = maxAttempts;
+      return this;
+    }
+
+    public RetryConfig setSleepInterval(long sleepInterval) {
+      this.sleepInterval = sleepInterval;
+      return this;
+    }
+
+    public int getMaxAttempts() {
+      return maxAttempts;
+    }
+
+    public long getSleepInterval() {
+      return sleepInterval;
+    }
+
+    public TimeUnit getTimeUnit() {
+      return timeUnit;
+    }
+
+    public float getJitter() {
+      return jitter;
+    }
+  }
+
+  private static long addJitter(long interval, float jitter) {
+    long jitterInterval = (long) (interval * ThreadLocalRandom.current().nextFloat() * jitter);
+    return interval + jitterInterval;
+  }
+
+  public static class BackoffPolicy {
+    public long getBackoffTime(RetryConfig config, int attempts) {
+      return addJitter(config.getSleepInterval(), config.getJitter());
+    }
+  }
+
+  public static class ExponentialBackoffPolicy extends BackoffPolicy {
+    @Override
+    public long getBackoffTime(RetryConfig config, int attempts) {
+      long backoffTime = (long) (config.getSleepInterval() * Math.pow(2, attempts));

Review comment:
       Thanks @neuyilan, this line is reached through the sleepToNextRetry -> getBackoffTime call chain.
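
   To make the call chain concrete, here is a small sketch of the sleep schedule implied by the quoted code. It rests on assumptions: the exponential value is what `getBackoffTime` ends up returning, and jitter is 0. `RetryScheduleSketch` is a hypothetical name used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch that replays the counter logic from the quoted RetryCounter. */
final class RetryScheduleSketch {
  private RetryScheduleSketch() {}

  /** Returns the delays, in order, that a caller would sleep before giving up. */
  static List<Long> sleepSchedule(int maxAttempts, long sleepInterval) {
    List<Long> delays = new ArrayList<>();
    int attempts = 1; // same starting value as the quoted constructor
    while (attempts < maxAttempts) { // mirrors shouldRetry()
      delays.add((long) (sleepInterval * Math.pow(2, attempts))); // mirrors the quoted line
      attempts++; // mirrors useRetry()
    }
    return delays;
  }
}
```

   Under these assumptions, `sleepSchedule(3, 100)` yields two sleeps, 200 and 400, because `attempts` starts at 1 and `shouldRetry()` stops once it reaches `maxAttempts`. The first delay is already twice the base interval.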







[GitHub] [iotdb] HTHou commented on a change in pull request #2702: [IOTDB-1136] Improved reliability in flush error

Posted by GitBox <gi...@apache.org>.
HTHou commented on a change in pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702#discussion_r579624398



##########
File path: server/src/main/java/org/apache/iotdb/db/writelog/manager/ReadOnlyRecoverManager.java
##########
@@ -0,0 +1,139 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iotdb.db.writelog.manager;
+
+import org.apache.iotdb.db.common.RetryCounter;
+import org.apache.iotdb.db.common.RetryCounterFactory;
+import org.apache.iotdb.db.conf.IoTDBDescriptor;
+import org.apache.iotdb.db.engine.StorageEngine;
+import org.apache.iotdb.db.exception.RecoverReadOnlyException;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+/** read only mode recover default: 10 min once, retry 3 times */
+public class ReadOnlyRecoverManager {
+  private static final Logger logger = LoggerFactory.getLogger(ReadOnlyRecoverManager.class);
+
+  private static final int READ_ONLY_RECOVER_DEFAULT_RETRY_ATTEMPTS = 3;
+
+  private static final long READ_ONLY_RECOVER_DEFAULT_RETRY_SLEEP_IINTERVAL = 10 * 60 * 1000L;

Review comment:
       Hi, I think these two fields could be configured and users might like to change them.
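
   One way this could look, as a rough sketch only: read both values from a `java.util.Properties` object and fall back to the current constants. The property keys and the class name below are hypothetical and are not IoTDB's actual configuration keys.

```java
import java.util.Properties;

/** Hypothetical sketch; the property keys are assumptions made for illustration. */
final class ReadOnlyRecoverConfigSketch {
  static final int DEFAULT_RETRY_ATTEMPTS = 3;
  static final long DEFAULT_RETRY_SLEEP_INTERVAL_MS = 10 * 60 * 1000L;

  // Hypothetical keys, not taken from the IoTDB configuration file.
  static final String KEY_RETRY_ATTEMPTS = "read_only_recover_retry_attempts";
  static final String KEY_RETRY_SLEEP_INTERVAL_MS = "read_only_recover_retry_sleep_interval_ms";

  final int retryAttempts;
  final long retrySleepIntervalMs;

  /** Reads both settings, using the defaults when a key is absent. */
  ReadOnlyRecoverConfigSketch(Properties properties) {
    this.retryAttempts =
        Integer.parseInt(
            properties
                .getProperty(KEY_RETRY_ATTEMPTS, Integer.toString(DEFAULT_RETRY_ATTEMPTS))
                .trim());
    this.retrySleepIntervalMs =
        Long.parseLong(
            properties
                .getProperty(
                    KEY_RETRY_SLEEP_INTERVAL_MS, Long.toString(DEFAULT_RETRY_SLEEP_INTERVAL_MS))
                .trim());
  }
}
```

   With an empty `Properties` object the behavior stays as it is today (3 attempts, 10 minutes), so existing deployments would not change unless a user sets the keys.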







[GitHub] [iotdb] wangchao316 commented on pull request #2702: [IOTDB-1136] Improved reliability in flush error

Posted by GitBox <gi...@apache.org>.
wangchao316 commented on pull request #2702:
URL: https://github.com/apache/iotdb/pull/2702#issuecomment-783132270


   @qiaojialin @jixuan1989 Hi, could you please review this?
   Thanks.

