Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2013/06/17 17:23:24 UTC

[jira] [Commented] (HBASE-8755) A new write thread model for HLog to improve the overall HBase write throughput

    [ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685646#comment-13685646 ] 

Ted Yu commented on HBASE-8755:
-------------------------------

{code}
+  private final AtomicLong failedTxid = new AtomicLong(0);
{code}
Add a comment describing the purpose of the above variable.
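For instance, if my reading of how the field is used later in the patch is right (please confirm), something along these lines:
{code}
  // txid of the most recent write/sync that failed; handler threads waiting in
  // syncer() compare their own txid against this to decide whether the recorded
  // asyncIOE applies to them  (assumed intent - please confirm)
  private final AtomicLong failedTxid = new AtomicLong(0);
{code}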
{code}
-    // When optionalFlushInterval is set as 0, don't start a thread for deferred log sync.
-    if (this.optionalFlushInterval > 0) {
{code}
Is deferred log sync still supported ?
{code}
+    } catch (InterruptedException e) {
+      LOG.error("Exception while waiting for syncer thread to die", e);
{code}
The error message needs to be updated.
{code}
+      synchronized (bufferLock) {
+        doWrite(regionInfo, logKey, logEdit, htd);
+        txid = this.unflushedEntries.incrementAndGet();
+      }
+      this.asyncWriter.setPendingTxid(txid);
{code}
Should setPendingTxid() call be protected by the bufferLock ?
In setPendingTxid():
{code}
+    public void setPendingTxid(long txid) {
+      if (txid <= this.pendingTxid)
+        return;
{code}
What if two simultaneous calls with different txids take place ? Both txids would be greater than this.pendingTxid, but the smaller txid may end up being written last.
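One way to rule out that reordering is to make the check-and-set atomic; a sketch only, not the patch's code (I'm assuming the AsyncWriter thread object itself is the monitor being notified):
{code}
    public void setPendingTxid(long txid) {
      synchronized (this) {
        // pendingTxid only ever moves forward, regardless of arrival order
        if (txid <= this.pendingTxid) {
          return;
        }
        this.pendingTxid = txid;
        this.notify();   // wake up the AsyncWriter thread
      }
    }
{code}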

In several comments, replace 'local buffered writes' with 'buffered writes'
{code}
+          assert this.pendingTxid > this.lastWrittenTxid :
+            "pendingTxid not greater than lastWrittenTxid when wake-up!";
{code}
Include the two txids in the message above.
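e.g. something like:
{code}
          assert this.pendingTxid > this.lastWrittenTxid :
            "pendingTxid=" + this.pendingTxid + " not greater than lastWrittenTxid="
              + this.lastWrittenTxid + " when wake-up!";
{code}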
{code}
+            asyncIOE = e;
+            failedTxid.set(this.pendingTxid);
+         }
+
+          // 4. update 'lastWrittenTxid' and notify AsyncFlusher to do 'sync'
+          this.lastWrittenTxid = this.pendingTxid;
{code}
In the error condition, this.pendingTxid (now recorded in failedTxid) is still assigned to this.lastWrittenTxid. Is that safe ?
Since AsyncFlusher does 'sync', should it be named AsyncSyncer ?
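If the intent is that waiting handlers are still released and then see the recorded asyncIOE from syncer(), a comment here would help; e.g. (assumed intent, please confirm):
{code}
          // advance lastWrittenTxid even when the write failed, so that waiting
          // handlers are released; they are expected to check failedTxid/asyncIOE
          // and rethrow the IOException from syncer()  (assumed - please confirm)
          this.lastWrittenTxid = this.pendingTxid;
{code}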
{code}
+      } catch (InterruptedException e) {
+        LOG.debug(getName() + " interrupted while waiting for " +
{code}
Please restore interrupt status.
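i.e. (sketch):
{code}
      } catch (InterruptedException e) {
        LOG.debug(getName() + " interrupted while waiting for ...");
        Thread.currentThread().interrupt();   // restore interrupt status
      }
{code}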
{code}
+  // to make durability of those WALEdits on HDFS side
{code}
Should be 'to make those WALEdits durable on HDFS side'
{code}
+          long now = System.currentTimeMillis();
{code}
Use EnvironmentEdge.
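i.e.:
{code}
          // EnvironmentEdgeManager lets tests inject a controllable clock
          long now = EnvironmentEdgeManager.currentTimeMillis();
{code}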
{code}
                requestLogRoll();
+              }
+            } catch (IOException e) {
+              LOG.debug("writer.getLength() failed");
{code}
Should the log level be at warn or error ? Is writer.getLength() relevant here ?
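If the call stays, it would help to log the exception and raise the level, e.g.:
{code}
            } catch (IOException e) {
              LOG.warn("writer.getLength() failed", e);
{code}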

AsyncNotifier does notification by calling syncedTillHere.notifyAll(). Can this part be folded into AsyncFlusher ?
{code}
+  void addPendingWrites(Entry e) throws IOException {
{code}
Rename the above method to addPendingWrite() since there is only one Entry involved.
{code}
+  // it's caller's responsibility to hold updateLock before call this method
+  List<Entry> getPendingWrites() {
{code}
bufferLock is held before calling the above method. Update comment accordingly.
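i.e.:
{code}
  // it's the caller's responsibility to hold bufferLock before calling this method
  List<Entry> getPendingWrites() {
{code}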
{code}
+        } catch (InterruptedException e) {
+          LOG.debug("interrupted while waiting for notification from AsyncNotifier");
{code}
Restore interrupt status.
                
> A new write thread model for HLog to improve the overall HBase write throughput
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-8755
>                 URL: https://issues.apache.org/jira/browse/HBASE-8755
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Feng Honghua
>         Attachments: HBASE-8755-0.94-V0.patch
>
>
> In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) => HLog writer append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock.
> The only existing optimization, checking whether the current syncTillHere > txid in the hope that another thread has already written/synced its txid to hdfs so that the write/sync can be skipped, actually helps much less than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X throughput improvement (from 17000 to 70000+).
> I applied this new write thread model to HLog, and the performance tests in our test cluster show about a 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K row-size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested.
> The changes for the new write thread model are as below:
>  1> All put handler threads append their edits to HLog's local pending buffer; (each notifies the AsyncWriter thread that there are new edits in the local buffer)
>  2> All put handler threads wait in the HLog.syncer() function for the underlying threads to finish the sync that covers their txid;
>  3> A single AsyncWriter thread is responsible for retrieving all the buffered edits from HLog's local pending buffer and writing them to hdfs (hlog.writer.append); (it notifies the AsyncFlusher thread that there are new writes to hdfs that need a sync)
>  4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs to persist the writes made by AsyncWriter; (it notifies the AsyncNotifier thread that the sync watermark has increased)
>  5> A single AsyncNotifier thread is responsible for notifying all pending put handler threads which are waiting in the HLog.syncer() function
>  6> There is no LogSyncer thread any more (since the AsyncWriter/AsyncFlusher threads already do the job it did)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira