Posted to dev@accumulo.apache.org by keith-turner <gi...@git.apache.org> on 2016/10/29 00:34:24 UTC

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/176

    Wrote blog post about durability and performance

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo durability

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #176
    
----
commit 6b98fd4e7ecb76075bd0c713867401f614d8fd04
Author: Keith Turner <ke...@deenlo.com>
Date:   2016-10-28T02:55:30Z

    Wrote blog post about durability and performance

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85737453
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet server may group
    +all of these into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increases throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    --- End diff --
    
    I like the wording `most durable` so much better.
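
The quoted post compares `hflush` and `hsync` on the WAL output stream. Below is a
minimal sketch of the two calls on an HDFS stream; it is not code from the post or the
PR, and it assumes a Hadoop 2.x client on the classpath and a made-up path.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalFlushSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example-wal"))) {
            out.write("mutation data".getBytes(StandardCharsets.UTF_8));

            // hflush: data reaches the OS buffers on every datanode in the pipeline.
            // Fast, but data can be lost if all replica machines reboot.
            out.hflush();

            out.write("more mutation data".getBytes(StandardCharsets.UTF_8));

            // hsync: each datanode forces data to disk before the call returns.
            // Durable across reboots, but much slower than hflush.
            out.hsync();
        }
    }
}
```

The only difference between the two calls is whether each datanode forces its local
files to disk before acknowledging, which is where the 1ms versus 50ms gap described
in the post comes from.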



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754037
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet server may group
    +all of these into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increases throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one table's settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following commands in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system-wide table default of `flush`.  The
    +next two commands override the `sync` setting specific to the metadata and root tables.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    --- End diff --
    
    made updates in f5d8401 ... hopefully this is more clear now... it's a very nuanced point I am trying to communicate here... which is why it's harder to configure 1.7 correctly for `hflush`
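
The shell commands quoted above can also be applied through the Java client API. A
sketch, assuming an existing `Connector` with permission to change system and table
properties (the class and method structure here is illustrative, not from the post):

```java
import org.apache.accumulo.core.client.Connector;

public class SetDurabilityToFlush {
    public static void configure(Connector conn) throws Exception {
        // System-wide table default, equivalent to: config -s table.durability=flush
        conn.instanceOperations().setProperty("table.durability", "flush");

        // Override the metadata and root tables, which otherwise default to sync,
        // equivalent to: config -t <table> -s table.durability=flush
        conn.tableOperations().setProperty("accumulo.metadata", "table.durability", "flush");
        conn.tableOperations().setProperty("accumulo.root", "table.durability", "flush");
    }
}
```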



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85749578
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    Each tablet has its own IMM.



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754195
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    --- End diff --
    
    made updates in f5d8401



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754082
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet server may group
    +all of these into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increases throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one table's settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following commands in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system-wide table default of `flush`.  The
    +next two commands override the `sync` setting specific to the metadata and root tables.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    +```
    +
    +Even with these settings adjusted, minor compactions could still force `hsync`
    +to be called in 1.7.0 and 1.7.1.  This was fixed in 1.7.2 and 1.8.0.  See the
    +[1.7.2 release notes][172_RN_MCHS] and [ACCUMULO-4112] for more details.
    +
    +In addition to the per table durability setting, a per batch writer durability
    +setting was also added in 1.7.0.  See
    +[BatchWriterConfig.setDurability(...)][SD].  This means any client could
    +potentially cause a `hsync` operation to occur, even if the system is
    +configured to use `hflush`.
    +
    +## Improving the situation
    +
    +The more granular durability settings introduced in 1.7.0 can cause some
    +unexpected problems.  [ACCUMULO-4146] suggests one possible way to solve these
    +problems: per-durability write ahead logs.
    +
    +[fcf]: https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#force-boolean-
    +[ros1]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/ReplicaOutputStreams.java#L78
    +[ros2]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/ReplicaOutputStreams.java#L87
    +[fos]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java#L358
    +[ACCUMULO-4146]: https://issues.apache.org/jira/browse/ACCUMULO-4146
    +[ACCUMULO-4112]: https://issues.apache.org/jira/browse/ACCUMULO-4112
    +[160_RN_WAL]: /release_notes/1.6.0#slower-writes-than-previous-accumulo-versions
    --- End diff --
    
    made updates in f5d8401
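
The quoted pointers reference `FileChannel.force(boolean)`, which can also be exercised
against a local filesystem. A small sketch of the data-only versus data-plus-metadata
distinction the post describes; the file name is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ForceSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/example-block"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("wal entry".getBytes(StandardCharsets.UTF_8)));

            // force(false): flush file data only.
            ch.force(false);

            // force(true): flush file data and filesystem metadata.  This is what the
            // datanode does for both the block file and its checksum file on hsync.
            ch.force(true);
        }
    }
}
```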



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632330
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet server may group
    +all of these into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increases throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    --- End diff --
    
    Which of the above 4 options are the default?
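
The four durability levels quoted above are also exposed to clients through the
`Durability` enum used by `BatchWriterConfig.setDurability(...)`, discussed later in
the post. A sketch, assuming an existing `Connector` and a hypothetical table named
"test":

```java
import java.nio.charset.StandardCharsets;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Durability;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class DurabilitySketch {
    public static void write(Connector conn) throws Exception {
        // Per-writer durability; any value other than DEFAULT applies to this
        // writer's mutations instead of the table's table.durability setting.
        BatchWriterConfig cfg = new BatchWriterConfig().setDurability(Durability.FLUSH);

        BatchWriter writer = conn.createBatchWriter("test", cfg);
        Mutation m = new Mutation("row1");
        m.put("fam", "qual", new Value("val".getBytes(StandardCharsets.UTF_8)));
        writer.addMutation(m);
        writer.close();
    }
}
```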



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85749397
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    --- End diff --
    
    I don't think so, but not 100% sure.  Would have to go look at code.  The docs do not mention this option.  1.6 did have a per table option to turn the WAL on and off.  This was also superseded by `table.durability`.   I had forgotten about this property. 



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85742229
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    --- End diff --
    
    The latter two would definitely be best, but I didn't want to impose that you _had_ to do either of them :). I'm not sure who would be most likely to know something like that off the top of their head. Maybe Todd Lipcon?



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632504
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about is whether these sync operations occur
    +in parallel on the replicas on different datanodes.  Let's hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet server may group
    +all of these into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increases throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one table's settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following commands in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system-wide table default of `flush`.  The
    +next two commands override the `sync` setting specific to the metadata and root tables.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    +```
    +
    +Even with these settings adjusted, minor compactions could still force `hsync`
    +to be called in 1.7.0 and 1.7.1.  This was fixed in 1.7.2 and 1.8.0.  See the
    +[1.7.2 release notes][172_RN_MCHS] and [ACCUMULO-4112] for more details.
    +
    +In addition to the per table durability setting, a per batch writer durability
    +setting was also added in 1.7.0.  See
    +[BatchWriterConfig.setDurability(...)][SD].  This means any client could
    +potentially cause a `hsync` operation to occur, even if the system is
    +configured to use `hflush`.
    +
    +## Improving the situation
    +
    +The more granular durability settings introduced in 1.7.0 can cause some
    +unexpected problems.  [ACCUMULO-4146] suggests one possible way to solve these
    +problems: per-durability write ahead logs.
    +
    +[fcf]: https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#force-boolean-
    +[ros1]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/ReplicaOutputStreams.java#L78
    +[ros2]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/ReplicaOutputStreams.java#L87
    +[fos]: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java#L358
    +[ACCUMULO-4146]: https://issues.apache.org/jira/browse/ACCUMULO-4146
    +[ACCUMULO-4112]: https://issues.apache.org/jira/browse/ACCUMULO-4112
    +[160_RN_WAL]: /release_notes/1.6.0#slower-writes-than-previous-accumulo-versions
    --- End diff --
    
    Jekyll note: you should prefix absolute links with `{{ site.baseurl }}` so the links don't break if running locally or on a test fork.



[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    > It might be how we create/use the WAL. It's possible that a tserver can create a WAL and open an output stream to the first block, but never actually write any data to it (see ACCUMULO-4004). I'm not quite sure of the state of the first block and file from an HDFS perspective if the node dies while in this state (WAL created, open output stream to the first block). But, I can say that this does cause a problem (in Accumulo 1.6.4 at least).
    
    Ok, thanks for the details, Dave. If we have a scenario that can reliably reproduce this, I can likely find an HDFS expert to give us some context on what is happening at the HDFS layer. I'm not sure how to inspect the state of that block nor would I know what would be "expected" or "correct" :)
    
    I think expanding on your original suggestion with some context from ACCUMULO-4004 would be good.



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754136
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    --- End diff --
    
    made updates in f5d8401



[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by dlmarion <gi...@git.apache.org>.
Github user dlmarion commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    > I'm not sure I understand the reasoning behind that logic (admittedly, it's "early"). Specifically, how would a failure of the local node be any different than a failure of a non-local DN in the write pipeline?
    
    It might be how we create/use the WAL. It's possible that a tserver can create a WAL and open an output stream to the first block, but never actually write any data to it (see ACCUMULO-4004). I'm not quite sure of the state of the first block and file from an HDFS perspective if the node dies while in this state (WAL created, open output stream to the first block). But, I can say that this does cause a problem (in Accumulo 1.6.4 at least).



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754107
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    --- End diff --
    
    made updates in f5d8401



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668378
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    Nit: singular "map" instead of "maps" (I think?)



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r86040011
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutations to the Tablet Server's WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensures that data is in OS buffers on each datanode and on its way to disk.  As a
    +result, calls to `hflush` are very fast.  If a WAL is replicated to 3 datanodes,
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, that's ok because the data was flushed to the OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example, `hflush` may take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one tables settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following command in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system wide table default for `flush`.  The
    +second two commands override metadata table specific settings of `sync`.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    --- End diff --
    
    I made further changes to this section in ebf9bca in an attempt to more clearly communicate why 1.7+ is trickier to configure.
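
    For anyone scripting this rather than typing the shell commands quoted above, the same properties can be set through the Java client API.  A minimal sketch (the instance name, ZooKeeper hosts, and credentials below are placeholders):

    ```
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class SetFlushDurability {
      public static void main(String[] args) throws Exception {
        // Placeholders: instance name, ZooKeeper quorum, and credentials are site specific.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // System-wide table default, same effect as: config -s table.durability=flush
        conn.instanceOperations().setProperty("table.durability", "flush");

        // Override the metadata tables' per-table setting of sync.
        conn.tableOperations().setProperty("accumulo.metadata", "table.durability", "flush");
        conn.tableOperations().setProperty("accumulo.root", "table.durability", "flush");
      }
    }
    ```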


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632382
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    --- End diff --
    
    What does "most conservative" mean here? Maybe "most durable" would be a better word choice?


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85747862
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    But isn't there still only one IMM, just partitioned by tablet? That was my distinction.
    
    Also, "Tablet Servers' WAL" looking at this again :)


---

[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by dlmarion <gi...@git.apache.org>.
Github user dlmarion commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    ACCUMULO-4000 has some information too. It's not always seen during decommissioning though.


---

[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    > so that if the node goes down and can't be brought back up, you can still recover using the WAL.
    
    I'm not sure I understand the reasoning behind that logic (admittedly, it's "early"). Specifically, how would a failure of the local node be any different from a failure of a non-local DN in the write pipeline? Also, aren't there performance implications when using hflush to avoid the local node (not just on the write side; the read side would be unable to use short-circuit reads when the local DN doesn't have those blocks)?


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85750520
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    My apologies! Ignore this one then.


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85754159
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    made updates in f5d8401


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668428
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    --- End diff --
    
    s/reboot/fail/ ?


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85738347
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    --- End diff --
    
    It's certainly not very useful.  I'm thinking I can do one of the following:
     * Omit it
     * Research it
     * Change it to ask anyone who might know the answer to this mystery to let us know.
    
    Researching it would be ideal, but I am thinking of going with the last option and asking for help.  If there is a reader w/ expertise, they could probably answer this much more quickly than I could.
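
    For anyone who wants to poke at the local filesystem cost described in the pointers above, it comes down to java.nio's `FileChannel.force`.  A minimal standalone sketch (the file path is just a scratch location):

    ```
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ForceExample {
      public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/wal-force-demo"),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
          ch.write(ByteBuffer.wrap("mutation".getBytes(StandardCharsets.UTF_8)));

          // force(false) asks only for file data to be flushed to the device.
          ch.force(false);

          // force(true) also flushes file metadata (length, timestamps), which is
          // what ReplicaOutputStreams requests and why an hsync can trigger extra
          // filesystem metadata syncs on each datanode.
          ch.force(true);
        }
      }
    }
    ```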


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668607
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one tables settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following command in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system wide table default for `flush`.  The
    +second two commands override metadata table specific settings of `sync`.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    +```
    +
    +Even with these settings adjusted, minor compactions could still force `hsync`
    +to be called in 1.7.0 and 1.7.1.  This was fixed in 1.7.2 and 1.8.0.  See the
    +[1.7.2 release notes][172_RN_MCHS] and [ACCUMULO-4112] for more details.
    +
    +In addition to the per table durability setting, a per batch writer durability
    +setting was also added in 1.7.0.  See
    +[BatchWriterConfig.setDurability(...)][SD].  This means any client could
    +potentially cause a `hsync` operation to occur, even if the system is
    +configured to use `hflush`.
    +
    +## Improving the situation
    +
    +The more granular durability settings introduced in 1.7.0 can cause some
    +unexpected problems.  [ACCUMULO-4146] suggest one possible way to solve these
    --- End diff --
    
    s/suggest/suggests/
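
    Since the quoted section above also mentions the per batch writer durability setting, here is a rough sketch of what a client opting into `sync` looks like (the table name is made up and connector setup is omitted):

    ```
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Durability;
    import org.apache.accumulo.core.data.Mutation;

    public class PerWriterDurability {
      // Assumes the caller already has a Connector; "trade_ledger" is a placeholder table name.
      static void writeWithSync(Connector conn) throws Exception {
        BatchWriterConfig cfg = new BatchWriterConfig().setDurability(Durability.SYNC);
        BatchWriter bw = conn.createBatchWriter("trade_ledger", cfg);
        Mutation m = new Mutation("row1");
        m.put("cf", "cq", "value");
        bw.addMutation(m);
        bw.close();   // with SYNC, success implies the WAL entry was hsync'd
      }
    }
    ```

    As the post notes, a writer configured this way would force an `hsync` even if every table is set to `flush`.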


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668356
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    --- End diff --
    
    nit: it's also written.
    
    Write the long-form "write ahead log" and then put the abbreviation in parens. Then, you can use "WAL" to refer to "write ahead log".


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668563
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one tables settings to slow down writes to another
    +table.  
    --- End diff --
    
    This is very interesting. I hadn't considered this before. Sounds like this would be a very good metric to expose (since it could make mutations written to one table much slower than expected).


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632402
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    +durability.  This can cause one tables settings to slow down writes to another
    +table.  
    +
    +In Accumulo 1.6, it was easy to make all writes use `hflush` because there was
    +only one tserver setting.  Getting everything to use `flush` in 1.7 and later
    +can be a little tricky because by default the Accumulo metadata table is set to
    +use `sync`.  Executing the following command in the Accumulo shell will
    +accomplish this (assuming no tables or namespaces have been specifically set to
    +`sync`).  The first command sets a system wide table default for `flush`.  The
    +second two commands override metadata table specific settings of `sync`.
    +
    +```
    +config -s table.durability=flush
    +config -t accumulo.metadata -s table.durability=flush
    +config -t accumulo.root -s table.durability=flush
    --- End diff --
    
    The two metadata tables are using flush here, not sync, as the paragraph above described.


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668330
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    --- End diff --
    
    Nit: Consistent capitalization of "Tablet Server" across the next three lines.


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85743259
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    --- End diff --
    
    A batch of mutations may go to multiple tablets.  I'll try to make that a little more clear.


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85753747
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    +
    +Each Accumulo Tablet Server has a single WAL.  When multiple clients send
    +mutations to a tablet server at around the same time, the tablet sever may group
    +all of this into a single WAL operation.  It will do this instead of writing and
    +syncing or flushing each client's mutations to the WAL separately.  Doing this
    +increase throughput and lowers average latency for clients.
    +
    +## Configuring WAL flush/sync in Accumulo 1.7+
    +
    +Accumulo 1.7.0 introduced [table.durability][17_UM_TD], a new per table property
    +for configuring durability.  It also stopped using the `tserver.wal.sync.method`
    +property.  The `table.durability` property has the following four legal values.
    +
    + * **none** : Do not write to WAL            
    + * **log**  : Write to WAL, but do not sync  
    + * **flush** : Write to WAL and call `hflush` 
    + * **sync** : Write to WAL and call `hsync`  
    +
    +If multiple writes arrive at around the same time with different durability
    +settings, then the group commit code will choose the most conservative
    --- End diff --
    
    made updates in f5d8401


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668416
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    --- End diff --
    
    s/ensure/guarantee/ ? I think that better represents the "it may or may not be on disk" notion.
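
    For readers following along, the two calls being compared are just methods on HDFS's output stream.  A minimal sketch (the path is a placeholder, and a running HDFS reachable through the default Configuration is assumed):

    ```
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFlushVsSync {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-demo"))) {
          out.write("mutation".getBytes("UTF-8"));

          // hflush: pushes data to the datanodes' OS buffers; it survives a
          // datanode process death but not all replica machines going down.
          out.hflush();

          // hsync: forces data to disk on each datanode before returning, the
          // slower but more durable option discussed in the post.
          out.hsync();
        }
      }
    }
    ```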


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85753310
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    --- End diff --
    
    I really like this much better than what I had.  I shortened it a bit when I merged it in.  


---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85739576
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    --- End diff --
    
    When I read "reboot", I thought of it as an "planned" occurrence, instead of a planned _or_ unplanned failure (such as operator error, hardware failure, etc, kernel panic, etc). Admittedly, it might be my colloquial use of the word "reboot" affecting my interpretation :)



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632246
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    --- End diff --
    
    Kind of makes the reader wonder if there are existing tickets open for HDFS to improve this situation.
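    
    To make the cost concrete, here is a rough local-filesystem sketch of the worst case described above: syncing a block file and its separate checksum file, each with `force(true)` so filesystem metadata is flushed too. The file names are made up; this is not datanode code, just an illustration of `FileChannel.force`.
    
        import java.io.RandomAccessFile;
        import java.nio.channels.FileChannel;
    
        public class ForceSketch {
          public static void main(String[] args) throws Exception {
            // Stand-ins for a datanode's block replica and its checksum (.meta) file.
            try (RandomAccessFile block = new RandomAccessFile("blk_123", "rw");
                 RandomAccessFile meta  = new RandomAccessFile("blk_123.meta", "rw")) {
              block.write(new byte[] {1, 2, 3});
              meta.write(new byte[] {4, 5});
    
              // force(true) flushes file data and local filesystem metadata,
              // mirroring FileChannel.force(true) in ReplicaOutputStreams.
              block.getChannel().force(true);  // block data + metadata
              meta.getChannel().force(true);   // checksum data + metadata
              // Worst case per datanode per hsync: four sync operations.
            }
          }
        }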



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632296
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    --- End diff --
    
    Was there no third option to do neither?
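    
    For reference, a sketch of how the knobs line up across versions. The instance name, ZooKeeper host, table name, and credentials below are hypothetical, and the 1.7+ lines assume the per-table `table.durability` property and the `Durability` setting on `BatchWriterConfig`.
    
        import org.apache.accumulo.core.client.BatchWriter;
        import org.apache.accumulo.core.client.BatchWriterConfig;
        import org.apache.accumulo.core.client.Connector;
        import org.apache.accumulo.core.client.Durability;
        import org.apache.accumulo.core.client.ZooKeeperInstance;
        import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    
        public class DurabilityConfigSketch {
          public static void main(String[] args) throws Exception {
            // Hypothetical instance, ZooKeeper host, and credentials.
            Connector conn = new ZooKeeperInstance("test", "zkhost:2181")
                .getConnector("root", new PasswordToken("secret"));
    
            // 1.6.1+: tablet-server-wide sync method, applies to every table.
            conn.instanceOperations().setProperty("tserver.wal.sync.method", "hflush");
    
            // 1.7+: per-table durability; assumes the table already exists.
            conn.tableOperations().setProperty("mytable", "table.durability", "flush");
    
            // 1.7+: durability can also be set per BatchWriter.
            BatchWriterConfig bwc = new BatchWriterConfig().setDurability(Durability.NONE);
            BatchWriter bw = conn.createBatchWriter("mytable", bwc);
            // ... add mutations here ...
            bw.close();
          }
        }
    
    (In 1.7+, the "neither" case roughly corresponds to the `log` durability level, where the mutation is written to the WAL but neither flushed nor synced, while `none` skips the WAL altogether.)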



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668511
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    --- End diff --
    
    I'd avoid these two sentences about something you don't know. Just omit them. I don't think they will have the effect you intend them to have (without knowing your mannerisms/way of speaking personally).



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85738743
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    --- End diff --
    
    I was using the word `die`... then I replaced that with `reboot`...  I am looking for the shortest way to say the operating system dies... the OS can die w/o a hardware failure, that's why I settled on reboot as a short way to communicate this in a way that I think most would understand.



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r86220565
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -7,17 +7,17 @@ author: Keith Turner
     ## Overview
     
     Accumulo stores recently written data in a sorted in memory map.  Before data is
    -added to this map, it's written to an unsorted WAL (write ahead log).  In the
    -case when a Tablet Server dies, the recently written data is recovered from the
    +added to this map, it's written to an unsorted write ahead log(WAL).  In the
    +case when a tablet server dies, the recently written data is recovered from the
     WAL.
     
     When data is written to Accumulo the following happens :
     
      * Client sends a batch of mutations to a tablet server
      * Tablet server does the following :
    -   * Writes mutation to Tablet Servers WAL
    -   * Sync or flush WAL
    -   * Adds mutations to sorted in memory maps
    +   * Writes mutation to tablet servers WAL
    --- End diff --
    
    Fixed this in a follow-on commit.



[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by dlmarion <gi...@git.apache.org>.
Github user dlmarion commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    May want to point out https://issues.apache.org/jira/browse/HDFS-3702 as an option starting with Hadoop 2.8, so that if the node goes down and can't be brought back up, you can still recover using the WAL.
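    
    For context, HDFS-3702 exposes this as a create flag. A rough sketch of what using it for a WAL might look like, assuming Hadoop 2.8+ and that the flag is named `CreateFlag.NO_LOCAL_WRITE` (the path, buffer size, and replication below are made up):
    
        import java.util.EnumSet;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.CreateFlag;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.fs.permission.FsPermission;
    
        public class NoLocalWriteSketch {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path wal = new Path("/accumulo/wal/example");  // hypothetical path
    
            // CREATE + NO_LOCAL_WRITE asks HDFS not to place a replica on the
            // local datanode, so that, per the suggestion above, the WAL can
            // still be recovered if the node goes down and cannot be brought
            // back up.
            try (FSDataOutputStream out = fs.create(wal,
                FsPermission.getFileDefault(),
                EnumSet.of(CreateFlag.CREATE, CreateFlag.NO_LOCAL_WRITE),
                4096,                          // buffer size
                (short) 3,                     // replication
                fs.getDefaultBlockSize(wal),   // block size
                null)) {                       // no progress callback
              out.write("mutation bytes".getBytes("UTF-8"));
              out.hsync();
            }
          }
        }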



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by ctubbsii <gi...@git.apache.org>.
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85632341
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    +loss to occur.
    +
    +In order to avoid data loss in the event of reboot, `hsync` can be called.  This
    +will ensure data is written to disk on all datanodes before returning.  When
    +using `hsync` for the WAL, if Accumulo reports success to a user it means the
    +data is on disk.  However `hsync` is much slower than `hflush` and the way it's
    +implemented exacerbates the problem.  For example `hflush` make take 1ms and
    +`hsync` may take 50ms.  This difference will impact writes to Accumulo and can
    +be mitigated in some situations with larger buffers in Accumulo.
    +
    +HDFS keeps checksum data internally by default.  Datanodes store checksum data
    +in a separate file in the local filesystem.  This means when `hsync` is called
    +on a WAL, two files must be synced on each datanode.  Syncing two files doubles
    +the time. To make matters even worse, when the two files are synced the local
    +filesystem metadata is also synced.  Depending on the local filesystem and its
    +configuration, syncing the metadata may or may not take time.  In the worst
    +case, we need to wait for four sync operations at the local filesystem level on
    +each datanode. One thing I am not sure about, is if these sync operations occur
    +in parallel on the replicas on different datanodes.  Lets hope they occur in
    +parallel.  The following pointers show where sync occurs in the datanode code.
    +
    + * [BlockReceiver.flushOrSync()][fos] calls [ReplicaOutputStreams.syncDataOut()][ros1] and [ReplicaOutputStreams.syncChecksumOut()][ros2] when `isSync` is true.
    + * The methods in ReplicaOutputStreams call [FileChannel.force(true)][fcf] which
    +   synchronously flushes data and filesystem metadata.
    +
    +If files were preallocated (this would avoid syncing local filesystem metadata)
    +and checksums were stored in-line, then 1 sync could be done instead of 4.  
    +
    +## Configuring WAL flush/sync in Accumulo 1.6
    +
    +Accumulo 1.6.0 only supported `hsync` and this caused [performance
    +problems][160_RN_WAL].  In order to offer better performance, the option to
    +configure `hflush` was [added in 1.6.1][161_RN_WAL].  The
    +[tserver.wal.sync.method][16_UM_SM] configuration option was added to support
    +this feature.  This was a tablet server wide option that applied to everything
    +written to any table.   
    +
    +## Group Commit
    --- End diff --
    
    This section should be above the section for configuring 1.6, so the configuration section for 1.6 is closer to the one for 1.7+.



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner closed the pull request at:

    https://github.com/apache/accumulo/pull/176



[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    Merged these changes in commit e0447313ae503925c28b5164c27163009274f03f.



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r86040160
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    --- End diff --
    
    Used the wording `reboot or die` in ebf9bca.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85752895
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    --- End diff --
    
    I was thinking it would be nice to add another optional field: `reviewers`.   For Fluo we decided not to have an author field because the posts were a community effort. As is the case here, review has always made the posts so much better.  I am thinking a different way to solve this than Fluo did is to add a `reviewers` field.  Unless someone objects, I can add handling for that in the template in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85668445
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    +case when a Tablet Server dies, the recently written data is recovered from the
    +WAL.
    +
    +When data is written to Accumulo the following happens :
    +
    + * Client sends a batch of mutations to a tablet server
    + * Tablet server does the following :
    +   * Writes mutation to Tablet Servers WAL
    +   * Sync or flush WAL
    +   * Adds mutations to sorted in memory maps
    +   * Reports success back to client.
    +
    +The sync/flush step above moves data written to the WAL from memory to disk.
    +Write ahead logs are stored in HDFS. HDFS supports two ways of forcing data to
    +disk for an open file : `hsync` and `hflush`.  
    +
    +## HDFS Sync/Flush Details
    +
    +When `hflush` is called on a WAL, it does not ensure data is on disk.  It only
    +ensure that data is in OS buffers on each datanode and on its way to disk.  As a
    +result calls to `hflush` are very fast.  If a WAL is replicated to 3 data nodes
    +then data may be lost if all three machines reboot.  If the datanode process
    +dies, thats ok because it flushed to OS.  The machines have to reboot for data
    --- End diff --
    
    s/thats ok because it flushed to OS/data loss will not happen because the data was still sitting in OS I/O buffers to be written to disk/



[GitHub] accumulo pull request #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/176#discussion_r85742295
  
    --- Diff: _posts/blog/2016-10-28-durability-performance.md ---
    @@ -0,0 +1,136 @@
    +---
    +title: "Durability Performance Implications"
    +date: 2016-10-28 17:00:00 +0000
    +author: Keith Turner
    +---
    +
    +## Overview
    +
    +Accumulo stores recently written data in a sorted in memory map.  Before data is
    +added to this map, it's written to an unsorted WAL (write ahead log).  In the
    --- End diff --
    
    When I wrote that, I thought it looked screwy; now I know why.



[GitHub] accumulo issue #176: Wrote blog post about durability and performance

Posted by keith-turner <gi...@git.apache.org>.
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/176
  
    > May want to point out https://issues.apache.org/jira/browse/HDFS-3702 as an option starting with Hadoop 2.8, so that if the node goes down and can't be brought back up, you can still recover using the WAL.
    
    @dlmarion what you are suggesting seems like it would require a lot of context to explain.  I don't know how to write that context up.   I would be happy to add anything you write up, or you could write a follow-on blog post and I can review it.   I am thinking we can build up a body of useful info and ideas with blog posts, and posts can reference other posts to avoid repeating contextual info.
    
    For the improvements section I was also thinking of mentioning that another possible solution would be to use the multiple tservers per node you added AND a custom balancer that assigns tablets based on durability settings.  However, I decided not to add this information because I thought it would take too much to explain well.  This could also be another blog post. I was not really sure what to do with this idea... I couldn't even decide if it was a good idea or not.

