You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Kevin Sweeney <ke...@apache.org> on 2014/10/09 04:01:05 UTC

Review Request 26478: Add a flag to deduplicate storage snapshots

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs
-----

  build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerFactory.java 52b2e83efc0c289366b5246aff32553a546b1a70 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56562
-----------------------------------------------------------


Ping?  Any progress here?

- Bill Farner


On Oct. 9, 2014, 2:39 a.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2014, 2:39 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.

> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 133
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line133>
> >
> >     There is a remote possibility of numOutputTasks to be zero in a snapshot for an empty cluster.

hmm - there is, and that would result in NaN appearing in the log. Which is technically accurate


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56010
-----------------------------------------------------------


On Oct. 8, 2014, 7:39 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 8, 2014, 7:39 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.

> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/main/thrift/org/apache/aurora/gen/storage.thrift, line 205
> > <https://reviews.apache.org/r/26478/diff/2/?file=716383#file716383line205>
> >
> >     Please, document fields. What is taskConfigId here?

Documented all fields.


> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 35
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line35>
> >
> >     This javadoc would highly benefit from some details about the source of duplication and a proposed solution. It's not obvious for a newcomer why TaskConfigs are duplicated.

Added documentation elsewhere, happy to add more here if you think it's needed.


> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 126
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line126>
> >
> >     Why result field here?

refactored this code to be "less performant, more readable"


> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java, line 71
> > <https://reviews.apache.org/r/26478/diff/2/?file=716386#file716386line71>
> >
> >     How about a roundtrip test with no tasks in a snapshot?

Good idea, added null-checking.


> On Oct. 9, 2014, 10:30 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 155
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line155>
> >
> >     Inverse log message of a hydration ratio would be useful here along with a "Starting redupulication".

Added.


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56010
-----------------------------------------------------------


On Oct. 15, 2014, 12:17 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 12:17 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56010
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96362>

    This javadoc would highly benefit from some details about the source of duplication and a proposed solution. It's not obvious for a newcomer why TaskConfigs are duplicated.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96365>

    Why result field here?



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96360>

    There is a remote possibility of numOutputTasks to be zero in a snapshot for an empty cluster.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96363>

    Inverse log message of a hydration ratio would be useful here along with a "Starting redupulication".



src/main/thrift/org/apache/aurora/gen/storage.thrift
<https://reviews.apache.org/r/26478/#comment96364>

    Please, document fields. What is taskConfigId here?



src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java
<https://reviews.apache.org/r/26478/#comment96368>

    How about a roundtrip test with no tasks in a snapshot?


- Maxim Khutornenko


On Oct. 9, 2014, 2:39 a.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2014, 2:39 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Zameer Manji <zm...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review55943
-----------------------------------------------------------

Ship it!


Ship It!

- Zameer Manji


On Oct. 8, 2014, 7:39 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 8, 2014, 7:39 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Maxim Khutornenko <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56795
-----------------------------------------------------------

Ship it!


Ship It!

- Maxim Khutornenko


On Oct. 15, 2014, 7:52 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 7:52 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 5:05 p.m.)


Review request for Aurora, Maxim Khutornenko, Bill Farner, and Zameer Manji.


Changes
-------

Bill's feedback + fix markdown syntax


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs (updated)
-----

  config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
  docs/deploying-aurora-scheduler.md 20f5f389f55e800cc5e6638e62e25cfb6e2d72b4 
  docs/scheduler-storage.md PRE-CREATION 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56835
-----------------------------------------------------------

Ship it!


Awesome change, thanks a ton.


docs/scheduler-storage.md
<https://reviews.apache.org/r/26478/#comment97270>

    Can you link to this from docs/deploying-aurora-scheduler.md?  Probably just a section heading and a link is sufficient.  I also filed AURORA-839 to fill this new page out with more info.


- Bill Farner


On Oct. 15, 2014, 9:32 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 9:32 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 2:32 p.m.)


Review request for Aurora, Maxim Khutornenko, Bill Farner, and Zameer Manji.


Changes
-------

-David, who is on vacation


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs
-----

  config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
  docs/scheduler-storage.md PRE-CREATION 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 12:52 p.m.)


Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.


Changes
-------

Maxim's feedback


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs (updated)
-----

  config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
  docs/scheduler-storage.md PRE-CREATION 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 12:17 p.m.)


Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.


Changes
-------

Address Bill's feedback.


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs (updated)
-----

  config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
  docs/scheduler-storage.md PRE-CREATION 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.

> On Oct. 15, 2014, 11:54 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 56
> > <https://reviews.apache.org/r/26478/diff/3/?file=721240#file721240line56>
> >
> >     'reduplicate' doesn't sit well with me.  Perhaps 'normalize' and 'denormalize' are more standard terms that apply?  I don't feel too strongly, so don't change it if they seem equally good to you.

I am decidedly ambivalent about the name


> On Oct. 15, 2014, 11:54 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 87
> > <https://reviews.apache.org/r/26478/diff/3/?file=721240#file721240line87>
> >
> >     This line is not covered in tests.  Please address.
> >     
> >     However, i suggest you implement this as below, and inline.
> >     
> >         ScheduledTask partialScheduledTask = scheduledTask.deepCopy();
> >         partialScheduledTask.getAssignedTask().unsetTaskConfig();
> >         return partialScheduledTask;

Inlined.


> On Oct. 15, 2014, 11:54 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 110
> > <https://reviews.apache.org/r/26478/diff/2-3/?file=716380#file716380line110>
> >
> >     Please use a better variable name.

Inlined, so no variable to name.


> On Oct. 15, 2014, 11:54 a.m., Bill Farner wrote:
> > docs/scheduler-storage.md, line 13
> > <https://reviews.apache.org/r/26478/diff/3/?file=721234#file721234line13>
> >
> >     > Most users will want to enable both compression and deduplication.
> >     
> >     I suggest you yank this sentence out of this section, and add to the opening paragraph:
> >     
> >     > The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot performance: compression and deduplication.  Most users will want to enable both compression and deduplication.

good idea, added


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56762
-----------------------------------------------------------


On Oct. 14, 2014, 6:32 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 6:32 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review56762
-----------------------------------------------------------


Thanks for adding the documentation!  I'm a +1 once the copying code is less performant but more readable/concise.


src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment97167>

    Please use a better variable name.



docs/scheduler-storage.md
<https://reviews.apache.org/r/26478/#comment97161>

    > Most users will want to enable both compression and deduplication.
    
    I suggest you yank this sentence out of this section, and add to the opening paragraph:
    
    > The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot performance: compression and deduplication.  Most users will want to enable both compression and deduplication.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment97165>

    'reduplicate' doesn't sit well with me.  Perhaps 'normalize' and 'denormalize' are more standard terms that apply?  I don't feel too strongly, so don't change it if they seem equally good to you.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment97172>

    This line is not covered in tests.  Please address.
    
    However, i suggest you implement this as below, and inline.
    
        ScheduledTask partialScheduledTask = scheduledTask.deepCopy();
        partialScheduledTask.getAssignedTask().unsetTaskConfig();
        return partialScheduledTask;


- Bill Farner


On Oct. 15, 2014, 1:32 a.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 1:32 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 14, 2014, 6:32 p.m.)


Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.


Changes
-------

Documentation


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs (updated)
-----

  config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
  docs/scheduler-storage.md PRE-CREATION 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Bill Farner <wf...@apache.org>.

> On Oct. 9, 2014, 3:30 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 71
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line71>
> >
> >     Do you have numbers on how much time this routine saves when compared to a full deep copy and unsetting the field you're trying to clear?  Unless it's a significant contributor to overall snapshot performance, i'd prefer the more concise code of the latter approach.
> >     
> >     My hunch is that this one might save O(100 ms), but the ones below are noise and not worth the code.
> 
> Kevin Sweeney wrote:
>     I don't have data for this specific optimization - my gut is that we should avoid deepCopy on Snapshots due to them respresenting essentially the entire scheduler heap. Happy to remove if you think it's not warranted.

Yeah, i vote for remove to favor simpler debugging.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review55986
-----------------------------------------------------------


On Oct. 15, 2014, 1:32 a.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 1:32 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.

> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java, line 86
> > <https://reviews.apache.org/r/26478/diff/2/?file=716376#file716376line86>
> >
> >     Method interceptors should work fine for package-priviate methods [1].  Just for my understanding - are you going with protected because package-private class + protected method is slightly more restrictive?
> >     
> >     [1] https://github.com/google/guice/wiki/AOP#limitations

I'm going with protected because it's more indicative of why it's visible at all (I want guice's dynamic proxy subclass to override it and add timing information). It's also more restrictive than package-private in this context.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 61
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line61>
> >
> >     Tasks.SCHEDULED_TO_INFO

This operates on the mutable structures, SCHEDULED_TO_INFO operates on the immutable ones.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 71
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line71>
> >
> >     Do you have numbers on how much time this routine saves when compared to a full deep copy and unsetting the field you're trying to clear?  Unless it's a significant contributor to overall snapshot performance, i'd prefer the more concise code of the latter approach.
> >     
> >     My hunch is that this one might save O(100 ms), but the ones below are noise and not worth the code.

I don't have data for this specific optimization - my gut is that we should avoid deepCopy on Snapshots due to them respresenting essentially the entire scheduler heap. Happy to remove if you think it's not warranted.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java, line 84
> > <https://reviews.apache.org/r/26478/diff/2/?file=716386#file716386line84>
> >
> >     The assertion error will be much more useful if you do `assertEquals(emptySet, b)` rather than `assertEquals(n, b.size())`.

It's actually `null` here, but `.size()` in thrift reports `0` for `null`. Changed to explicit `assertNull`.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java, line 89
> > <https://reviews.apache.org/r/26478/diff/2/?file=716386#file716386line89>
> >
> >     Isn't this check redundant to the one below?

not quite, since the latter is making a Set out of a List, and therefore potentially doing deduplication for me.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java, line 98
> > <https://reviews.apache.org/r/26478/diff/2/?file=716386#file716386line98>
> >
> >     Can you try for `assertEquals` here as well, with an expected list of `DeduplicatedScheduledTask` objects?  It will catch classes of bugs missed with this check (e.g. extra entries in the `partialTasks` list), and make it easier to diagnose a failed test.

Since the input is a Snapshot (Set<ScheduledTask>) it's not possible to know the expected order of the output list ahead-of-time. Added an explicit check against extra entries.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java, line 80
> > <https://reviews.apache.org/r/26478/diff/2/?file=716379#file716379line80>
> >
> >     First sentence contains mixed thoughts for an odd sentence, rewrite?

Rewrote.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java, line 81
> > <https://reviews.apache.org/r/26478/diff/2/?file=716379#file716379line81>
> >
> >     Please make this more verbose to explain what exactly is backwards in compatible.  Don't worry about being too wordy.
> >     
> >     I'd also like to see a TODO to remove this flag, since it should become the default.  Perhaps this should be included in the 0.7.0 deprecations list.

Moved to separate doc.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 110
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line110>
> >
> >     Inline this on :114.

done


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 116
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line116>
> >
> >     It would be nice to see a brief comment here to give the reader an overview of what's going on in the routine.

Added.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java, line 157
> > <https://reviews.apache.org/r/26478/diff/2/?file=716380#file716380line157>
> >
> >     remove newline

removed


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/thrift/org/apache/aurora/gen/storage.thrift, line 202
> > <https://reviews.apache.org/r/26478/diff/2/?file=716383#file716383line202>
> >
> >     The mention of taskConfigId is vague.  There should be a doc somewhere indicating that `taskConfigId` is referencing the _index_ of an entry in `taskConfigs`.

Made more explicit


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/thrift/org/apache/aurora/gen/storage.thrift, line 215
> > <https://reviews.apache.org/r/26478/diff/2/?file=716383#file716383line215>
> >
> >     remove newline

Fixed.


> On Oct. 9, 2014, 8:30 a.m., Bill Farner wrote:
> > src/main/thrift/org/apache/aurora/gen/storage.thrift, line 237
> > <https://reviews.apache.org/r/26478/diff/2/?file=716383#file716383line237>
> >
> >     Please doc.

Done


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review55986
-----------------------------------------------------------


On Oct. 14, 2014, 6:32 p.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 6:32 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   config/legacy_untested_classes.txt 3af99867eb25a7e44bb3520e82b1def125bd6e15 
>   docs/scheduler-storage.md PRE-CREATION 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/#review55986
-----------------------------------------------------------



src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java
<https://reviews.apache.org/r/26478/#comment96339>

    Method interceptors should work fine for package-priviate methods [1].  Just for my understanding - are you going with protected because package-private class + protected method is slightly more restrictive?
    
    [1] https://github.com/google/guice/wiki/AOP#limitations



src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java
<https://reviews.apache.org/r/26478/#comment96331>

    First sentence contains mixed thoughts for an odd sentence, rewrite?



src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java
<https://reviews.apache.org/r/26478/#comment96332>

    Please make this more verbose to explain what exactly is backwards in compatible.  Don't worry about being too wordy.
    
    I'd also like to see a TODO to remove this flag, since it should become the default.  Perhaps this should be included in the 0.7.0 deprecations list.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96334>

    Tasks.SCHEDULED_TO_INFO



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96335>

    Do you have numbers on how much time this routine saves when compared to a full deep copy and unsetting the field you're trying to clear?  Unless it's a significant contributor to overall snapshot performance, i'd prefer the more concise code of the latter approach.
    
    My hunch is that this one might save O(100 ms), but the ones below are noise and not worth the code.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96336>

    Inline this on :114.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96337>

    It would be nice to see a brief comment here to give the reader an overview of what's going on in the routine.



src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java
<https://reviews.apache.org/r/26478/#comment96338>

    remove newline



src/main/thrift/org/apache/aurora/gen/storage.thrift
<https://reviews.apache.org/r/26478/#comment96340>

    The mention of taskConfigId is vague.  There should be a doc somewhere indicating that `taskConfigId` is referencing the _index_ of an entry in `taskConfigs`.



src/main/thrift/org/apache/aurora/gen/storage.thrift
<https://reviews.apache.org/r/26478/#comment96341>

    remove newline



src/main/thrift/org/apache/aurora/gen/storage.thrift
<https://reviews.apache.org/r/26478/#comment96342>

    Please doc.



src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java
<https://reviews.apache.org/r/26478/#comment96343>

    The assertion error will be much more useful if you do `assertEquals(emptySet, b)` rather than `assertEquals(n, b.size())`.



src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java
<https://reviews.apache.org/r/26478/#comment96344>

    Isn't this check redundant to the one below?



src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java
<https://reviews.apache.org/r/26478/#comment96345>

    Can you try for `assertEquals` here as well, with an expected list of `DeduplicatedScheduledTask` objects?  It will catch classes of bugs missed with this check (e.g. extra entries in the `partialTasks` list), and make it easier to diagnose a failed test.


- Bill Farner


On Oct. 9, 2014, 2:39 a.m., Kevin Sweeney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26478/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2014, 2:39 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-722
>     https://issues.apache.org/jira/browse/AURORA-722
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.
> 
> This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).
> 
> This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.
> 
> 
> Diffs
> -----
> 
>   build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
>   src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
>   src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
>   src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
>   src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
>   src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
>   src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
>   src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/26478/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Kevin Sweeney
> 
>


Re: Review Request 26478: Add a flag to deduplicate storage snapshots

Posted by Kevin Sweeney <ke...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26478/
-----------------------------------------------------------

(Updated Oct. 8, 2014, 7:39 p.m.)


Review request for Aurora, David McLaughlin, Bill Farner, and Zameer Manji.


Changes
-------

Reduce visibility of some classes


Bugs: AURORA-722
    https://issues.apache.org/jira/browse/AURORA-722


Repository: aurora


Description
-------

Add a new format for deduplicated storage snapshots. Microbenchmarks show a 10x deduplication ratio on Twitter's production snapshots.

This format is backwards-incompatible, so this patch introduces a flag to control its use (defaulting off).

This only changes the format used to write to the replicated log (where time is of the essence since all writes are done holding the global storage lock) - the format of backups written to disk is unchanged, as backups don't hold the lock.


Diffs (updated)
-----

  build.gradle 2e1bf78d7797f17afd51a94a22eff80e00aba464 
  src/main/java/org/apache/aurora/codec/ThriftBinaryCodec.java 65e986eaa2c4193431ca048425a1ed3ab60f5882 
  src/main/java/org/apache/aurora/scheduler/storage/log/EntrySerializer.java 7239a6a5eb5479e395e16423c83fdf80a77e5a83 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogManager.java 4b50e2069407dc263b4fc93f1827d3a8836253bf 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java f806297d1d0700155c976743f936b2b8a3a390fb 
  src/main/java/org/apache/aurora/scheduler/storage/log/LogStorageModule.java 769348e6b8a5c701734afff391b1c77de35222c6 
  src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicator.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManager.java 22db80eaf34fe736fa5a3a9289836c9ac9e59906 
  src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java e5cfbf5cf43bf5bbc38c42fe685a7e9f0d03af2a 
  src/main/thrift/org/apache/aurora/gen/storage.thrift 5350ec945fbe028ee4641683815a068ce00b5efc 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogManagerTest.java 39729b374fe4e383f9b5ada7d016923766df9af7 
  src/test/java/org/apache/aurora/scheduler/storage/log/LogStorageTest.java 7a8c3b882633376a1bf6a78616d55aaa7401d13f 
  src/test/java/org/apache/aurora/scheduler/storage/log/SnapshotDeduplicatorImplTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/26478/diff/


Testing
-------

./gradlew -Pq build


Thanks,

Kevin Sweeney