You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Chris Riccomini <cr...@apache.org> on 2014/04/29 00:52:38 UTC

Review Request 20811: SAMZA-254

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/
-----------------------------------------------------------

Review request for samza.


Repository: samza


Description
-------

add javadocs, and reset deletion counter in compact


make delete threshold configurable. add a performance test (takes 25s to run).


make compaction lazy on read-side so we can take advantage of cached writes


trigger compactions periodically to remove deleted keys from levels


Diffs
-----

  samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
  samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 

Diff: https://reviews.apache.org/r/20811/diff/


Testing
-------


Thanks,

Chris Riccomini


Re: Review Request 20811: SAMZA-254

Posted by Jakob Homan <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41779
-----------------------------------------------------------



samza-test/src/main/resources/perf/kv-perf.properties
<https://reviews.apache.org/r/20811/#comment75372>

    ASF Header



samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala
<https://reviews.apache.org/r/20811/#comment75371>

    ASF Header.


- Jakob Homan


On April 29, 2014, 2:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 29, 2014, 2:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> updating perf test to use delete threshold setting
> 
> 
> disable this feature by default
> 
> 
> default the delete compaction threshold to cache size
> 
> 
> move performance test into samza-test
> 
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
>   build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
>   samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
>   samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41822
-----------------------------------------------------------

Ship it!


Looks good!


README.md
<https://reviews.apache.org/r/20811/#comment75451>

    Suggestion: the Gradle task could default to this path, so that it's easier to run. (It can still accept an optional -PconfigPath parameter to override the default in case you want to run something other than the default performance test.)


- Martin Kleppmann


On April 29, 2014, 10:37 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 29, 2014, 10:37 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add license header to new files
> 
> 
> updating perf test to use delete threshold setting
> 
> 
> disable this feature by default
> 
> 
> default the delete compaction threshold to cache size
> 
> 
> move performance test into samza-test
> 
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
>   build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
>   samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
>   samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/
-----------------------------------------------------------

(Updated April 29, 2014, 10:37 p.m.)


Review request for samza.


Repository: samza


Description (updated)
-------

add license header to new files


updating perf test to use delete threshold setting


disable this feature by default


default the delete compaction threshold to cache size


move performance test into samza-test


add javadocs, and reset deletion counter in compact


make delete threshold configurable. add a performance test (takes 25s to run).


make compaction lazy on read-side so we can take advantage of cached writes


trigger compactions periodically to remove deleted keys from levels


Diffs (updated)
-----

  README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
  build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
  samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
  samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
  samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 

Diff: https://reviews.apache.org/r/20811/diff/


Testing
-------


Thanks,

Chris Riccomini


Re: Review Request 20811: SAMZA-254

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/
-----------------------------------------------------------

(Updated April 29, 2014, 9:52 p.m.)


Review request for samza.


Repository: samza


Description (updated)
-------

updating perf test to use delete threshold setting


disable this feature by default


default the delete compaction threshold to cache size


move performance test into samza-test


add javadocs, and reset deletion counter in compact


make delete threshold configurable. add a performance test (takes 25s to run).


make compaction lazy on read-side so we can take advantage of cached writes


trigger compactions periodically to remove deleted keys from levels


Diffs (updated)
-----

  README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
  build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
  samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
  samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
  samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 

Diff: https://reviews.apache.org/r/20811/diff/


Testing
-------


Thanks,

Chris Riccomini


Re: Review Request 20811: SAMZA-254

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/
-----------------------------------------------------------

(Updated April 29, 2014, 9:17 p.m.)


Review request for samza.


Repository: samza


Description (updated)
-------

default the delete compaction threshold to cache size


move performance test into samza-test


add javadocs, and reset deletion counter in compact


make delete threshold configurable. add a performance test (takes 25s to run).


make compaction lazy on read-side so we can take advantage of cached writes


trigger compactions periodically to remove deleted keys from levels


Diffs (updated)
-----

  README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
  build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
  samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
  samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
  samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
  samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 

Diff: https://reviews.apache.org/r/20811/diff/


Testing
-------


Thanks,

Chris Riccomini


Re: Review Request 20811: SAMZA-254

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41732
-----------------------------------------------------------



samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala
<https://reviews.apache.org/r/20811/#comment75320>

    Yea, was thinking about this this morning, actually. I think it should go in samza-test/src/main, and have a little gradle task to run it (similar to with Martin did with SAMZA-180).


- Chris Riccomini


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Martin Kleppmann <mk...@linkedin.com>.

> On April 29, 2014, 5:52 p.m., Jakob Homan wrote:
> > samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala, line 289
> > <https://reviews.apache.org/r/20811/diff/1/?file=569851#file569851line289>
> >
> >     This really isn't a test, as it's not possible for it to fail.  It's more of a benchmark.  Perhaps move it to a separate main-ed class that can be run from the command line for manual checking? Either that or add some you-shall-not-pass value that will fail the test if performance dips too low, but that's fraught with per-machine variation.

+1. I would love there to be a separate performance test suite which we can run regularly, but it needs to be separate from the unit/functional/integration test suite.

Lucene does this very well: http://people.apache.org/~mikemccand/lucenebench/ -- they have created a standardised set of performance tests, have run them every day for over 3 years, and graphed the results. It's awesome to see performance work quantified like that. You make what you measure :)


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41731
-----------------------------------------------------------


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Jakob Homan <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41731
-----------------------------------------------------------



samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala
<https://reviews.apache.org/r/20811/#comment75318>

    This really isn't a test, as it's not possible for it to fail.  It's more of a benchmark.  Perhaps move it to a separate main-ed class that can be run from the command line for manual checking? Either that or add some you-shall-not-pass value that will fail the test if performance dips too low, but that's fraught with per-machine variation.


- Jakob Homan


On April 28, 2014, 3:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 3:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Martin Kleppmann <mk...@linkedin.com>.

> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala, line 47
> > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47>
> >
> >     Gut feeling (not backed by any data) is that a threshold of 1000 might be quite low for a default. If there is a lot of data in the store, the compaction itself may start taking a long time.
> >     
> >     Rather than an absolute number, how about setting the threshold in terms of the proportion of keys in the keys? e.g. perform a compaction if more than (say) 20% of the keys in the store have been deleted? That way, if the store is big (=compaction is expensive), the threshold is automatically higher.
> >     
> >     (For purposes of that calculation I think it would be fine to simply count number of put requests and number of delete requests -- no need to track unique keys.)

s/keys in the keys/keys in the store/


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Martin Kleppmann <mk...@linkedin.com>.

> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala, line 47
> > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47>
> >
> >     Gut feeling (not backed by any data) is that a threshold of 1000 might be quite low for a default. If there is a lot of data in the store, the compaction itself may start taking a long time.
> >     
> >     Rather than an absolute number, how about setting the threshold in terms of the proportion of keys in the keys? e.g. perform a compaction if more than (say) 20% of the keys in the store have been deleted? That way, if the store is big (=compaction is expensive), the threshold is automatically higher.
> >     
> >     (For purposes of that calculation I think it would be fine to simply count number of put requests and number of delete requests -- no need to track unique keys.)
> 
> Martin Kleppmann wrote:
>     s/keys in the keys/keys in the store/
> 
> Chris Riccomini wrote:
>     I can think of two ways in which this approach breaks:
>     
>     1. People update the same key multiple times, and the put counter is improperly incremented.
>     2. People delete a key that doesn't exist, and the delete counter is improperly incremented.
>     
>     (2) doesn't bother me much, since it's kind of erroneous, but (1) concerns me. Using the counting ratio, as you've described, would not work well in cases where people are over-writing existing keys. This is actually a pretty common use case, I think. In such a case, we would over-count puts, and would therefore drop our percentage, since the puts are in the denominator (deletes / puts = %). Assuming the over-counting is pretty predictable, developers could just drop their threshold from 20% to 10% (for example).
>     
>     Another thing to consider is how easy this config is to use and describe. I think it's a bit more confusing to have a % based threshold, especially if it's not accurate, and leads to weird issues like (1) and (2), above.
>     
>     Regarding 1000 being too low, in cases where the cache is used (and I've never seen it not be used), the threshold is somewhat misleading, since the cache will cache deletes, which prevents the delete counter from incrementing in the LevelDB store. This means you can have over 1000 deletes, but still not trigger a compaction. For that reason, I'm going to raise the threshold to be equal to the cache size by default. This will ensure that if you execute >= deleteThreshold deletes, then call get(), your deletes will be flushed to the LevelDB store.

Ok, makes sense. Also, it's good that the compaction doesn't happen on the delete itself, but on the subsequent get. So if you have a million keys in the store, and delete half of them in one go, it will still only get compacted once, no matter what the threshold is set to. So that sounds fine to me.


- Martin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------


On April 29, 2014, 10:37 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 29, 2014, 10:37 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add license header to new files
> 
> 
> updating perf test to use delete threshold setting
> 
> 
> disable this feature by default
> 
> 
> default the delete compaction threshold to cache size
> 
> 
> move performance test into samza-test
> 
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   README.md a930cf020c46a616f4c7319881dbff3043b0fd49 
>   build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
>   samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION 
>   samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Chris Riccomini <cr...@apache.org>.

> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala, line 47
> > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47>
> >
> >     Gut feeling (not backed by any data) is that a threshold of 1000 might be quite low for a default. If there is a lot of data in the store, the compaction itself may start taking a long time.
> >     
> >     Rather than an absolute number, how about setting the threshold in terms of the proportion of keys in the keys? e.g. perform a compaction if more than (say) 20% of the keys in the store have been deleted? That way, if the store is big (=compaction is expensive), the threshold is automatically higher.
> >     
> >     (For purposes of that calculation I think it would be fine to simply count number of put requests and number of delete requests -- no need to track unique keys.)
> 
> Martin Kleppmann wrote:
>     s/keys in the keys/keys in the store/

I can think of two ways in which this approach breaks:

1. People update the same key multiple times, and the put counter is improperly incremented.
2. People delete a key that doesn't exist, and the delete counter is improperly incremented.

(2) doesn't bother me much, since it's kind of erroneous, but (1) concerns me. Using the counting ratio, as you've described, would not work well in cases where people are over-writing existing keys. This is actually a pretty common use case, I think. In such a case, we would over-count puts, and would therefore drop our percentage, since the puts are in the denominator (deletes / puts = %). Assuming the over-counting is pretty predictable, developers could just drop their threshold from 20% to 10% (for example).

Another thing to consider is how easy this config is to use and describe. I think it's a bit more confusing to have a % based threshold, especially if it's not accurate, and leads to weird issues like (1) and (2), above.

Regarding 1000 being too low, in cases where the cache is used (and I've never seen it not be used), the threshold is somewhat misleading, since the cache will cache deletes, which prevents the delete counter from incrementing in the LevelDB store. This means you can have over 1000 deletes, but still not trigger a compaction. For that reason, I'm going to raise the threshold to be equal to the cache size by default. This will ensure that if you execute >= deleteThreshold deletes, then call get(), your deletes will be flushed to the LevelDB store.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>


Re: Review Request 20811: SAMZA-254

Posted by Martin Kleppmann <mk...@linkedin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------



samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala
<https://reviews.apache.org/r/20811/#comment75344>

    Gut feeling (not backed by any data) is that a threshold of 1000 might be quite low for a default. If there is a lot of data in the store, the compaction itself may start taking a long time.
    
    Rather than an absolute number, how about setting the threshold in terms of the proportion of keys in the keys? e.g. perform a compaction if more than (say) 20% of the keys in the store have been deleted? That way, if the store is big (=compaction is expensive), the threshold is automatically higher.
    
    (For purposes of that calculation I think it would be fine to simply count number of put requests and number of delete requests -- no need to track unique keys.)


- Martin Kleppmann


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>