Posted to commits@cassandra.apache.org by "Chris Lohfink (JIRA)" <ji...@apache.org> on 2014/05/16 13:18:57 UTC

[jira] [Comment Edited] (CASSANDRA-7247) Provide top ten most frequent keys per column family

    [ https://issues.apache.org/jira/browse/CASSANDRA-7247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999621#comment-13999621 ] 

Chris Lohfink edited comment on CASSANDRA-7247 at 5/16/14 5:53 AM:
-------------------------------------------------------------------

The problem is that StreamSummary is not thread safe.  There is a ConcurrentStreamSummary, but in this implementation I found it to be ~5x slower than a synchronized block around the offer call of the non-thread-safe one.  The concurrent version did perform similarly when it was also wrapped in a synchronized block, which I show below, but since serializing access removes any benefit of a concurrent implementation, I think the faster implementation is the better choice.
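To illustrate the approach, here is a minimal sketch of serializing access to a non-thread-safe counter with a synchronized block. UnsafeCounter and SynchronizedCounter are hypothetical names used only for illustration; UnsafeCounter is a simple stand-in for a structure like StreamSummary, not the code in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a non-thread-safe frequency summary.
final class UnsafeCounter<T> {
    private final Map<T, Long> counts = new HashMap<>();

    // Not safe to call from several threads at once.
    void offer(T item) {
        counts.merge(item, 1L, Long::sum);
    }

    long count(T item) {
        return counts.getOrDefault(item, 0L);
    }
}

// Wraps the unsafe counter and serializes every access through one lock.
final class SynchronizedCounter<T> {
    private final UnsafeCounter<T> delegate = new UnsafeCounter<>();

    void offer(T item) {
        synchronized (delegate) {
            delegate.offer(item);
        }
    }

    long count(T item) {
        synchronized (delegate) {
            return delegate.count(item);
        }
    }
}
```

Every caller takes the same lock, so the wrapped structure never sees concurrent access. The trade-off is that all threads contend on that one lock.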

Run on a 2013 Retina MacBook Pro with a 500 GB SSD, against trunk:

{code:title=No Changes}
            id, ops       ,    op/s,   key/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr
 4 threadCount, 634450    ,   21692,   21692,     0.2,     0.2,     0.2,     0.2,     0.4,   740.1,   29.2,  0.01188
 8 threadCount, 886600    ,   29762,   29762,     0.3,     0.2,     0.3,     0.4,     1.3,  1007.3,   29.8,  0.01220
16 threadCount, 912050    ,   29035,   29035,     0.5,     0.3,     0.9,     2.5,    11.2,  1393.8,   31.4,  0.01162
24 threadCount, 1022250   ,   32681,   32681,     0.7,     0.5,     1.0,     2.9,    13.5,  1126.5,   31.3,  0.00923
36 threadCount, 946550    ,   30900,   30900,     1.2,     0.8,     1.4,     3.0,    22.5,  1369.2,   30.6,  0.01089
{code}

{code:title=With Patch}
            id, ops       ,    op/s,   key/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr
 4 threadCount, 643900    ,   21700,   21700,     0.2,     0.2,     0.2,     0.2,     0.9,   941.1,   29.7,  0.01079
 8 threadCount, 942100    ,   32300,   32300,     0.2,     0.2,     0.3,     0.3,     1.2,   849.5,   29.2,  0.01519
16 threadCount, 907400    ,   30650,   30650,     0.5,     0.3,     0.8,     1.9,    10.7,  1124.0,   29.6,  0.01112
24 threadCount, 1026150   ,   31753,   31753,     0.7,     0.5,     0.9,     3.3,    20.6,  1299.0,   32.3,  0.01295
36 threadCount, 980600    ,   30077,   30077,     1.2,     0.8,     1.3,     2.7,    24.9,  1394.3,   32.6,  0.01747
{code}

{code:title=ConcurrentStreamSummary with sync}
            id, ops       ,    op/s,   key/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr
 4 threadCount, 494350    ,   16643,   16643,     0.2,     0.2,     0.3,     0.3,     1.0,   943.6,   29.7,  0.01286
 8 threadCount, 812950    ,   26358,   26358,     0.3,     0.2,     0.3,     0.5,     1.4,  1488.9,   30.8,  0.01909
16 threadCount, 877500    ,   27396,   27396,     0.6,     0.3,     1.0,     2.2,    12.1,  1299.2,   32.0,  0.01824
24 threadCount, 837550    ,   25345,   25345,     0.9,     0.4,     1.2,     3.7,    84.2,  2123.6,   33.0,  0.02437
36 threadCount, 910200    ,   28008,   28008,     1.3,     0.6,     2.8,     9.2,    32.2,  1212.8,   32.5,  0.01654
{code}


> Provide top ten most frequent keys per column family
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7247
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7247
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Chris Lohfink
>            Priority: Minor
>         Attachments: patch.diff
>
>
> Since we already have the nice addthis stream library, we can use it to keep track of the most frequent DecoratedKeys that come through the system using StreamSummaries ([nice explanation|http://boundary.com/blog/2013/05/14/approximate-heavy-hitters-the-spacesaving-algorithm/]).  Then provide a new metric to access them via JMX.
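
For readers unfamiliar with the SpaceSaving algorithm referenced in the description, a minimal sketch of the idea follows. SpaceSavingSketch is a hypothetical name for illustration only; it is not the stream library's StreamSummary and it is not thread safe. It finds the minimum with a linear scan for brevity, whereas real implementations typically use a structure that avoids the scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal SpaceSaving sketch: tracks at most `capacity` items with approximate counts.
final class SpaceSavingSketch<T> {
    private final int capacity;
    private final Map<T, Long> counts = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    void offer(T item) {
        Long current = counts.get(item);
        if (current != null) {
            // Already monitored: increment its counter.
            counts.put(item, current + 1);
        } else if (counts.size() < capacity) {
            // Free slot: start monitoring with a count of 1.
            counts.put(item, 1L);
        } else {
            // Full: evict the item with the smallest count; the newcomer
            // inherits that count plus one (an overestimate by design).
            T minKey = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<T, Long> e : counts.entrySet()) {
                if (e.getValue() < min) {
                    min = e.getValue();
                    minKey = e.getKey();
                }
            }
            counts.remove(minKey);
            counts.put(item, min + 1);
        }
    }

    // Returns up to k monitored items, highest approximate count first.
    List<T> topK(int k) {
        List<Map.Entry<T, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<T> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Memory stays bounded by the capacity regardless of how many distinct keys are seen. In exchange, the counts of recently admitted items can be overestimated.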


