Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/06/03 07:41:54 UTC

[jira] Created: (CASSANDRA-1155) keep persistent row statistics

keep persistent row statistics
------------------------------

                 Key: CASSANDRA-1155
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
             Project: Cassandra
          Issue Type: Sub-task
          Components: Core
            Reporter: Jonathan Ellis
             Fix For: 0.7


during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
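
For readers unfamiliar with the idea, the following is a minimal sketch of a bucketed histogram of the kind described above. It is not Cassandra's EstimatedHistogram class: the class name SketchHistogram, the 1.2 growth factor, and the method names are assumptions made for illustration. Values are counted into buckets whose upper bounds grow geometrically, so max and mean are estimates rounded up to a bucket bound.

```java
import java.util.Arrays;

/** Illustrative sketch only; this is not Cassandra's EstimatedHistogram class. */
final class SketchHistogram {
    private final long[] offsets; // inclusive upper bound of each bucket
    private final long[] counts;  // one slot per bucket, plus a final overflow slot

    SketchHistogram(int bucketCount) {
        if (bucketCount < 1) throw new IllegalArgumentException("need at least one bucket");
        offsets = new long[bucketCount];
        long last = 1;
        offsets[0] = last;
        for (int i = 1; i < bucketCount; i++) {
            // Grow each bound by roughly 20% (an assumed factor), and always by at least 1.
            long next = Math.max(last + 1, Math.round(last * 1.2));
            offsets[i] = next;
            last = next;
        }
        counts = new long[bucketCount + 1];
    }

    /** Count one observation, e.g. a row size in bytes or a column count. */
    void add(long value) {
        int idx = Arrays.binarySearch(offsets, value);
        if (idx < 0) idx = -idx - 1; // insertion point: first bucket whose bound is >= value
        counts[idx]++;               // idx == offsets.length means the overflow slot
    }

    long count() {
        long total = 0;
        for (long c : counts) total += c;
        return total;
    }

    /** Upper-bound estimate of the largest value seen; 0 if nothing was recorded. */
    long max() {
        if (counts[offsets.length] > 0) return Long.MAX_VALUE;
        for (int i = offsets.length - 1; i >= 0; i--)
            if (counts[i] > 0) return offsets[i];
        return 0;
    }

    /** Mean estimate, treating every value as if it sat at its bucket's upper bound. */
    double mean() {
        if (counts[offsets.length] > 0) throw new IllegalStateException("overflow bucket is in use");
        long total = 0;
        long weighted = 0;
        for (int i = 0; i < offsets.length; i++) {
            total += counts[i];
            weighted += counts[i] * offsets[i];
        }
        return total == 0 ? 0.0 : (double) weighted / total;
    }
}
```

With 30 buckets the first bounds are 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, so a value of 9 is reported as 10 by max(). That loss of precision is the price of a fixed-size structure that is cheap to persist.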

having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.
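
One possible reading of that estimate, as a sketch: if the mean column count of an index row approximates the number of matching rows each node holds, the number of nodes to query is the requested row count divided by that mean, rounded up. The class IndexQueryEstimate, the method nodesToQuery, and the fallback of querying every node when no statistics exist are all assumptions for illustration, not anything taken from a patch on this issue.

```java
/** Illustrative sketch only; names and fallback behaviour are assumptions. */
final class IndexQueryEstimate {
    private IndexQueryEstimate() {}

    /**
     * Estimate how many nodes to contact to collect rowsRequested matches,
     * given the mean number of matches a single node is expected to hold.
     */
    static int nodesToQuery(int rowsRequested, double meanMatchesPerNode, int totalNodes) {
        if (totalNodes < 1) throw new IllegalArgumentException("totalNodes must be >= 1");
        if (meanMatchesPerNode <= 0) return totalNodes; // no usable statistics: ask everyone
        long needed = (long) Math.ceil(rowsRequested / meanMatchesPerNode);
        // Always contact at least one node, and never more than exist.
        return (int) Math.min(totalNodes, Math.max(1L, needed));
    }
}
```

For example, a request for 100 rows with about 30 matches expected per node works out to 4 nodes under these assumptions.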

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-1155) keep persistent row statistics

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888431#action_12888431 ] 

Brandon Williams commented on CASSANDRA-1155:
---------------------------------------------

As discussed on IRC, we'll persist these to the system keyspace so they can be accessed without JMX.

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Updated: (CASSANDRA-1155) keep persistent row statistics

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-1155:
----------------------------------------

    Attachment: 1155-v5.txt

v5 moves the stats write after the descriptor is renamed, because persistSSTableStatistics takes care not to store stats for temporary files.  Also adds StatisticsTable back which was accidentally removed in v4, fixes some bugs in EH/CFS min/mean/max calculation, and has SSTW pass its EHs directly to SSTR when calling internalOpen.  Adds a test for SSTR providing stats, and more tests for EH.

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7 beta 1
>
>         Attachments: 1155-v2.txt, 1155-v3.txt, 1155-v4.txt, 1155-v5.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Commented: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874963#action_12874963 ] 

Jonathan Ellis commented on CASSANDRA-1155:
-------------------------------------------

this metadata needs to be persisted (as a Meta file to go with Data, Index, and Filter), too.

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>             Fix For: 0.7
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Commented: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888397#action_12888397 ] 

Jonathan Ellis commented on CASSANDRA-1155:
-------------------------------------------

(or better, -Statistics instead of -Meta)

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Updated: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1155:
--------------------------------------

    Attachment: 1155-v2.txt

v2 does some minor cleanup.

I started replacing CFS min/max w/ aggregated data from histograms (min/max is not persistent, so the latter is more useful) but the histograms are part of Writer rather than base SSTable.  They need to be available to the Reader to be useful for CASSANDRA-749.  (Although possibly loading them as a simple long[] in Reader would be better, to avoid unnecessary synchronization overhead on a read-only structure.)
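
A rough sketch of the long[] idea, under stated assumptions: the reader holds an immutable copy of the bucket bounds and counts, so no locking is needed, and the CFS-level min/max are derived by aggregating over the per-sstable snapshots. The class ReaderStatsSketch and its methods are hypothetical and are not the SSTableReader API.

```java
import java.util.List;

/** Illustrative sketch only; a read-only view of one sstable's histogram. */
final class ReaderStatsSketch {
    private final long[] offsets; // inclusive upper bound of each bucket
    private final long[] counts;  // one slot per bucket, plus a final overflow slot

    ReaderStatsSketch(long[] offsets, long[] counts) {
        if (counts.length != offsets.length + 1)
            throw new IllegalArgumentException("counts must have one more slot than offsets");
        // Defensive copies: nothing mutates these afterwards, so reads need no synchronization.
        this.offsets = offsets.clone();
        this.counts = counts.clone();
    }

    boolean isEmpty() {
        for (long c : counts) if (c > 0) return false;
        return true;
    }

    /** Lower-bound estimate of the smallest value recorded; 0 if empty. */
    long min() {
        for (int i = 0; i < counts.length; i++)
            if (counts[i] > 0) return i == 0 ? 0 : offsets[i - 1] + 1;
        return 0;
    }

    /** Upper-bound estimate of the largest value recorded; 0 if empty. */
    long max() {
        if (counts[offsets.length] > 0) return Long.MAX_VALUE;
        for (int i = offsets.length - 1; i >= 0; i--)
            if (counts[i] > 0) return offsets[i];
        return 0;
    }

    /** Column-family-level minimum: the smallest per-sstable minimum, skipping empty sstables. */
    static long minAcross(List<ReaderStatsSketch> sstables) {
        long result = Long.MAX_VALUE;
        for (ReaderStatsSketch s : sstables)
            if (!s.isEmpty()) result = Math.min(result, s.min());
        return result == Long.MAX_VALUE ? 0 : result;
    }

    /** Column-family-level maximum: the largest per-sstable maximum. */
    static long maxAcross(List<ReaderStatsSketch> sstables) {
        long result = 0;
        for (ReaderStatsSketch s : sstables) result = Math.max(result, s.max());
        return result;
    }
}
```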


> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>         Attachments: 1155-v2.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Commented: (CASSANDRA-1155) keep persistent row statistics

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892766#action_12892766 ] 

Hudson commented on CASSANDRA-1155:
-----------------------------------

Integrated in Cassandra #501 (See [http://hudson.zones.apache.org/hudson/job/Cassandra/501/])
    

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7 beta 1
>
>         Attachments: 1155-v2.txt, 1155-v3.txt, 1155-v4.txt, 1155-v5.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Updated: (CASSANDRA-1155) keep persistent row statistics

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-1155:
----------------------------------------

    Attachment: 1155.txt

Patch to store persistent statistics in the system keyspace.  Since these need to be per-sstable, I left min/max/mean in CFS as they were.

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>         Attachments: 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Assigned: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-1155:
-----------------------------------------

    Assignee: Brandon Williams

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Commented: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891727#action_12891727 ] 

Jonathan Ellis commented on CASSANDRA-1155:
-------------------------------------------

ship it!  (and edit CHANGES please)

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7 beta 1
>
>         Attachments: 1155-v2.txt, 1155-v3.txt, 1155-v4.txt, 1155-v5.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Updated: (CASSANDRA-1155) keep persistent row statistics

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-1155:
----------------------------------------

    Attachment: 1155-v3.txt

v3 builds on v2, finishing the TODOs in CFS and loading the persistent statistics on SSTRs when opening an existing SST.  This has a deadlock problem when flushing, where a flush of A goes to write the stats, but meanwhile B has acquired the flusherlock in preparation to flush, so A can't acquire the lock to do the stats write, and B can't release the lock because we only allow N flushes at a time.  Because that's pretty hairy, I'm going to go the route of storing a separate -Statistics.db, but am posting this patch in case it turns out to be useful later.
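
To make the separate-file route concrete, here is a sketch of writing two bucket arrays to a side file and reading them back. The layout (an int length followed by that many longs, row sizes first and column counts second) and the class StatsFileSketch are assumptions for illustration; this is not the actual -Statistics.db format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative sketch only; the on-disk layout here is hypothetical. */
final class StatsFileSketch {
    private StatsFileSketch() {}

    /** Serialize both histograms' bucket counts: row sizes first, then column counts. */
    static byte[] toBytes(long[] rowSizes, long[] columnCounts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeBuckets(rowSizes, out);
            writeBuckets(columnCounts, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Read back what toBytes wrote; index 0 is row sizes, index 1 is column counts. */
    static long[][] fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long[] rowSizes = readBuckets(in);
            long[] columnCounts = readBuckets(in);
            return new long[][] {rowSizes, columnCounts};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeBuckets(long[] buckets, DataOutputStream out) throws IOException {
        out.writeInt(buckets.length); // length prefix so the reader knows how many longs follow
        for (long b : buckets) out.writeLong(b);
    }

    private static long[] readBuckets(DataInputStream in) throws IOException {
        int n = in.readInt();
        long[] buckets = new long[n];
        for (int i = 0; i < n; i++) buckets[i] = in.readLong();
        return buckets;
    }
}
```

Because such a file would be written alongside the sstable's other components, the write needs no lock on another column family, which is what sidesteps the deadlock described above.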

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7 beta 1
>
>         Attachments: 1155-v2.txt, 1155-v3.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.


[jira] Updated: (CASSANDRA-1155) keep persistent row statistics

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1155:
--------------------------------------

    Attachment: 1155-v4.txt

v4 moves the write to stats CF into postFlushExecutor to avoid deadlock.

the deadlock was, you have two CFs flushing, A and B.

flush of A goes to update stats CF
meanwhile flush of B has acquired flusherlock in preparation to flush, so A can't acquire the lock to do the stats write
but B can't release the lock, because we only allow N flushes at a time, so it blocks on the executor submission (to flush-writer-pool)
so neither A nor B can proceed
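
The shape of that fix can be sketched as follows. The flush itself runs on a bounded writer pool, and the stats write is chained onto a separate single-threaded executor once the flush has finished, so the flush thread never waits for a lock while occupying a writer slot. The class PostFlushSketch and its members are hypothetical, and the sketch does not model the flusherlock itself; it only shows the hand-off between the two executors.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

/** Illustrative sketch only; not Cassandra's flush machinery. */
final class PostFlushSketch {
    // Daemon threads so a forgotten shutdown() cannot keep the JVM alive.
    private static final ThreadFactory DAEMON = r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    };

    // Bounded pool standing in for the flush writer pool (size 1 to make contention obvious).
    private final ExecutorService flushWriter = Executors.newFixedThreadPool(1, DAEMON);
    // Separate single thread for work that must happen after a flush completes.
    private final ExecutorService postFlush = Executors.newSingleThreadExecutor(DAEMON);
    private final List<String> statsWritten = new CopyOnWriteArrayList<>();

    /** Flush one column family; the returned future completes after its stats are recorded. */
    CompletableFuture<Void> flush(String columnFamily) {
        return CompletableFuture
                .runAsync(() -> writeSSTable(columnFamily), flushWriter)
                // The stats write runs on postFlush, not on the flush writer thread.
                .thenRunAsync(() -> statsWritten.add(columnFamily), postFlush);
    }

    private void writeSSTable(String columnFamily) {
        // Placeholder for the real flush work.
    }

    List<String> statsWritten() {
        return statsWritten;
    }

    void shutdown() {
        flushWriter.shutdown();
        postFlush.shutdown();
    }
}
```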

> keep persistent row statistics
> ------------------------------
>
>                 Key: CASSANDRA-1155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1155
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7 beta 1
>
>         Attachments: 1155-v2.txt, 1155-v3.txt, 1155-v4.txt, 1155.txt
>
>
> during flush and compaction we should keep row size statistics using EstimatedHistogram (column count, and row size), replacing min/max/total sizes in CFS.
> having this detail will let us estimate, given an index CF, how many nodes we need to query to get the number of matching rows requested by the client.
