You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2015/06/03 15:34:38 UTC
[jira] [Comment Edited] (CASSANDRA-8915) Improve MergeIterator performance

    [ https://issues.apache.org/jira/browse/CASSANDRA-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570791#comment-14570791 ] 

Benedict edited comment on CASSANDRA-8915 at 6/3/15 1:34 PM:
-------------------------------------------------------------

Thanks for taking the time to provide this. I have some comments and suggestions for clarifying the proof for future readers. It is always tricky to find your bearings in a new proof, and I found myself confused by some subtle ambiguities that are hard to spot as the author.


* It would help to state that {{=}} refers to the partial order over key equality, not node identity (the first sentence in particular lead to some double-takes) 
* "stands for" and "predicate" (and use of a function-like identifier) to descibe {{σ(N)}} is confusing. This caused a surprising number of misunderstandings that took a good hour to crystalise in the realisation that {{σ(N)}} meant the flag or property that encodes the result of the predicate {{ρ(N) = N}}, not a short-hand for that predicate
* It would help to state {{P ≙ ep-heap(M,N)}} to indicate the ep-heap P rooted at N that leads to M, to avoid confusion with the similarity of "rooted at" and "leading to" 
* I also find that switching case to capitals for heaps/structures and retaining lower case for nodes/values is helpful for clarity. P and R are in the same run of variables as N, but refer to a different category of things. i.e. I would have found it clearer to follow the statement {{P ≙ ep-heap(m,n)}}.
* This would permit p ad P to respectively refer to a node and the ep-heap rooted at the node, such that {{P ≙ ep-heap(m,p)}} makes immediate intuitive sense
* \{\{\}\} can then be used to mark the boundaries of mathematical expressions (so lower case chars are still easily parsed as not prose) 

bq. 3. ... If ¬σ(P), set σ(R) := (P = R), otherwise leave σ(R) unchanged.

Perhaps clarify, that σ(P) => σ(R), so already correct. I realise you state this later, but here is where the question is raised. The later sentence is also very hard to parse, and could be simplified with the earlier short explanation.

bq. 3. ...  Use the algorithm recursively to *turn* the two ep-subheaps leading to P at its children and the new value C > P into an ep-heap leading to P

Consider *merge*, to clarify that this step restores the binary-tree property.

I haven't yet corroborated the code against the  proof, but will do so shortly. I would consider supplying the initial proof outline with its three cases along with the code, so that we can annotate each major branch with a short demonstration of invariant maintenance.


was (Author: benedict):
Thanks for taking the time to provide this. I have some comments and suggestions for clarifying the proof for future readers. It is always tricky to find your bearings in a new proof, and I found myself confused by some subtle ambiguities that are hard to spot as the author.


* It would help to state that {{=}} refers to the partial order over key equality, not node identity (the first sentence in particular lead to some double-takes) 
* "stands for" and "predicate" (and use of a function-like identifier) to descibe {{σ(N)}} is confusing. This caused a surprising number of misunderstandings that took a good hour to crystalise in the realisation that {{σ(N)}} meant the flag or property that encodes the result of the predicate {{ρ(N) = N}}, not a short-hand for that predicate
* It would help to state {{P ≙ ep-heap(M,N)}} to indicate the ep-heap P rooted at N that leads to M, to avoid confusion with the similarity of "rooted at" and "leading to" 
* I also find that switching case to capitals for heaps/structures and retaining lower case for nodes/values is helpful for clarity. P and R are in the same run of variables as N, but refer to a different category of things. i.e. I would have found it clearer to follow the statement {{P ≙ ep-heap(m,n)}}.
* \{\{\}\} can then be used to mark the boundaries of mathematical expressions (so lower case chars are still easily parsed as not prose) 

bq. 3. ... If ¬σ(P), set σ(R) := (P = R), otherwise leave σ(R) unchanged.

Perhaps clarify, that σ(P) => σ(R), so already correct. I realise you state this later, but here is where the question is raised. The later sentence is also very hard to parse, and could be simplified with the earlier short explanation.

bq. 3. ...  Use the algorithm recursively to *turn* the two ep-subheaps leading to P at its children and the new value C > P into an ep-heap leading to P

Consider *merge*, to clarify that this step restores the binary-tree property.

I haven't yet corroborated the code against the  proof, but will do so shortly. I would consider supplying the initial proof outline with its three cases along with the code, so that we can annotate each major branch with a short demonstration of invariant maintenance.

> Improve MergeIterator performance
> ---------------------------------
>
>                 Key: CASSANDRA-8915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8915
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Branimir Lambov
>            Assignee: Branimir Lambov
>            Priority: Minor
>              Labels: compaction, performance
>             Fix For: 3.x
>
>
> The implementation of {{MergeIterator}} uses a priority queue and applies a pair of {{poll}}+{{add}} operations for every item in the resulting sequence. This is quite inefficient as {{poll}} necessarily applies at least {{log N}} comparisons (up to {{2log N}}), and {{add}} often requires another {{log N}}, for example in the case where the inputs largely don't overlap (where {{N}} is the number of iterators being merged).
> This can easily be replaced with a simple custom structure that can perform replacement of the top of the queue in a single step, which will very often complete after a couple of comparisons and in the worst case scenarios will match the complexity of the current implementation.
> This should significantly improve merge performance for iterators with limited overlap (e.g. levelled compaction).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)