Posted to commits@cassandra.apache.org by "David Arena (JIRA)" <ji...@apache.org> on 2011/06/20 15:49:47 UTC

[jira] [Created] (CASSANDRA-2798) Repair Fails 0.8

Repair Fails 0.8
----------------

                 Key: CASSANDRA-2798
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2798
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.0
            Reporter: David Arena


I am seeing a fatal problem in the new 0.8.

I'm running a 3-node cluster with a replication_factor of 3.

On node 3, if I:
kill -9 cassandra-pid
rm -rf "All data & logs"
start cassandra
nodetool -h "node-3-ip" repair

the whole cluster's data becomes duplicated.

e.g. before:
node 1 -> 2.65GB
node 2 -> 2.65GB
node 3 -> 2.65GB

e.g. after:
node 1 -> 5.3GB
node 2 -> 5.3GB
node 3 -> 7.95GB

nodetool repair never ends (96+ hours), yet there are no streams running and no CPU or disk activity.
Manually killing the repair and restarting does not help. Restarting the server/Cassandra does not help.
nodetool flush, compact, and cleanup all complete, but do not help.

This does not occur in 0.7.6. I have come to the conclusion that this is a major 0.8 issue.



Running: CentOS 5.6, JDK 1.6.0_26

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "David Arena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053185#comment-13053185 ] 

David Arena commented on CASSANDRA-2798:
----------------------------------------

So after a restart and a compact, I'm looking at this (it still doesn't seem absolutely correct, but yes, you are correct about the compaction problems):

10.0.1.150 Up Normal 2.61 GB 33.33% 0 
10.0.1.152 Up Normal 2.61 GB 33.33% 56713727820156410577229101238628035242 
10.0.1.154 Up Normal 3.16 GB 33.33% 113427455640312821154458202477256070485

Node1 & Node2 are now back to normal,
but Node3 did not return to 2.61GB.
I've tried compact, flush, cleanup, etc., and it won't get smaller. :(

I still don't understand why a repair on node3 balloons the data on node1 & node2 in 0.8. This should not happen, as far as I believe.
It's my understanding that node3 should copy the data from its replicas on the other nodes (hence why we see 2x the data size) and then compact to aggregate it down to a proper replica for the cluster.

Node1 & Node2 really shouldn't be changing at all, should they?
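
A back-of-the-envelope sketch (my own, not from the ticket) of the sizes that reasoning predicts, assuming RF=3 on 3 nodes means every node stores a full copy of the data:

full_copy_gb = 2.61        # per-node load before the test
streaming_replicas = 2     # node1 and node2 each repair the emptied node3

# Right after repair, node3 can hold one streamed copy per replica:
node3_after_repair = streaming_replicas * full_copy_gb   # ~5.2 GB
# A successful major compaction should merge the duplicates away again:
node3_after_compact = full_copy_gb                       # ~2.61 GB
# node1 and node2 take no new writes here, so they should stay at ~2.61 GB.
print(node3_after_repair, node3_after_compact)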



[jira] [Assigned] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-2798:
-----------------------------------------

    Assignee: Sylvain Lebresne

Thanks, David -- we've suspected something like this, and a simple case to reproduce will help a lot.



[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "David Arena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052613#comment-13052613 ] 

David Arena commented on CASSANDRA-2798:
----------------------------------------

OK, so I can't exactly hand out the inserting script, but I can give you an indication of the data format. Our test scripts are complex classes that build objects. However, this is EXACTLY what is actually written to Cassandra (CF test1), with a little obfuscation:

UUID4 key:
f9bb44f2844241df971e0975005c87dc   
DATA FORMAT:
{'i': '[{"pr": ["XYZ", "0.47"], "va": [["XZ", "0.19"]], "pu": "1307998855", "iu": "http://devel.test.com/test2730", "it": "TESTERtype 0: TESTERobject 2730", "pi": {"XYZ": "0.31", "XZ": "0.47"}, "id": "0!2730"}]', 'cu': 'XYZ', 'cd': '1308668648'}

Try loading 100,000 of these with random uuid4 keys; a sketch of such a loader follows.
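
A minimal sketch of such a loader, assuming the pycassa Thrift client of that era, the 'test' keyspace defined later in this thread, and one of the node addresses above (this is not the actual inserting script):

import json
import uuid
import pycassa

pool = pycassa.ConnectionPool('test', ['10.0.1.150:9160'])
test1 = pycassa.ColumnFamily(pool, 'test1')

# One obfuscated item in the shape quoted above.
item = [{"pr": ["XYZ", "0.47"], "va": [["XZ", "0.19"]], "pu": "1307998855",
         "iu": "http://devel.test.com/test2730",
         "it": "TESTERtype 0: TESTERobject 2730",
         "pi": {"XYZ": "0.31", "XZ": "0.47"}, "id": "0!2730"}]

for _ in range(100000):
    key = uuid.uuid4().hex  # e.g. f9bb44f2844241df971e0975005c87dc
    test1.insert(key, {'i': json.dumps(item), 'cu': 'XYZ', 'cd': '1308668648'})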

Furthermore, I have retried the test precisely, again and again, including a flush before killing node3. Still, I'm left with:
Before...
10.0.1.150      Up     Normal  2.61 GB         33.33%  0                                           
10.0.1.152      Up     Normal  2.61 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  2.61 GB         33.33%  113427455640312821154458202477256070485 

After Killing Node and Restart...
10.0.1.150      Up     Normal  2.61 GB         33.33%  0                                           
10.0.1.152      Up     Normal  2.61 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  61.69 KB        33.33%  113427455640312821154458202477256070485

After Running Repair...
10.0.1.150      Up     Normal  4.76 GB         33.33%  0                                           
10.0.1.152      Up     Normal  5.41 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  8.87 GB         33.33%  113427455640312821154458202477256070485

After Running Flush & Compact on ALL nodes...
10.0.1.150      Up     Normal  4.76 GB         33.33%  0                                           
10.0.1.152      Up     Normal  5.41 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  4.86 GB         33.33%  113427455640312821154458202477256070485

This does not occur in 0.7.6; in fact, it works perfectly.



[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053699#comment-13053699 ] 

Sylvain Lebresne commented on CASSANDRA-2798:
---------------------------------------------

bq. Node3 returned this time to 2.62GB.

OK, so at least the "load is not going down to what it should" problem is due to compaction problems, and those are fixed for 0.8.1 (which I have tested).

bq. I still don't understand why a repair on node3 balloons the data on node1 & node2 in 0.8. This should not happen, as far as I believe.

You are right, it shouldn't. Pretty sure this is due to CASSANDRA-2815. So we'll have to fix that.

As a side note, repair is clearly not the most efficient way to repair a node whose data has been fully nuked: even if everything goes well, building the Merkle trees is useless, as is streaming the same data twice. Hopefully CASSANDRA-957 will soon provide a better alternative for that particular case.


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "David Arena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053195#comment-13053195 ] 

David Arena commented on CASSANDRA-2798:
----------------------------------------

I retried the whole process, and Node3 returned this time to 2.62GB.

Are you able to test this in 0.8.1?


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052714#comment-13052714 ] 

Sylvain Lebresne commented on CASSANDRA-2798:
---------------------------------------------

Alright, that was my mistake: I was testing on the current 0.8 branch, thinking there was no reason this would have been fixed since 0.8.0, but it may actually be fixed there. On 0.8.0, I'm indeed able to reproduce that scenario. However, if after all of this I restart the nodes and redo a compact after the restart, everything goes back to the normal load. This makes me think that it is possibly CASSANDRA-2765 that prevents the compaction from actually happening; restarting allows it to happen again.

Can you confirm (or deny) that a restart followed by a compact fixes this?




[jira] [Issue Comment Edited] (CASSANDRA-2798) Repair Fails 0.8

Posted by "David Arena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052613#comment-13052613 ] 

David Arena edited comment on CASSANDRA-2798 at 6/21/11 3:37 PM:
-----------------------------------------------------------------

OK, so I can't exactly hand out the inserting script, but I can give you an indication of the data format. Our test scripts are complex classes that build objects. However, this is EXACTLY what is actually written to Cassandra (CF test1), with a little obfuscation:

UUID4 key:
f9bb44f2844241df971e0975005c87dc   
DATA FORMAT:
{'i': '[{"pr": ["XYZ", "0.47"], "va": [["XZ", "0.19"]], "pu": "1307998855", "iu": "http://devel.test.com/test2730", "it": "TESTERtype 0: TESTERobject 2730", "pi": {"XYZ": "0.31", "XZ": "0.47"}, "id": "0!2730"}]', 'cu': 'XYZ', 'cd': '1308668648'}

For CF test2, the data format looks like this:
UUID4 key:
f9bb44f2844241df971e0975005c87dc
DATA FORMAT:
('0!2243', {'rt': '1308221914', 'ri': '1308218344', 'pu': '1308218344'}), ('1!2342', {'pu': '1308080741'}), ('2!1731', {'pu': '1308618693'}), ('3!3772', {'pu': '1308338296'})..

There can be up to 100 fields per key in CF test2.

Basically, for every insert in CF test1, there is a corresponding insert in CF test2 (with roughly 50-100 fields).

Try loading 100,000 random uuid4 keys into CF test1 with the corresponding keys/fields in CF test2; a sketch of the test2 side follows.
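
A minimal sketch of the corresponding test2 insert, again assuming pycassa and the super-CF schema from this ticket, with one super column per '<type>!<id>' field (not the actual script):

import uuid
import pycassa

pool = pycassa.ConnectionPool('test', ['10.0.1.150:9160'])
test2 = pycassa.ColumnFamily(pool, 'test2')  # super CF: insert takes a nested dict

key = uuid.uuid4().hex  # the same uuid4 key as the matching test1 row
fields = {
    '0!2243': {'rt': '1308221914', 'ri': '1308218344', 'pu': '1308218344'},
    '1!2342': {'pu': '1308080741'},
    '2!1731': {'pu': '1308618693'},
}
test2.insert(key, fields)  # repeat with roughly 50-100 such fields per key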

Furthermore, I have retried the test precisely, again and again, including a flush before killing node3. Still, I'm not able to succeed:

Before...
10.0.1.150      Up     Normal  2.61 GB         33.33%  0                                           
10.0.1.152      Up     Normal  2.61 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  2.61 GB         33.33%  113427455640312821154458202477256070485 

After Killing Node and Restart...
10.0.1.150      Up     Normal  2.61 GB         33.33%  0                                           
10.0.1.152      Up     Normal  2.61 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  61.69 KB        33.33%  113427455640312821154458202477256070485

After Running Repair...
10.0.1.150      Up     Normal  4.76 GB         33.33%  0                                           
10.0.1.152      Up     Normal  5.41 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  8.87 GB         33.33%  113427455640312821154458202477256070485

After Running Flush & Compact on ALL nodes...
10.0.1.150      Up     Normal  4.76 GB         33.33%  0                                           
10.0.1.152      Up     Normal  5.41 GB         33.33%  56713727820156410577229101238628035242      
10.0.1.154      Up     Normal  4.86 GB         33.33%  113427455640312821154458202477256070485

This does not occur in 0.7.6; in fact, it works perfectly.


[jira] [Resolved] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-2798.
---------------------------------------

    Resolution: Duplicate
      Assignee:     (was: Sylvain Lebresne)


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "David Arena (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052016#comment-13052016 ] 

David Arena commented on CASSANDRA-2798:
----------------------------------------

For me, I simply set the three nodes' tokens to:
1. 0
2. 56713727820156410577229101238628035242
3. 113427455640312821154458202477256070485

I set the listen addresses to the private 10.0.1.0/24 machine IPs.
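
Those are the standard evenly spaced RandomPartitioner tokens, token(i) = i * 2**127 / N for node i of N; a quick Python check of my own:

num_nodes = 3
for i in range(num_nodes):
    print(i * 2**127 // num_nodes)
# 0
# 56713727820156410577229101238628035242
# 113427455640312821154458202477256070485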

Then I ran:

create keyspace test                                               
	with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
	and strategy_options = [{replication_factor:3}];

use test;

create column family test1
    with comparator = BytesType
    and keys_cached = 10000
    and rows_cached = 1000
    and row_cache_save_period = 0
    and key_cache_save_period = 3600
    and memtable_flush_after = 59
    and memtable_throughput = 255
    and memtable_operations = 0.29;

create column family test2
    with column_type = Super
    and subcomparator = BytesType
    and rows_cached = 10000
    and keys_cached = 50;
"
and filled both CF's with 100,000 keys each...

It shouldnt be too hard to reproduce..???
Im actually using the Riptano rpm's, but i would believe they are capable of packing for production use..



[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051992#comment-13051992 ] 

Sylvain Lebresne commented on CASSANDRA-2798:
---------------------------------------------

For info, CASSANDRA-2797 is what makes the repair never end. It does not explain the doubling of the data, however. I'm trying to reproduce right now, but my first attempt was unsuccessful (it worked as expected). Feel free to share more information on your setup (the server logs, for instance) if you're at liberty to do so.


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052141#comment-13052141 ] 

Sylvain Lebresne commented on CASSANDRA-2798:
---------------------------------------------

OK, I'm happy to help track this down, but somehow I can't get it to reproduce.
I tried with the same number of nodes, the same tokens, and the exact same column family definitions, and I inserted 100,000 keys into each (with 1 column per key for test1 and 1 super column with 1 column per key for test2). Once inserted, I flushed (so that there are some sstables), killed node3, cleaned all its data/commit logs, restarted node3, and ran nodetool repair on node3. It correctly succeeded. At the end of the repair, the load of node3 was twice the size it should be (which is expected, since both other nodes will have repaired it -- which in itself may not be the most efficient approach, but that is not the debate here), and the load had slightly increased on the two other nodes (I'll have to check the actual reason, but again, that is not the issue at hand). But after a compact, everything was back to its normal size.

I made several tries with different numbers of super columns/columns per key, but I can't test everything. Maybe the size of the values plays a role too. In any case, if you can reproduce it so simply, would you mind attaching the script you are using to "fill both CFs", as well as as much detail as possible about the steps you use?

Last thing: when I talk about the load of a node, I mean the load value as reported by nodetool ring. If you were checking the actual files on disk, please make sure to restart the cluster after the nodetool compact, so that this is not just compacted files that have not yet been deleted.

Overall, right now I'm focused on the "nodetool compact after repair doesn't make the load go down" problem, which is the really weird one.


[jira] [Commented] (CASSANDRA-2798) Repair Fails 0.8

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052003#comment-13052003 ] 

Sylvain Lebresne commented on CASSANDRA-2798:
---------------------------------------------

Actually, I was a bit quick: CASSANDRA-2797 would only be the culprit for making repair hang if the repair was started on a node other than the one where the data was removed, so it may not be that.

