Posted to commits@cassandra.apache.org by "Philip Andronov (Created) (JIRA)" <ji...@apache.org> on 2012/02/03 12:41:53 UTC

[jira] [Created] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Unnecessary  ReadRepair request during RangeScan
------------------------------------------------

                 Key: CASSANDRA-3843
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.0.0
            Reporter: Philip Andronov
            Priority: Critical


During a read at consistency level Quorum with a replication factor greater than 2, Cassandra sends at least one ReadRepair even when there is no need to do so.

Because read requests wait until the ReadRepair finishes, this slows requests down a lot, up to a Timeout :(

It seems the problem was introduced by CASSANDRA-2494; unfortunately I do not know the Cassandra internals well enough to fix it without breaking the CASSANDRA-2494 functionality, so my report comes without a patch.

Code explanations:
class RangeSliceResponseResolver {
    // ....
    private class Reducer extends MergeIterator.Reducer<Pair<Row,InetAddress>, Row>
    {
    // ....

        protected Row getReduced()
        {
            ColumnFamily resolved = versions.size() > 1
                                  ? RowRepairResolver.resolveSuperset(versions)
                                  : versions.get(0);
            if (versions.size() < sources.size())
            {
                for (InetAddress source : sources)
                {
                    if (!versionSources.contains(source))
                    {
                          
                        // [PA] Here we add a null ColumnFamily.
                        // Later it is compared with the "desired"
                        // version and yields a "fake" difference, which
                        // forces Cassandra to send a ReadRepair to the given source
                        versions.add(null);
                        versionSources.add(source);
                    }
                }
            }
            // ....
            if (resolved != null)
                repairResults.addAll(RowRepairResolver.scheduleRepairs(resolved, table, key, versions, versionSources));
            // ....
        }
    }
}


public class RowRepairResolver extends AbstractRowResolver {
    // ....
    public static List<IAsyncResult> scheduleRepairs(ColumnFamily resolved, String table, DecoratedKey<?> key, List<ColumnFamily> versions, List<InetAddress> endpoints)
    {
        List<IAsyncResult> results = new ArrayList<IAsyncResult>(versions.size());

        for (int i = 0; i < versions.size(); i++)
        {
            // Sooner or later we compare null against resolved, which are obviously
            // not equal, so a ReadRepair is fired even though it is not needed here
            ColumnFamily diffCf = ColumnFamily.diff(versions.get(i), resolved);
            if (diffCf == null)
                continue;
        // .... 

Imagine the following situation:
NodeA has X.1 // row X with the version 1
NodeB has X.2 
NodeC has X.? // Unknown version, but because write was with Quorum it is 1 or 2

During the Quorum read from nodes A and B, Cassandra creates the merged version 12 and sends a ReadRepair, so the nodes now have the following content:
NodeA has X.12
NodeB has X.12

which is correct. However, Cassandra also fires a ReadRepair to NodeC. There is no need to do that: the next consistent read will either be served by nodes A/B (no ReadRepair) or by a pair that includes node C, and in that case a ReadRepair will be fired and will bring NodeC into a consistent state.

If you are reading from the index, then sooner or later you will get a TimeoutException because the cluster is overloaded by ReadRepair requests *even* if all nodes have the same data :(
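
To make the mechanism concrete, here is a minimal, self-contained sketch (this is not Cassandra code; NullVersionRepairSketch and its diff helper are hypothetical stand-ins for the ColumnFamily.diff behaviour the comments above describe) showing why a null placeholder version always produces a non-empty difference against the resolved row and therefore schedules a repair for a replica that was never even queried:
{code:title=NullVersionRepairSketch.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NullVersionRepairSketch
{
    // Stand-in for the described diff behaviour: returns the columns the given version
    // is missing compared to the resolved row, or null if nothing differs.
    static Map<String, String> diff(Map<String, String> version, Map<String, String> resolved)
    {
        if (version == null)
            return resolved; // a never-contacted replica "differs" by the whole row
        Map<String, String> missing = new HashMap<>(resolved);
        missing.keySet().removeAll(version.keySet());
        return missing.isEmpty() ? null : missing;
    }

    public static void main(String[] args)
    {
        Map<String, String> resolved = Map.of("col", "X.12");

        List<String> endpoints = List.of("NodeA", "NodeB", "NodeC");
        List<Map<String, String>> versions = new ArrayList<>();
        versions.add(Map.of("col", "X.12")); // NodeA answered and already matches
        versions.add(Map.of("col", "X.12")); // NodeB answered and already matches
        versions.add(null);                  // NodeC was never queried -> null placeholder

        for (int i = 0; i < versions.size(); i++)
        {
            Map<String, String> diffCf = diff(versions.get(i), resolved);
            if (diffCf == null)
                continue; // nothing to repair for this replica
            System.out.println("scheduling repair mutation for " + endpoints.get(i) + ": " + diffCf);
        }
        // Output: only NodeC gets a repair mutation, even though its data may already match.
    }
}
{code}
In this model the only way NodeC avoids the spurious repair mutation is to never have a placeholder added for it in the first place, which is the direction the discussion below takes.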

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3843:
--------------------------------------

    Fix Version/s: 1.0.8
         Assignee: Jonathan Ellis
    
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>            Priority: Critical
>             Fix For: 1.0.8
>
>

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Philip Andronov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Andronov updated CASSANDRA-3843:
---------------------------------------

    Description: 
During a read at consistency level Quorum with a replication factor greater than 2, Cassandra sends at least one ReadRepair even when there is no need to do so.

Because read requests wait until the ReadRepair finishes, this slows requests down a lot, up to a Timeout :(

It seems the problem was introduced by CASSANDRA-2494; unfortunately I do not know the Cassandra internals well enough to fix it without breaking the CASSANDRA-2494 functionality, so my report comes without a patch.

Code explanations:
{code:title=RangeSliceResponseResolver.java|borderStyle=solid}
class RangeSliceResponseResolver {
    // ....
    private class Reducer extends MergeIterator.Reducer<Pair<Row,InetAddress>, Row>
    {
    // ....

        protected Row getReduced()
        {
            ColumnFamily resolved = versions.size() > 1
                                  ? RowRepairResolver.resolveSuperset(versions)
                                  : versions.get(0);
            if (versions.size() < sources.size())
            {
                for (InetAddress source : sources)
                {
                    if (!versionSources.contains(source))
                    {
                          
                        // [PA] Here we add a null ColumnFamily.
                        // Later it is compared with the "desired"
                        // version and yields a "fake" difference, which
                        // forces Cassandra to send a ReadRepair to the given source
                        versions.add(null);
                        versionSources.add(source);
                    }
                }
            }
            // ....
            if (resolved != null)
                repairResults.addAll(RowRepairResolver.scheduleRepairs(resolved, table, key, versions, versionSources));
            // ....
        }
    }
}
{code}


{code:title=RowRepairResolver.java|borderStyle=solid}
public class RowRepairResolver extends AbstractRowResolver {
    // ....
    public static List<IAsyncResult> scheduleRepairs(ColumnFamily resolved, String table, DecoratedKey<?> key, List<ColumnFamily> versions, List<InetAddress> endpoints)
    {
        List<IAsyncResult> results = new ArrayList<IAsyncResult>(versions.size());

        for (int i = 0; i < versions.size(); i++)
        {
            // Sooner or later we compare null against resolved, which are obviously
            // not equal, so a ReadRepair is fired even though it is not needed here
            ColumnFamily diffCf = ColumnFamily.diff(versions.get(i), resolved);
            if (diffCf == null)
                continue;
        // .... 
{code}

Imagine the following situation:
NodeA has X.1 // row X with the version 1
NodeB has X.2 
NodeC has X.? // Unknown version, but because write was with Quorum it is 1 or 2

During the Quorum read from nodes A and B, Cassandra creates the merged version 12 and sends a ReadRepair, so the nodes now have the following content:
NodeA has X.12
NodeB has X.12

which is correct. However, Cassandra also fires a ReadRepair to NodeC. There is no need to do that: the next consistent read will either be served by nodes A/B (no ReadRepair) or by a pair that includes node C, and in that case a ReadRepair will be fired and will bring NodeC into a consistent state.

If you are reading from the index, then sooner or later you will get a TimeoutException because the cluster is overloaded by ReadRepair requests *even* if all nodes have the same data :(


    
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Priority: Critical
>

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3843:
--------------------------------------

    Attachment: 3843.txt

The null version was added for CASSANDRA-2680.  I think the core problem here is that the RSRR is being created with *all* the replica endpoints ({{liveEndpoints}}), not just the ones being asked to respond to the query ({{handler.endpoints}}).  This is a bit tricky since the handler wants to know its resolver at creation time, and unlike RowDigestResolver, RSRR wants to initialize its endpoints at creation time too.  Kind of hackish patch attached.
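
As a rough illustration of that idea (this is only a hypothetical sketch, not the attached patch; the class and method names below are invented for the example), building the range-slice resolver over only the endpoints that are actually queried means every expected source eventually supplies a real version, so no null placeholder is manufactured and only genuine differences trigger repairs:
{code:title=RestrictedRangeResolverSketch.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the resolver is constructed over the replicas actually asked
// to respond (handler.endpoints in the description above), not over all live replicas.
class RestrictedRangeResolverSketch
{
    private final List<String> sources;                      // only the contacted replicas
    private final Map<String, String> versionBySource = new HashMap<>();

    RestrictedRangeResolverSketch(List<String> contactedEndpoints)
    {
        this.sources = contactedEndpoints;
    }

    void preprocess(String source, String version)
    {
        versionBySource.put(source, version);
    }

    // Every source has answered, so no null versions are invented and a repair is
    // scheduled only when a contacted replica really differs from the resolved row.
    List<String> endpointsNeedingRepair(String resolved)
    {
        List<String> needRepair = new ArrayList<>();
        for (String source : sources)
        {
            String version = versionBySource.get(source);
            if (!resolved.equals(version))
                needRepair.add(source);
        }
        return needRepair;
    }

    public static void main(String[] args)
    {
        // QUORUM read at RF=3: only NodeA and NodeB are queried, NodeC is left alone.
        RestrictedRangeResolverSketch resolver =
                new RestrictedRangeResolverSketch(List.of("NodeA", "NodeB"));
        resolver.preprocess("NodeA", "X.12");
        resolver.preprocess("NodeB", "X.12");
        System.out.println(resolver.endpointsNeedingRepair("X.12")); // prints [] -- no spurious repair
    }
}
{code}
The trade-off mentioned above still applies: the handler wants to know its resolver at creation time, and the resolver wants its endpoint list at creation time too, so the restricted endpoint list has to be threaded through carefully.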

                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843.txt
>
>

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3843:
--------------------------------------

    Reviewer: vijay2win@yahoo.com
    Priority: Major  (was: Critical)
    
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212617#comment-13212617 ] 

Jeremy Hanna commented on CASSANDRA-3843:
-----------------------------------------

I ran repairs on all the nodes and then compactions on all the nodes. Then I ran a Pig job that simply counts the number of rows in the column family. Again, I think the overall writes were reduced, but there are still writes going on. I need to turn debug on and run the same test again. I ran the compactions at 6:42 and the range scans at 14:16:

-rw-r--r-- 1 root root 40106228511 Feb 21 06:42 account_snapshot-g-792-Data.db
-rw-r--r-- 1 root root   206884816 Feb 21 06:42 account_snapshot-g-792-Filter.db
-rw-r--r-- 1 root root  2913796038 Feb 21 06:42 account_snapshot-g-792-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 06:42 account_snapshot-g-792-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-793-Compacted
-rw-r--r-- 1 root root      287286 Feb 21 14:16 account_snapshot-g-793-Data.db
-rw-r--r-- 1 root root         976 Feb 21 14:16 account_snapshot-g-793-Filter.db
-rw-r--r-- 1 root root       20857 Feb 21 14:16 account_snapshot-g-793-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:16 account_snapshot-g-793-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-794-Compacted
-rw-r--r-- 1 root root    87770771 Feb 21 14:17 account_snapshot-g-794-Data.db
-rw-r--r-- 1 root root      293944 Feb 21 14:17 account_snapshot-g-794-Filter.db
-rw-r--r-- 1 root root     6377968 Feb 21 14:17 account_snapshot-g-794-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-794-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-795-Compacted
-rw-r--r-- 1 root root    78459166 Feb 21 14:17 account_snapshot-g-795-Data.db
-rw-r--r-- 1 root root      262600 Feb 21 14:17 account_snapshot-g-795-Filter.db
-rw-r--r-- 1 root root     5698156 Feb 21 14:17 account_snapshot-g-795-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-795-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-796-Compacted
-rw-r--r-- 1 root root    69838937 Feb 21 14:17 account_snapshot-g-796-Data.db
-rw-r--r-- 1 root root      234000 Feb 21 14:17 account_snapshot-g-796-Filter.db
-rw-r--r-- 1 root root     5077447 Feb 21 14:17 account_snapshot-g-796-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-796-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-797-Compacted
-rw-r--r-- 1 root root    68094433 Feb 21 14:17 account_snapshot-g-797-Data.db
-rw-r--r-- 1 root root      227808 Feb 21 14:17 account_snapshot-g-797-Filter.db
-rw-r--r-- 1 root root     4943098 Feb 21 14:17 account_snapshot-g-797-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-797-Statistics.db
-rw-r--r-- 1 root root   304163307 Feb 21 14:20 account_snapshot-g-798-Data.db
-rw-r--r-- 1 root root     1019776 Feb 21 14:20 account_snapshot-g-798-Filter.db
-rw-r--r-- 1 root root    22096669 Feb 21 14:20 account_snapshot-g-798-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-798-Statistics.db
-rw-r--r-- 1 root root    65874829 Feb 21 14:18 account_snapshot-g-799-Data.db
-rw-r--r-- 1 root root      220192 Feb 21 14:18 account_snapshot-g-799-Filter.db
-rw-r--r-- 1 root root     4777809 Feb 21 14:18 account_snapshot-g-799-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-799-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-800-Compacted
-rw-r--r-- 1 root root    50067413 Feb 21 14:18 account_snapshot-g-800-Data.db
-rw-r--r-- 1 root root      167416 Feb 21 14:18 account_snapshot-g-800-Filter.db
-rw-r--r-- 1 root root     3632313 Feb 21 14:18 account_snapshot-g-800-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-800-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-801-Compacted
-rw-r--r-- 1 root root    50575719 Feb 21 14:18 account_snapshot-g-801-Data.db
-rw-r--r-- 1 root root      169160 Feb 21 14:18 account_snapshot-g-801-Filter.db
-rw-r--r-- 1 root root     3669880 Feb 21 14:18 account_snapshot-g-801-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-801-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-802-Compacted
-rw-r--r-- 1 root root    41788766 Feb 21 14:19 account_snapshot-g-802-Data.db
-rw-r--r-- 1 root root      139776 Feb 21 14:19 account_snapshot-g-802-Filter.db
-rw-r--r-- 1 root root     3033069 Feb 21 14:19 account_snapshot-g-802-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-802-Statistics.db
-rw-r--r-- 1 root root    46547146 Feb 21 14:19 account_snapshot-g-803-Data.db
-rw-r--r-- 1 root root      155720 Feb 21 14:19 account_snapshot-g-803-Filter.db
-rw-r--r-- 1 root root     3378457 Feb 21 14:19 account_snapshot-g-803-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-803-Statistics.db
-rw-r--r-- 1 root root   142719184 Feb 21 14:20 account_snapshot-g-804-Data.db
-rw-r--r-- 1 root root      478576 Feb 21 14:19 account_snapshot-g-804-Filter.db
-rw-r--r-- 1 root root    10356119 Feb 21 14:20 account_snapshot-g-804-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-804-Statistics.db
-rw-r--r-- 1 root root    55373874 Feb 21 14:19 account_snapshot-g-805-Data.db
-rw-r--r-- 1 root root      185160 Feb 21 14:19 account_snapshot-g-805-Filter.db
-rw-r--r-- 1 root root     4017391 Feb 21 14:19 account_snapshot-g-805-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-805-Statistics.db
-rw-r--r-- 1 root root    46399227 Feb 21 14:19 account_snapshot-g-806-Data.db
-rw-r--r-- 1 root root      155120 Feb 21 14:19 account_snapshot-g-806-Filter.db
-rw-r--r-- 1 root root     3365947 Feb 21 14:19 account_snapshot-g-806-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-806-Statistics.db
-rw-r--r-- 1 root root    58491393 Feb 21 14:19 account_snapshot-g-807-Data.db
-rw-r--r-- 1 root root      196048 Feb 21 14:19 account_snapshot-g-807-Filter.db
-rw-r--r-- 1 root root     4253922 Feb 21 14:19 account_snapshot-g-807-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-807-Statistics.db
-rw-r--r-- 1 root root    47609635 Feb 21 14:20 account_snapshot-g-808-Data.db
-rw-r--r-- 1 root root      159320 Feb 21 14:20 account_snapshot-g-808-Filter.db
-rw-r--r-- 1 root root     3456985 Feb 21 14:20 account_snapshot-g-808-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-808-Statistics.db
-rw-r--r-- 1 root root    46923060 Feb 21 14:20 account_snapshot-tmp-g-809-Data.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-tmp-g-809-Index.db
-rw-r--r-- 1 root root    49693602 Feb 21 14:20 account_snapshot-tmp-g-810-Data.db
-rw-r--r-- 1 root root      166600 Feb 21 14:20 account_snapshot-tmp-g-810-Filter.db
-rw-r--r-- 1 root root     3614750 Feb 21 14:20 account_snapshot-tmp-g-810-Index.db
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212725#comment-13212725 ] 

Jeremy Hanna commented on CASSANDRA-3843:
-----------------------------------------

Good to know - we'll upgrade to 1.0.8 as soon as we can then.
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Vijay (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204269#comment-13204269 ] 

Vijay commented on CASSANDRA-3843:
----------------------------------

+1
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843.txt
>
>

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212679#comment-13212679 ] 

Brandon Williams commented on CASSANDRA-3843:
---------------------------------------------

I'm unable to repro against 1.0 HEAD.
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Philip Andronov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Andronov updated CASSANDRA-3843:
---------------------------------------

    Description: 
When reading at Quorum consistency with a replication factor greater than 2, Cassandra sends at least one ReadRepair even when there is no need to do so.

Because read requests wait for the ReadRepair to finish, this slows requests down a lot, sometimes all the way to a timeout :(

It seems the problem was introduced by CASSANDRA-2494; unfortunately I don't know the Cassandra internals well enough to fix it without breaking the CASSANDRA-2494 functionality, so I'm filing this report without a patch.

Code explanations:
{code:title=RangeSliceResponseResolver.java|borderStyle=solid}
class RangeSliceResponseResolver {
    // ....
    private class Reducer extends MergeIterator.Reducer<Pair<Row,InetAddress>, Row>
    {
    // ....

        protected Row getReduced()
        {
            ColumnFamily resolved = versions.size() > 1
                                  ? RowRepairResolver.resolveSuperset(versions)
                                  : versions.get(0);
            if (versions.size() < sources.size())
            {
                for (InetAddress source : sources)
                {
                    if (!versionSources.contains(source))
                    {
                          
                        // [PA] Here we add a null ColumnFamily.
                        // Later it is compared with the "desired" (resolved)
                        // version, producing a "fake" difference that
                        // forces Cassandra to send a ReadRepair to that source
                        versions.add(null);
                        versionSources.add(source);
                    }
                }
            }
            // ....
            if (resolved != null)
                repairResults.addAll(RowRepairResolver.scheduleRepairs(resolved, table, key, versions, versionSources));
            // ....
        }
    }
}
{code}


{code:title=RowRepairResolver.java|borderStyle=solid}
public class RowRepairResolver extends AbstractRowResolver {
    // ....
    public static List<IAsyncResult> scheduleRepairs(ColumnFamily resolved, String table, DecoratedKey<?> key, List<ColumnFamily> versions, List<InetAddress> endpoints)
    {
        List<IAsyncResult> results = new ArrayList<IAsyncResult>(versions.size());

        for (int i = 0; i < versions.size(); i++)
        {
            // On some iteration we compare null with resolved, which obviously
            // differ, so a ReadRepair request will be fired even though it is not needed here
            ColumnFamily diffCf = ColumnFamily.diff(versions.get(i), resolved);
            if (diffCf == null)
                continue;
        // .... 
{code}

Imagine the following situation:
NodeA has X.1 // row X with the version 1
NodeB has X.2 
NodeC has X.? // Unknown version, but because the write was at Quorum it is either 1 or 2

During a Quorum read from nodes A and B, Cassandra creates version 12 and sends a ReadRepair, so the nodes now have the following content:
NodeA has X.12
NodeB has X.12

which is correct; however, Cassandra will also fire a ReadRepair to NodeC. There is no need to do that: the next consistent read will either be served by nodes {A, B} (no ReadRepair needed) or by a pair involving C, {?, C}, in which case a ReadRepair will be fired and will bring NodeC to a consistent state.

Right now we read from the index a lot, and at some point we start getting TimeOutExceptions because the cluster is overloaded by ReadRepair requests *even* when all nodes have the same data :(
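
Purely as an illustration of the shape a guard could take (a hypothetical sketch under the assumption that skipping the padded null entries is acceptable, not the actual fix for this ticket), the scheduling step could ignore sources that never returned a version when the replicas that did answer already match the resolved superset:

{code:title=SkipPaddedSourcesSketch.java|borderStyle=solid}
import java.util.*;

// Hypothetical guard, for illustration only: diff only the versions that were
// actually received and ignore the null placeholders padded in for sources
// that never answered. Stand-in types, not Cassandra classes.
public class SkipPaddedSourcesSketch
{
    static List<String> scheduleRepairs(Map<String, String> resolved,
                                        List<Map<String, String>> versions,
                                        List<String> endpoints)
    {
        List<String> repairTargets = new ArrayList<String>();
        for (int i = 0; i < versions.size(); i++)
        {
            Map<String, String> version = versions.get(i);
            if (version == null)
                continue; // padded placeholder: this source never sent a version
            Map<String, String> missing = new HashMap<String, String>(resolved);
            missing.keySet().removeAll(version.keySet());
            if (missing.isEmpty())
                continue; // this replica already matches the resolved superset
            repairTargets.add(endpoints.get(i));
        }
        return repairTargets;
    }

    public static void main(String[] args)
    {
        Map<String, String> x12 = Collections.singletonMap("col", "X.12");
        List<Map<String, String>> versions = Arrays.asList(x12, x12, null);
        List<String> endpoints = Arrays.asList("NodeA", "NodeB", "NodeC");
        // Prints []: no repair is scheduled when NodeA and NodeB already agree.
        System.out.println(scheduleRepairs(x12, versions, endpoints));
    }
}
{code}

Whether dropping the padded entries entirely is safe with respect to the CASSANDRA-2494 behaviour is exactly the open question above, so this is only meant to show where the extra repair comes from.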


    
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Priority: Critical
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207827#comment-13207827 ] 

Jonathan Ellis commented on CASSANDRA-3843:
-------------------------------------------

I suggest testing with a single range scan at debug level.  Too much hay to see the needle when you're doing 100s or 1000s of scans.
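
For reference, one way to capture that single scan is to raise the log level for the classes quoted in the description before running it. This is only a sketch assuming the stock conf/log4j-server.properties layout of a 1.0.x install and that those classes live in org.apache.cassandra.service; adjust names as needed:

{code:title=conf/log4j-server.properties (excerpt)|borderStyle=solid}
# Assumed additions for a one-off debug run; remove them again afterwards,
# since debug logging on these paths is noisy under load.
log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
log4j.logger.org.apache.cassandra.service.RangeSliceResponseResolver=DEBUG
log4j.logger.org.apache.cassandra.service.RowRepairResolver=DEBUG
{code}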
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212617#comment-13212617 ] 

Jeremy Hanna edited comment on CASSANDRA-3843 at 2/21/12 2:30 PM:
------------------------------------------------------------------

I did repairs on all the nodes and then compactions on all the nodes. Then I did a pig job to simply count the number of rows in the column family. Again, I think the overall writes were reduced, but there are still writes going on. I need to turn debug on and do the same test again. I did the compactions at 6:42 and the range scans at 14:16:

{code}
-rw-r--r-- 1 root root 40106228511 Feb 21 06:42 account_snapshot-g-792-Data.db
-rw-r--r-- 1 root root   206884816 Feb 21 06:42 account_snapshot-g-792-Filter.db
-rw-r--r-- 1 root root  2913796038 Feb 21 06:42 account_snapshot-g-792-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 06:42 account_snapshot-g-792-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-793-Compacted
-rw-r--r-- 1 root root      287286 Feb 21 14:16 account_snapshot-g-793-Data.db
-rw-r--r-- 1 root root         976 Feb 21 14:16 account_snapshot-g-793-Filter.db
-rw-r--r-- 1 root root       20857 Feb 21 14:16 account_snapshot-g-793-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:16 account_snapshot-g-793-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-794-Compacted
-rw-r--r-- 1 root root    87770771 Feb 21 14:17 account_snapshot-g-794-Data.db
-rw-r--r-- 1 root root      293944 Feb 21 14:17 account_snapshot-g-794-Filter.db
-rw-r--r-- 1 root root     6377968 Feb 21 14:17 account_snapshot-g-794-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-794-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-795-Compacted
-rw-r--r-- 1 root root    78459166 Feb 21 14:17 account_snapshot-g-795-Data.db
-rw-r--r-- 1 root root      262600 Feb 21 14:17 account_snapshot-g-795-Filter.db
-rw-r--r-- 1 root root     5698156 Feb 21 14:17 account_snapshot-g-795-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-795-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-796-Compacted
-rw-r--r-- 1 root root    69838937 Feb 21 14:17 account_snapshot-g-796-Data.db
-rw-r--r-- 1 root root      234000 Feb 21 14:17 account_snapshot-g-796-Filter.db
-rw-r--r-- 1 root root     5077447 Feb 21 14:17 account_snapshot-g-796-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-796-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-797-Compacted
-rw-r--r-- 1 root root    68094433 Feb 21 14:17 account_snapshot-g-797-Data.db
-rw-r--r-- 1 root root      227808 Feb 21 14:17 account_snapshot-g-797-Filter.db
-rw-r--r-- 1 root root     4943098 Feb 21 14:17 account_snapshot-g-797-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:17 account_snapshot-g-797-Statistics.db
-rw-r--r-- 1 root root   304163307 Feb 21 14:20 account_snapshot-g-798-Data.db
-rw-r--r-- 1 root root     1019776 Feb 21 14:20 account_snapshot-g-798-Filter.db
-rw-r--r-- 1 root root    22096669 Feb 21 14:20 account_snapshot-g-798-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-798-Statistics.db
-rw-r--r-- 1 root root    65874829 Feb 21 14:18 account_snapshot-g-799-Data.db
-rw-r--r-- 1 root root      220192 Feb 21 14:18 account_snapshot-g-799-Filter.db
-rw-r--r-- 1 root root     4777809 Feb 21 14:18 account_snapshot-g-799-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-799-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-800-Compacted
-rw-r--r-- 1 root root    50067413 Feb 21 14:18 account_snapshot-g-800-Data.db
-rw-r--r-- 1 root root      167416 Feb 21 14:18 account_snapshot-g-800-Filter.db
-rw-r--r-- 1 root root     3632313 Feb 21 14:18 account_snapshot-g-800-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-800-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-801-Compacted
-rw-r--r-- 1 root root    50575719 Feb 21 14:18 account_snapshot-g-801-Data.db
-rw-r--r-- 1 root root      169160 Feb 21 14:18 account_snapshot-g-801-Filter.db
-rw-r--r-- 1 root root     3669880 Feb 21 14:18 account_snapshot-g-801-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:18 account_snapshot-g-801-Statistics.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-g-802-Compacted
-rw-r--r-- 1 root root    41788766 Feb 21 14:19 account_snapshot-g-802-Data.db
-rw-r--r-- 1 root root      139776 Feb 21 14:19 account_snapshot-g-802-Filter.db
-rw-r--r-- 1 root root     3033069 Feb 21 14:19 account_snapshot-g-802-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-802-Statistics.db
-rw-r--r-- 1 root root    46547146 Feb 21 14:19 account_snapshot-g-803-Data.db
-rw-r--r-- 1 root root      155720 Feb 21 14:19 account_snapshot-g-803-Filter.db
-rw-r--r-- 1 root root     3378457 Feb 21 14:19 account_snapshot-g-803-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-803-Statistics.db
-rw-r--r-- 1 root root   142719184 Feb 21 14:20 account_snapshot-g-804-Data.db
-rw-r--r-- 1 root root      478576 Feb 21 14:19 account_snapshot-g-804-Filter.db
-rw-r--r-- 1 root root    10356119 Feb 21 14:20 account_snapshot-g-804-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-804-Statistics.db
-rw-r--r-- 1 root root    55373874 Feb 21 14:19 account_snapshot-g-805-Data.db
-rw-r--r-- 1 root root      185160 Feb 21 14:19 account_snapshot-g-805-Filter.db
-rw-r--r-- 1 root root     4017391 Feb 21 14:19 account_snapshot-g-805-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-805-Statistics.db
-rw-r--r-- 1 root root    46399227 Feb 21 14:19 account_snapshot-g-806-Data.db
-rw-r--r-- 1 root root      155120 Feb 21 14:19 account_snapshot-g-806-Filter.db
-rw-r--r-- 1 root root     3365947 Feb 21 14:19 account_snapshot-g-806-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-806-Statistics.db
-rw-r--r-- 1 root root    58491393 Feb 21 14:19 account_snapshot-g-807-Data.db
-rw-r--r-- 1 root root      196048 Feb 21 14:19 account_snapshot-g-807-Filter.db
-rw-r--r-- 1 root root     4253922 Feb 21 14:19 account_snapshot-g-807-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:19 account_snapshot-g-807-Statistics.db
-rw-r--r-- 1 root root    47609635 Feb 21 14:20 account_snapshot-g-808-Data.db
-rw-r--r-- 1 root root      159320 Feb 21 14:20 account_snapshot-g-808-Filter.db
-rw-r--r-- 1 root root     3456985 Feb 21 14:20 account_snapshot-g-808-Index.db
-rw-r--r-- 1 root root        4276 Feb 21 14:20 account_snapshot-g-808-Statistics.db
-rw-r--r-- 1 root root    46923060 Feb 21 14:20 account_snapshot-tmp-g-809-Data.db
-rw-r--r-- 1 root root           0 Feb 21 14:20 account_snapshot-tmp-g-809-Index.db
-rw-r--r-- 1 root root    49693602 Feb 21 14:20 account_snapshot-tmp-g-810-Data.db
-rw-r--r-- 1 root root      166600 Feb 21 14:20 account_snapshot-tmp-g-810-Filter.db
-rw-r--r-- 1 root root     3614750 Feb 21 14:20 account_snapshot-tmp-g-810-Index.db
{code}
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207046#comment-13207046 ] 

Jonathan Ellis commented on CASSANDRA-3843:
-------------------------------------------

It's a relatively small patch, but StorageProxy and its callbacks can be fragile...  I almost didn't commit it to 1.0 either.  Tell you what though, I'll post a backported patch here and if you want you can run with it. :)
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207832#comment-13207832 ] 

Jeremy Hanna commented on CASSANDRA-3843:
-----------------------------------------

I did patch with v2.  Doing more testing today and it appears that there are writes occurring but it looks like a definite reduction.  It could be a valid repair thing.  I'll do some more testing and hopefully repair every node and compact every node and then do a scan across a large column family and see what happens.
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206881#comment-13206881 ] 

Jeremy Hanna commented on CASSANDRA-3843:
-----------------------------------------

We'll be upgrading to 1.0.8 as soon as we can, but this seems like a significant issue for anyone doing range scans - does it make sense to backport to 0.8.x?
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-3843.
---------------------------------------

    Resolution: Fixed
    
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212725#comment-13212725 ] 

Jeremy Hanna edited comment on CASSANDRA-3843 at 2/21/12 4:48 PM:
------------------------------------------------------------------

Thanks - good to know - we'll upgrade to 1.0.8 as soon as we can then.
                
> Unnecessary  ReadRepair request during RangeScan
> ------------------------------------------------
>
>                 Key: CASSANDRA-3843
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3843
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Philip Andronov
>            Assignee: Jonathan Ellis
>             Fix For: 1.0.8
>
>         Attachments: 3843-v2.txt, 3843.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-3843:
--------------------------------------

    Attachment: 3843-v2.txt

committed, with the same fix applied to the 2nd occurrence of the RSRR in the 1.0 StorageProxy. v2 attached in case that makes it easier for anyone to test.
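
The exact change is in the attached patches; as a rough illustration of one way to avoid the spurious diff (hypothetical, heavily simplified types, not the real StorageProxy or resolver code): hand the resolver only the replicas the range-slice message is actually sent to, so an un-queried replica never picks up a null-version placeholder.

{code:title=ResolverSourcesSketch.java|borderStyle=solid}
// Hypothetical, simplified sketch -- not the actual StorageProxy/RSRR code.
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResolverSourcesSketch
{
    // Stand-in for the resolver: it pads a null version (and thus schedules a
    // repair) for every source that does not show up among the responses.
    static class Resolver
    {
        final List<InetAddress> sources;
        Resolver(List<InetAddress> sources) { this.sources = sources; }
    }

    // Safer shape: only the endpoints the range-slice message is actually sent
    // to become sources, instead of every live replica for the range.
    static Resolver buildResolver(List<InetAddress> liveEndpoints, int endpointsToQuery)
    {
        List<InetAddress> contacted = new ArrayList<InetAddress>(
            liveEndpoints.subList(0, Math.min(endpointsToQuery, liveEndpoints.size())));
        return new Resolver(contacted);
    }

    public static void main(String[] args) throws Exception
    {
        List<InetAddress> live = Arrays.asList(InetAddress.getByName("10.0.0.1"),
                                               InetAddress.getByName("10.0.0.2"),
                                               InetAddress.getByName("10.0.0.3"));
        // RF=3 at QUORUM: two replicas are queried; the third never becomes a
        // source, so it cannot pick up a spurious null-version repair.
        Resolver resolver = buildResolver(live, 2);
        System.out.println("sources: " + resolver.sources);
    }
}
{code}

With only the queried replicas as sources, the legitimate CASSANDRA-2680 behaviour (repairing a queried replica that returned nothing for a key) is preserved, while a replica that was never asked is left alone.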
                

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207050#comment-13207050 ] 

Jonathan Ellis commented on CASSANDRA-3843:
-------------------------------------------

Looks to me like the 1.0 code changes from v2 apply cleanly to 0.8.  (CHANGES diff does not apply but can be ignored.)
                

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Philip Andronov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204600#comment-13204600 ] 

Philip Andronov commented on CASSANDRA-3843:
--------------------------------------------

> The null version was added for CASSANDRA-2680.
Oh, good point. Sorry, I should have paid more attention to the git history, not only to the annotations :)

Anyway, thanks for the patch; now we can apply the correct patch to our servers.
                

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207828#comment-13207828 ] 

Jonathan Ellis commented on CASSANDRA-3843:
-------------------------------------------

... You did patch with v2, right?
                

[jira] [Commented] (CASSANDRA-3843) Unnecessary ReadRepair request during RangeScan

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207577#comment-13207577 ] 

Jeremy Hanna commented on CASSANDRA-3843:
-----------------------------------------

I patched the version of 0.8.4 that we use with the change and applied it to all of our staging nodes. However, the problem still persists: we see writes on a column family that is only being range scanned. I major compacted the column family on all of the nodes, then ran a simple Pig job that only reads the contents of that CF, and afterwards saw a lot of minor compactions for that column family.
                

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira