Posted to dev@cassandra.apache.org by Sam Klock <sk...@akamai.com> on 2018/03/22 17:24:39 UTC

Optimizing queries for partition keys

Cassandra devs,

We use workflows in some of our clusters (running 3.0.15) that involve
"SELECT DISTINCT key FROM..."-style queries.  For some tables, we
observed extremely poor performance under light load (i.e., a small
number of rows per second and frequent timeouts), which we eventually
traced to replicas shipping entire rows (which in some cases could store
on the order of MBs of data) to service the query.  That surprised us
(partly because 2.1 doesn't seem to behave this way), so we did some
digging, and we eventually came up with a patch that modifies
SelectStatement.java in the following way: if the selection in the query
only includes the partition key, then when building a ColumnFilter for
the query, use:

    builder = ColumnFilter.selectionBuilder();

instead of:

    builder = ColumnFilter.allColumnsBuilder();

to initialize the ColumnFilter.Builder in gatherQueriedColumns().  That
seems to repair the performance regression, and it doesn't appear to
break any functionality (based on the unit tests and some smoke tests we
ran involving insertions and deletions).

We'd like to contribute this patch back to the project, but we're not
convinced that there aren't subtle correctness issues we're missing,
judging both from comments in the code and the existence of
CASSANDRA-5912, which suggests optimizing this kind of query is nontrivial.

So: does this change sound safe to make, or are there corner cases we
need to account for?  If there are corner cases, are there plausible
ways of addressing them at the SelectStatement level, or will we need to
look deeper?

Thanks,
SK

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org

Re: Optimizing queries for partition keys

Posted by Sam Klock <sk...@akamai.com>.

Can someone please take a look at CASSANDRA-14415 when you have a chance?
Getting a fix into a Cassandra release is not especially urgent for us,
but in lieu of that we would like to know whether it's safe to include
in our local build of Cassandra before attempting to deploy it.

Thanks,
SK

On 2018-04-24 14:16, Sam Klock wrote:
> Thanks.  For those interested: opened CASSANDRA-14415.
> 
> SK

Re: Optimizing queries for partition keys

Posted by Sam Klock <sk...@akamai.com>.

Thanks.  For those interested: opened CASSANDRA-14415.

SK

On 2018-04-19 06:04, Benjamin Lerer wrote:
> Hi Sam,
> 
> Your finding is interesting. Effectively, if the number of bytes to skip is
> larger than the remaining bytes in the buffer + the buffer size it could be
> faster to use seek.
> Feel free to open a JIRA ticket and attach your patch. It would be great if
> you could also add your table schema to the ticket, as well as some
> information on your environment (e.g. disk type).

Re: Optimizing queries for partition keys

Posted by Benjamin Lerer <be...@datastax.com>.

Hi Sam,

Your finding is interesting. Effectively, if the number of bytes to skip is
larger than the remaining bytes in the buffer + the buffer size it could be
faster to use seek.
Feel free to open a JIRA ticket and attach your patch. It would be great if
you could also add your table schema to the ticket, as well as some
information on your environment (e.g. disk type).
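
As a minimal sketch, the heuristic described above can be written as a single comparison. SkipHeuristic and shouldSeek are hypothetical names used for illustration only; they are not Cassandra APIs, and the example values are arbitrary.

```java
/** Hypothetical helper illustrating the heuristic; not part of Cassandra. */
final class SkipHeuristic {
    private SkipHeuristic() {}

    /**
     * Returns true when the skip would run past both the unread bytes already
     * in the buffer and one further full buffer, i.e. when skipping by reading
     * would need more than one refill of the buffer.
     */
    static boolean shouldSeek(long bytesToSkip, int remainingInBuffer, int bufferSize) {
        return bytesToSkip > (long) remainingInBuffer + bufferSize;
    }

    public static void main(String[] args) {
        // A short skip that stays within the buffered data.
        System.out.println(shouldSeek(100, 512, 4096));
        // A 1 MiB skip with a 4 KiB buffer.
        System.out.println(shouldSeek(1 << 20, 512, 4096));
    }
}
```

Whether this threshold is the right one in practice likely depends on the environment (e.g. disk type), which is why that information is useful on the ticket.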

On Tue, Apr 17, 2018 at 8:53 PM, Sam Klock <sk...@akamai.com> wrote:

> Thanks (and apologies for the delayed response); that was the kind of
> feedback we were looking for.
>
> We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially
> addresses our problem in the sense that it does limit the data sent on
> the wire.  The performance is still extremely poor, however, due to the
> fact that Cassandra continues to read large volumes of data from disk.
> (We've also confirmed this behavior in 3.11.2.)
>
> With a bit more investigation, we now believe the problem (after
> CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(),
> which appears to read bytes in order to skip them.  The subclass used in
> our case, RandomAccessReader, exposes a seek(), so we overrode
> skipBytes() in it to make use of seek(), and that seems to resolve the
> problem.
>
> This change is intuitively much safer than the one we'd originally
> identified, but we'd still like to confirm with you folks whether it's
> likely safe and, if so, whether it's also potentially worth contributing.
>
> Thanks,
> Sk

Re: Optimizing queries for partition keys

Posted by Sam Klock <sk...@akamai.com>.

Thanks (and apologies for the delayed response); that was the kind of
feedback we were looking for.

We backported the fix for CASSANDRA-10657 to 3.0.16, and it partially
addresses our problem in the sense that it does limit the data sent on
the wire.  The performance is still extremely poor, however, due to the
fact that Cassandra continues to read large volumes of data from disk.
(We've also confirmed this behavior in 3.11.2.)

With a bit more investigation, we now believe the problem (after
CASSANDRA-10657 is applied) is in RebufferingInputStream.skipBytes(),
which appears to read bytes in order to skip them.  The subclass used in
our case, RandomAccessReader, exposes a seek(), so we overrode
skipBytes() in it to make use of seek(), and that seems to resolve the
problem.
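
To illustrate the difference between skipping by reading and skipping by seeking, here is a toy sketch. It is not Cassandra's RebufferingInputStream or RandomAccessReader: ToyReader, skipByReading, skipBySeeking, and bytesFetched are hypothetical names, and an in-memory byte array stands in for the file on disk.

```java
/** Toy buffered reader over an in-memory "file"; all names are hypothetical. */
final class ToyReader {
    private final byte[] file;     // stands in for on-disk data
    private final byte[] buffer;   // fixed-size read buffer
    private long bufferOffset = 0; // file position of buffer[0]
    private int position = 0;      // next unread index in buffer
    private int limit = 0;         // number of valid bytes in buffer
    long bytesFetched = 0;         // bytes copied from the "file" so far

    ToyReader(byte[] file, int bufferSize) {
        this.file = file;
        this.buffer = new byte[bufferSize];
    }

    long filePointer() {
        return bufferOffset + position;
    }

    /** Refills the buffer starting at filePos and counts the bytes fetched. */
    private void rebufferAt(long filePos) {
        bufferOffset = filePos;
        position = 0;
        limit = (int) Math.min(buffer.length, file.length - filePos);
        System.arraycopy(file, (int) filePos, buffer, 0, limit);
        bytesFetched += limit;
    }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    int read() {
        if (position == limit) {
            if (filePointer() >= file.length) return -1;
            rebufferAt(filePointer());
        }
        return buffer[position++] & 0xFF;
    }

    /** Skips by consuming buffered data, refilling the buffer as it goes. */
    long skipByReading(long n) {
        long skipped = 0;
        while (skipped < n) {
            if (position == limit) {
                if (filePointer() >= file.length) break; // end of file
                rebufferAt(filePointer());
            }
            int step = (int) Math.min(n - skipped, limit - position);
            position += step;
            skipped += step;
        }
        return skipped;
    }

    /** Skips by moving the file pointer; nothing is fetched until the next read. */
    long skipBySeeking(long n) {
        if (n <= 0) return 0;
        long start = filePointer();
        long target = Math.min(start + n, file.length);
        if (target <= bufferOffset + limit) {
            position = (int) (target - bufferOffset); // target is still buffered
        } else {
            bufferOffset = target; // discard the buffer; refill lazily on read
            position = 0;
            limit = 0;
        }
        return target - start;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        ToyReader byReading = new ToyReader(data, 4096);
        ToyReader bySeeking = new ToyReader(data, 4096);
        byReading.skipByReading(1_000_000);
        bySeeking.skipBySeeking(1_000_000);
        byReading.read();
        bySeeking.read();
        System.out.println("read-to-skip fetched " + byReading.bytesFetched + " bytes");
        System.out.println("seek-to-skip fetched " + bySeeking.bytesFetched + " bytes");
    }
}
```

In this toy model, the read-to-skip reader fetches about as many bytes as it skips, while the seeking reader fetches only the one buffer needed for the read that follows the skip.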

This change is intuitively much safer than the one we'd originally
identified, but we'd still like to confirm with you folks whether it's
likely safe and, if so, whether it's also potentially worth contributing.

Thanks,
Sk

On 2018-03-22 18:16, Benjamin Lerer wrote:
> You should check the 3.x release. CASSANDRA-10657 could have fixed your
> problem.

Re: Optimizing queries for partition keys

Posted by Benjamin Lerer <be...@datastax.com>.

You should check the 3.x release. CASSANDRA-10657 could have fixed your
problem.


On Thu, Mar 22, 2018 at 9:15 PM, Benjamin Lerer <benjamin.lerer@datastax.com> wrote:

> Sylvain explained the problem in CASSANDRA-4536:
> " Let me note that in CQL3 a row that have no live column don't exist, so
> we can't really implement this with a range slice having an empty columns
> list. Instead we should do a range slice with a full-row slice predicate
> with a count of 1, to make sure we do have a live column before including
> the partition key. "
>
> By using ColumnFilter.selectionBuilder(), you do not select all the
> columns. As a consequence, some partitions might be returned when they
> should not be.

Re: Optimizing queries for partition keys

Posted by Benjamin Lerer <be...@datastax.com>.

Sylvain explained the problem in CASSANDRA-4536:
" Let me note that in CQL3 a row that have no live column don't exist, so
we can't really implement this with a range slice having an empty columns
list. Instead we should do a range slice with a full-row slice predicate
with a count of 1, to make sure we do have a live column before including
the partition key. "

By using ColumnFilter.selectionBuilder(), you do not select all the
columns. As a consequence, some partitions might be returned when they
should not be.
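
The concern can be illustrated with a toy model. It assumes, purely for illustration, that a query which fetches no columns has no way to check whether a partition still holds a live column. It does not reproduce Cassandra's storage engine, and DistinctKeysDemo, Cell, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names are Cassandra APIs. */
final class DistinctKeysDemo {
    /** One cell: a column name plus whether it is live (not deleted or expired). */
    static final class Cell {
        final String column;
        final boolean live;

        Cell(String column, boolean live) {
            this.column = column;
            this.live = live;
        }
    }

    /** Reads each partition's cells and keeps a key only if some cell is live. */
    static List<String> keysCheckingLiveness(Map<String, List<Cell>> table) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, List<Cell>> partition : table.entrySet()) {
            for (Cell cell : partition.getValue()) {
                if (cell.live) {
                    keys.add(partition.getKey());
                    break; // one live cell is enough
                }
            }
        }
        return keys;
    }

    /** Reads no cells at all, so it cannot tell live partitions from dead ones. */
    static List<String> keysWithoutReadingCells(Map<String, List<Cell>> table) {
        return new ArrayList<>(table.keySet());
    }

    /** Builds a two-partition table: "a" has a live cell, "b" has only deleted cells. */
    static Map<String, List<Cell>> sampleTable() {
        Map<String, List<Cell>> table = new LinkedHashMap<>();
        table.put("a", List.of(new Cell("v1", true), new Cell("v2", false)));
        table.put("b", List.of(new Cell("v1", false), new Cell("v2", false)));
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<Cell>> table = sampleTable();
        System.out.println(keysCheckingLiveness(table));
        System.out.println(keysWithoutReadingCells(table));
    }
}
```

In this model, partition "b" is reported only by the variant that reads no cells, which is the kind of spurious result described above.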
