Posted to commits@cassandra.apache.org by "Jordan West (JIRA)" <ji...@apache.org> on 2019/08/06 14:44:00 UTC
[jira] [Comment Edited] (CASSANDRA-15259) Selecting Index by Lowest
Mean Column Count Selects Random Index
[ https://issues.apache.org/jira/browse/CASSANDRA-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900575#comment-16900575 ]
Jordan West edited comment on CASSANDRA-15259 at 8/6/19 2:43 PM:
-----------------------------------------------------------------
[~bdeggleston] good catch re: 2.x sstables. Off the top of my head I see two ways to handle that, besides excluding the legacy sstables from the calculation, which would be broken.
I think I prefer {{getMeanRowCount2}} (average of the row count and column count) because in the case of 100% legacy sstables or 100% new sstables it degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}, respectively. Neither implementation is ideal, since we have to handle it at the per-sstable level and what that means for an average is ambiguous.
Also, I wonder if the method name should change and/or if the logic should be moved somewhere index-specific like {{CassandraIndex}}, now that what it's doing is a bit more specialized and less clear. WDYT?
{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            // legacy sstable: fall back to the column count histogram
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }
    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            // legacy sstable: track column counts separately
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }
    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;
    long totalCount = totalPartitions + legacyTotal;
    // guard against dividing by zero when there are no sstables
    return totalCount > 0 ? (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / totalCount) : 0;
}
{code}
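To sanity-check the degradation property mentioned above, here is a minimal standalone sketch of the {{getMeanRowCount2}} arithmetic. It does not use any Cassandra classes: {{SSTableStats}} and its fields are hypothetical stand-ins for the values read from {{SSTableReader}}, and the numbers are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MeanRowCountSketch
{
    // Hypothetical stand-in for the per-sstable statistics read above.
    static final class SSTableStats
    {
        final boolean storesRows; // false for a legacy sstable
        final long count;         // partitions, or histogram entries if legacy
        final long total;         // rows, or columns (mean * count) if legacy

        SSTableStats(boolean storesRows, long count, long total)
        {
            this.storesRows = storesRows;
            this.count = count;
            this.total = total;
        }
    }

    // Same arithmetic as getMeanRowCount2; returns 0 when there is nothing to average.
    static int weightedMean(List<SSTableStats> sstables)
    {
        long totalRows = 0, totalPartitions = 0, legacyCols = 0, legacyTotal = 0;
        for (SSTableStats s : sstables)
        {
            if (s.storesRows)
            {
                totalPartitions += s.count;
                totalRows += s.total;
            }
            else
            {
                legacyTotal += s.count;
                legacyCols += s.total;
            }
        }
        int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
        int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;
        long totalCount = totalPartitions + legacyTotal;
        return totalCount > 0
             ? (int) ((rowMean * totalPartitions + legacyMean * legacyTotal) / totalCount)
             : 0;
    }

    public static void main(String[] args)
    {
        SSTableStats newFormat = new SSTableStats(true, 10, 50); // 5 rows per partition
        SSTableStats legacy = new SSTableStats(false, 10, 30);   // 3 columns per entry

        System.out.println(weightedMean(Collections.singletonList(newFormat))); // 5
        System.out.println(weightedMean(Collections.singletonList(legacy)));    // 3
        System.out.println(weightedMean(Arrays.asList(newFormat, legacy)));     // 4
    }
}
```

With only new-format input the result is the plain row mean, with only legacy input it is the plain column mean, and a mix lands between the two, weighted by each side's count.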
was (Author: jrwest):
[~bdeggleston] good catch re: 2.1 sstables. Off the top of my head I see two ways to handle that, besides excluding the legacy sstables from the calculation, which would be broken.
> Selecting Index by Lowest Mean Column Count Selects Random Index
> ----------------------------------------------------------------
>
> Key: CASSANDRA-15259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15259
> Project: Cassandra
> Issue Type: Bug
> Components: Feature/2i Index
> Reporter: Jordan West
> Assignee: Jordan West
> Priority: Urgent
> Fix For: 3.0.19, 4.0, 3.11.x
>
>
> {{CassandraIndex}} uses [{{ColumnFamilyStore#getMeanColumns}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/internal/CassandraIndex.java#L273], average columns per partition, which always returns the same answer for index CFs because they contain no regular columns and clustering columns aren't included in the count in Cassandra 3.0+.
>
>
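As a rough illustration of why a constant estimate makes the choice effectively arbitrary, the sketch below picks the candidate with the lowest estimate. The names ({{IndexCandidate}}, {{selectByLowestEstimate}}) are hypothetical and this is not the actual {{CassandraIndex}} selection code. It only shows that when every candidate reports the same value, the winner is decided by iteration order rather than by selectivity.

```java
import java.util.Arrays;
import java.util.List;

public class IndexSelectionSketch
{
    // Hypothetical candidate: an index name plus its per-partition estimate.
    static final class IndexCandidate
    {
        final String name;
        final long estimate;

        IndexCandidate(String name, long estimate)
        {
            this.name = name;
            this.estimate = estimate;
        }
    }

    // Returns the name of the first candidate with the strictly lowest estimate,
    // or null if there are no candidates.
    static String selectByLowestEstimate(List<IndexCandidate> candidates)
    {
        IndexCandidate best = null;
        for (IndexCandidate c : candidates)
        {
            if (best == null || c.estimate < best.estimate)
                best = c;
        }
        return best == null ? null : best.name;
    }

    public static void main(String[] args)
    {
        // Every index reports the same estimate: the order of the list decides.
        IndexCandidate a = new IndexCandidate("idx_a", 1);
        IndexCandidate b = new IndexCandidate("idx_b", 1);
        System.out.println(selectByLowestEstimate(Arrays.asList(a, b))); // idx_a
        System.out.println(selectByLowestEstimate(Arrays.asList(b, a))); // idx_b

        // Distinct estimates: the result no longer depends on the order.
        IndexCandidate wide = new IndexCandidate("idx_wide", 100);
        IndexCandidate narrow = new IndexCandidate("idx_narrow", 3);
        System.out.println(selectByLowestEstimate(Arrays.asList(wide, narrow))); // idx_narrow
        System.out.println(selectByLowestEstimate(Arrays.asList(narrow, wide))); // idx_narrow
    }
}
```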