Posted to commits@cassandra.apache.org by "Jordan West (JIRA)" <ji...@apache.org> on 2019/08/06 14:44:00 UTC
[jira] [Comment Edited] (CASSANDRA-15259) Selecting Index by Lowest Mean Column Count Selects Random Index

    [ https://issues.apache.org/jira/browse/CASSANDRA-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900575#comment-16900575 ] 

Jordan West edited comment on CASSANDRA-15259 at 8/6/19 2:43 PM:
-----------------------------------------------------------------

[~bdeggleston] good catch re: 2.x sstables. Off the top of my head I see two ways to handle that, besides leaving the legacy sstables out of the calculation, which is broken.

I think I prefer {{getMeanRowCount2}} (average of the row count and column count) because with 100% legacy sstables or 100% new sstables it degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}, respectively. Neither implementation is ideal, since we have to handle this at the per-sstable level and what that means for an average is ambiguous.

Also, I wonder if the method name should change and/or if the logic should be moved somewhere index-specific like {{CassandraIndex}}, now that what it's doing is a bit more specialized and less clear. WDYT?

 
{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            // legacy sstable: no row count, so fold the column count estimate into the same totals
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }

    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            // legacy sstable: track the column count estimate separately from the row counts
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }

    long total = totalPartitions + legacyTotal;
    if (total == 0)
        return 0; // no sstables: avoid dividing by zero below

    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;

    // weight each mean by the number of entries it was computed from
    return (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / total);
}
{code}
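
To make the difference between the two variants concrete, here is a minimal standalone sketch. {{TableStats}}, {{pooledMean}} and {{weightedMean}} are hypothetical stand-ins made up for illustration (they are not Cassandra classes or methods), and each sstable is reduced to the two numbers the methods above read from it. As far as I can tell, the two variants only differ in where the integer truncation happens, so they agree for 100% new or 100% legacy sstables and can differ by rounding when the two are mixed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the per-sstable numbers used above; not a Cassandra class.
final class TableStats
{
    final boolean storesRows; // true for new sstables, false for legacy ones
    final long count;         // partitions, or the column count histogram's count() for legacy
    final long total;         // total rows, or mean() * count() columns for legacy

    private TableStats(boolean storesRows, long count, long total)
    {
        this.storesRows = storesRows;
        this.count = count;
        this.total = total;
    }

    static TableStats rows(long partitions, long totalRows)
    {
        return new TableStats(true, partitions, totalRows);
    }

    static TableStats legacy(long histogramCount, long meanColumns)
    {
        return new TableStats(false, histogramCount, meanColumns * histogramCount);
    }
}

final class MeanRowCountSketch
{
    // First variant: pool rows and legacy columns into one numerator and one denominator.
    static int pooledMean(List<TableStats> tables)
    {
        long total = 0;
        long count = 0;
        for (TableStats t : tables)
        {
            total += t.total;
            count += t.count;
        }
        return count > 0 ? (int) (total / count) : 0;
    }

    // Second variant: compute the two means separately, then weight them by their counts.
    static int weightedMean(List<TableStats> tables)
    {
        long totalRows = 0, totalPartitions = 0, legacyCols = 0, legacyTotal = 0;
        for (TableStats t : tables)
        {
            if (t.storesRows)
            {
                totalRows += t.total;
                totalPartitions += t.count;
            }
            else
            {
                legacyCols += t.total;
                legacyTotal += t.count;
            }
        }
        long all = totalPartitions + legacyTotal;
        if (all == 0)
            return 0; // nothing to average
        long rowMean = totalPartitions > 0 ? totalRows / totalPartitions : 0;
        long legacyMean = legacyTotal > 0 ? legacyCols / legacyTotal : 0;
        return (int) ((rowMean * totalPartitions + legacyMean * legacyTotal) / all);
    }

    public static void main(String[] args)
    {
        List<TableStats> mixed = Arrays.asList(TableStats.rows(3, 5), TableStats.legacy(1, 3));
        System.out.println(pooledMean(mixed));   // (5 + 3) / (3 + 1)
        System.out.println(weightedMean(mixed)); // (1 * 3 + 3 * 1) / (3 + 1)
    }
}
```

With only new or only legacy entries in the list, both methods return the same value; the mixed case in {{main}} is one where the per-group truncation in the second variant changes the result.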
 


was (Author: jrwest):
[~bdeggleston] good catch re: 2.1 sstables. Off the top of my head I see two ways to handle that, besides leaving the legacy sstables out of the calculation, which is broken.


> Selecting Index by Lowest Mean Column Count Selects Random Index
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-15259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15259
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/2i Index
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Urgent
>             Fix For: 3.0.19, 4.0, 3.11.x
>
>
> {{CassandraIndex}} uses [{{ColumnFamilyStore#getMeanColumns}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/internal/CassandraIndex.java#L273], average columns per partition, which always returns the same answer for index CFs because they contain no regular columns and clustering columns aren't included in the count in Cassandra 3.0+.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
