Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2013/09/19 14:36:53 UTC

[jira] [Comment Edited] (CASSANDRA-5906) Avoid allocating over-large bloom filters

    [ https://issues.apache.org/jira/browse/CASSANDRA-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771836#comment-13771836 ] 

Jonathan Ellis edited comment on CASSANDRA-5906 at 9/19/13 12:35 PM:
---------------------------------------------------------------------

bq. Since ByteBuffer's hashCode is only a function of the number of bits remaining we cannot use it directly in the offer function.

I don't follow -- that should be exactly the desired behavior.  The ByteBuffer offset/remaining are telling us, "this is the part of the backing array that we're interested in," which lets us "split up" regions of memory without having to actually copy them to new arrays.  So BBU.getArray is only for cases where an API accepts nothing but arrays, and a possible copy is the only alternative:

{code}
/**
 * You should almost never use this.  Instead, use the write* methods to avoid copies.
 */
public static byte[] getArray(ByteBuffer buffer)
{
    int length = buffer.remaining();

    if (buffer.hasArray())
    {
        int boff = buffer.arrayOffset() + buffer.position();
        if (boff == 0 && length == buffer.array().length)
            return buffer.array();
        else
            return Arrays.copyOfRange(buffer.array(), boff, boff + length);
    }
    // else, DirectByteBuffer.get() is the fastest route
    byte[] bytes = new byte[length];
    buffer.duplicate().get(bytes);

    return bytes;
}
{code}
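
For illustration only (not part of the patch), a minimal sketch of the point above: two buffer "windows" can share one backing array with no copying, and equals/hashCode are defined over just the remaining elements, so each window behaves independently:

{code}
import java.nio.ByteBuffer;

public class BufferWindowDemo
{
    public static void main(String[] args)
    {
        byte[] backing = new byte[] { 1, 2, 3, 4, 5, 6 };

        // Two non-overlapping windows onto the same backing array -- no copying.
        ByteBuffer left = ByteBuffer.wrap(backing, 0, 3);
        ByteBuffer right = ByteBuffer.wrap(backing, 3, 3);

        // equals/hashCode consider only the remaining elements (position
        // through limit - 1), so the windows compare independently even
        // though they share storage.
        System.out.println(left.equals(right)); // false: contents differ
    }
}
{code}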

bq. The size of the HLL is a function of how precise you need it to be. If we use a p of 15 instead of 16 the size drops to 21K. Inserting the same 500K elements into an HLL+ with p=15 yields an error of .58% in my tests.

So, we can trade a factor of 2 in size for roughly a factor of 2 in precision?  Unless there's a use for keeping these on heap that I'm not thinking of, I'd say we should double the size and only read them in for compaction.
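
For reference, the standard HLL relative-error estimate is ~1.04/sqrt(m) with m = 2^p registers, which lines up with the numbers quoted above (a sketch, not from the patch):

{code}
public class HllErrorEstimate
{
    public static void main(String[] args)
    {
        // Standard HyperLogLog error estimate: ~1.04 / sqrt(m), m = 2^p registers.
        double err16 = 1.04 / Math.sqrt(1 << 16); // ~0.41% at p=16
        double err15 = 1.04 / Math.sqrt(1 << 15); // ~0.57% at p=15, matching the observed .58%
        System.out.printf("p=16: %.2f%%  p=15: %.2f%%%n", err16 * 100, err15 * 100);
    }
}
{code}

Note that dropping p by one halves the register count but only grows the error by a factor of sqrt(2).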
                
> Avoid allocating over-large bloom filters
> -----------------------------------------
>
>                 Key: CASSANDRA-5906
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5906
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Yuki Morishita
>             Fix For: 2.0.1
>
>
> We conservatively estimate the number of partitions post-compaction to be the total number of partitions pre-compaction.  That is, we assume the worst-case scenario of no partition overlap at all.
> This can result in substantial wasted memory in sstables produced by highly overlapping compactions.
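
To make the waste concrete, a hedged sketch using the standard bloom filter sizing formula (the helper below is illustrative, not Cassandra's API):

{code}
public class BloomSizing
{
    // Optimal bloom filter size in bits for n keys at false-positive chance p:
    // m = -n * ln(p) / (ln 2)^2  (standard formula; helper name is made up).
    static long bloomFilterBits(long keys, double fpChance)
    {
        return (long) Math.ceil(-keys * Math.log(fpChance) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args)
    {
        // Compacting two fully-overlapping sstables of 1M partitions each:
        long assumedNoOverlap = bloomFilterBits(2_000_000, 0.01); // ~19.2M bits allocated
        long actuallyNeeded   = bloomFilterBits(1_000_000, 0.01); // ~9.6M bits for the real cardinality
        System.out.println((double) assumedNoOverlap / actuallyNeeded); // 2.0 -- the wasted factor
    }
}
{code}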
