Posted to commits@cassandra.apache.org by "Dmitry Erokhin (JIRA)" <ji...@apache.org> on 2017/06/05 17:18:04 UTC

[jira] [Commented] (CASSANDRA-13545) Exception in CompactionExecutor leading to tmplink files not being removed

    [ https://issues.apache.org/jira/browse/CASSANDRA-13545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037233#comment-16037233 ] 

Dmitry Erokhin commented on CASSANDRA-13545:
--------------------------------------------

One of our engineers has found at least one issue that leads to this condition. His findings are below.
---

With a consistent reproduction outside of the production cluster, I downloaded the Cassandra source code, set up a remote debugger (Eclipse), and connected it to the Cassandra process running on my node.
 
At this point I was able to set breakpoints and examine a live system, starting at the last frame in the traceback (org.apache.cassandra.io.sstable.IndexSummary.<init>(IndexSummary.java:86)). Stepping through the code during a live compaction, I was able to determine that the issue is indeed a bug in Cassandra that occurs when it tries to run a compaction job with a very large number of partitions.
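 
For anyone reproducing this: a typical way to attach a remote debugger to the Cassandra JVM is the standard JDWP agent (the port below is arbitrary), e.g. appended to JVM_OPTS in conf/cassandra-env.sh:
{code}
# enable the JDWP remote-debug agent without pausing startup (suspend=n)
JVM_OPTS="$JVM_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=1414"
{code}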
 
The SafeMemoryWriter class is used to build the index summary for the new sstable.
{code:java}
public class SafeMemoryWriter extends DataOutputBuffer
{
    private SafeMemory memory;
 
    @SuppressWarnings("resource")
    public SafeMemoryWriter(long initialCapacity)
    {
        this(new SafeMemory(initialCapacity));
    }
 
    private SafeMemoryWriter(SafeMemory memory)
    {
        super(tailBuffer(memory).order(ByteOrder.BIG_ENDIAN));
        this.memory = memory;
    }
 
    public SafeMemory currentBuffer()
    {
        return memory;
    }
 
    @Override
    protected void reallocate(long count)
    {
        long newCapacity = calculateNewSize(count);
        if (newCapacity != capacity())
        {
            long position = length();
            ByteOrder order = buffer.order();
 
            SafeMemory oldBuffer = memory;
            memory = this.memory.copy(newCapacity);
            buffer = tailBuffer(memory);
 
            int newPosition = (int) (position - tailOffset(memory));
            buffer.position(newPosition);
            buffer.order(order);
 
            oldBuffer.free();
        }
    }
 
    public void setCapacity(long newCapacity)
    {
        reallocate(newCapacity);
    }
 
    public void close()
    {
        memory.close();
    }
 
    public Throwable close(Throwable accumulate)
    {
        return memory.close(accumulate);
    }
 
    public long length()
    {
        return tailOffset(memory) +  buffer.position();
    }
 
    public long capacity()
    {
        return memory.size();
    }
 
    @Override
    public SafeMemoryWriter order(ByteOrder order)
    {
        super.order(order);
        return this;
    }
 
    @Override
    public long validateReallocation(long newSize)
    {
        return newSize;
    }
 
    private static long tailOffset(Memory memory)
    {
        return Math.max(0, memory.size - Integer.MAX_VALUE);
    }
 
    private static ByteBuffer tailBuffer(Memory memory)
    {
        return memory.asByteBuffer(tailOffset(memory), (int) Math.min(memory.size, Integer.MAX_VALUE));
    }
}
{code}
This class appears to be intended to work with buffers larger than Integer.MAX_VALUE. However, if the initial size of the buffer is already larger than that, the initial value of length() will be incorrect (it won't be zero), and writing via the DataOutputBuffer will start at the wrong location (it won't start at offset 0).
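 
A minimal standalone sketch (not Cassandra code) of that arithmetic, assuming a hypothetical 3 GiB initial capacity:
{code:java}
// Sketch of the tailOffset()/length() math above for an initial capacity
// larger than Integer.MAX_VALUE (hypothetical 3 GiB here).
public class TailOffsetDemo
{
    public static void main(String[] args)
    {
        long initialCapacity = 3L * 1024 * 1024 * 1024; // 3 GiB > Integer.MAX_VALUE

        // tailOffset(memory) == max(0, size - Integer.MAX_VALUE)
        long tailOffset = Math.max(0, initialCapacity - Integer.MAX_VALUE);

        // Immediately after construction buffer.position() == 0, so
        // length() == tailOffset + 0 -- it should be 0 but is not:
        System.out.println(tailOffset); // 1073741825: writes start ~1 GiB in
    }
}
{code}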
 
 
{code:java}
    public IndexSummaryBuilder(long expectedKeys, int minIndexInterval, int samplingLevel)
    {
        this.samplingLevel = samplingLevel;
        this.startPoints = Downsampling.getStartPoints(BASE_SAMPLING_LEVEL, samplingLevel);
 
        long maxExpectedEntries = expectedKeys / minIndexInterval;
        if (maxExpectedEntries > Integer.MAX_VALUE)
        {
            // that's a _lot_ of keys, and a very low min index interval
            int effectiveMinInterval = (int) Math.ceil((double) Integer.MAX_VALUE / expectedKeys);
            maxExpectedEntries = expectedKeys / effectiveMinInterval;
            assert maxExpectedEntries <= Integer.MAX_VALUE : maxExpectedEntries;
            logger.warn("min_index_interval of {} is too low for {} expected keys; using interval of {} instead",
                        minIndexInterval, expectedKeys, effectiveMinInterval);
            this.minIndexInterval = effectiveMinInterval;
        }
        else
        {
            this.minIndexInterval = minIndexInterval;
        }
 
        // for initializing data structures, adjust our estimates based on the sampling level
        maxExpectedEntries = Math.max(1, (maxExpectedEntries * samplingLevel) / BASE_SAMPLING_LEVEL);
        offsets = new SafeMemoryWriter(4 * maxExpectedEntries).order(ByteOrder.nativeOrder());
        entries = new SafeMemoryWriter(40 * maxExpectedEntries).order(ByteOrder.nativeOrder());
 
        // the summary will always contain the first index entry (downsampling will never remove it)
        nextSamplePosition = 0;
        indexIntervalMatches++;
    }
{code}
The bug occurs when the entries table in the index summary for the new sstable grows larger than Integer.MAX_VALUE bytes (2 GiB), which happens when expectedKeys > (Integer.MAX_VALUE / 40) * minIndexInterval. Our partitions for the blocks table have a mean size of 179 bytes, so we would expect to see issues on this table for compactions over about 1.12 TiB.
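 
The threshold arithmetic, as a standalone sketch (the 40 bytes per entry comes from the constructor above; the 179-byte mean partition size is what we observe on the blocks table):
{code:java}
// Back-of-the-envelope calculation of when the entries buffer exceeds
// Integer.MAX_VALUE bytes, triggering the SafeMemoryWriter bug.
public class ThresholdDemo
{
    public static void main(String[] args)
    {
        final int ENTRY_BYTES = 40;     // bytes reserved per entry (see above)
        int minIndexInterval = 128;     // the table default
        long meanPartitionBytes = 179;  // observed mean on the blocks table

        // 40 * (expectedKeys / minIndexInterval) > Integer.MAX_VALUE
        long keyThreshold = (long) Integer.MAX_VALUE / ENTRY_BYTES * minIndexInterval;

        double tib = keyThreshold * meanPartitionBytes / Math.pow(2, 40);
        System.out.printf("keys > %,d (~%.2f TiB of data)%n", keyThreshold, tib);
        // prints roughly: keys > 6,871,947,648 (~1.12 TiB of data)
    }
}
{code}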
 
The default value of min_index_interval is 128, but it is adjustable per table and can be used to avoid this condition. It should be set to a power of 2. I ran this CQL on my test node:
{code:sql}
ALTER TABLE tablename.blocks WITH min_index_interval = 512;
{code}
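(The new setting can be confirmed afterwards with DESCRIBE TABLE tablename.blocks in cqlsh, which prints the table's current min_index_interval.)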
Since this change, I haven’t seen the assertion. The compaction has proceeded much farther than before, but it has not completed yet since it is so large.
{noformat}
$ nodetool compactionstats -H
pending tasks: 1
                                     id   compaction type       keyspace    table   completed     total    unit   progress
   9965f4b0-4749-11e7-b21c-91cb0a91f895        Compaction   tablename   blocks   629.51 GB   1.34 TB   bytes     45.71%
Active compaction remaining time :        n/a
{noformat}
I would expect that making this change would fix the issue for all future compactions on all nodes.
 
The index summary is used to reduce disk I/O to the sstable index. A larger index interval yields a coarser, less efficient summary and more I/O to the sstable index. However, min_index_interval is only a minimum; the effective interval is controlled automatically by Cassandra. On p10 it is already 2048 for the larger blocks sstables, so I would not expect a performance impact.
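 
For intuition, a back-of-the-envelope sketch of that trade-off (the entry count is hypothetical):
{code:java}
// The summary keeps one sample per `interval` index entries, so raising the
// interval shrinks the summary but lengthens the index scan between samples.
public class IntervalTradeoffDemo
{
    public static void main(String[] args)
    {
        long indexEntries = 6_000_000_000L; // hypothetical large sstable
        for (int interval : new int[] { 128, 512, 2048 })
        {
            long samples = indexEntries / interval;
            System.out.printf("interval %4d -> %,12d summary samples, "
                            + "<= %d index entries scanned per lookup%n",
                            interval, samples, interval);
        }
    }
}
{code}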

The compaction subsequently failed with a new error:
{code}
ERROR [CompactionExecutor:6] 2017-06-04 10:15:26,115 CassandraDaemon.java:185 - Exception in thread Thread[CompactionExecutor:6,1,RMI Runtime]
java.lang.AssertionError: Illegal bounds [-2147483648..-2147483640); size: 3355443200
        at org.apache.cassandra.io.util.Memory.checkBounds(Memory.java:339) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:104) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.Memory.getLong(Memory.java:260) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:224) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedSegmentedFile.createMappedSegments(CompressedSegmentedFile.java:80) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.<init>(CompressedPoolingSegmentedFile.java:38) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:101) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:188) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:179) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:345) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:333) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:297) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:345) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.doPrepare(CompactionAwareWriter.java:79) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:179) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.finish(CompactionAwareWriter.java:89) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:196) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
{code}


> Exception in CompactionExecutor leading to tmplink files not being removed
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13545
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Dmitry Erokhin
>
> We are facing an issue where compactions fail on a few nodes with the following message
> {code}
> ERROR [CompactionExecutor:1248] 2017-05-22 15:32:55,390 CassandraDaemon.java:185 - Exception in thread Thread[CompactionExecutor:1248,1,main]
> java.lang.AssertionError: null
> 	at org.apache.cassandra.io.sstable.IndexSummary.<init>(IndexSummary.java:86) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.io.sstable.IndexSummaryBuilder.build(IndexSummaryBuilder.java:235) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openEarly(BigTableWriter.java:316) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.io.sstable.SSTableRewriter.maybeReopenEarly(SSTableRewriter.java:170) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:115) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.append(DefaultCompactionWriter.java:64) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:184) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.5.jar:2.2.5]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_121]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_121]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> {code}
> Also, the number of tmplink files in /var/lib/cassandra/data/<keyspace name>/blocks/tmplink* grows constantly until the node runs out of space. Restarting Cassandra removes all tmplink files, but the issue persists.
> We are using Cassandra 2.2.5 on Debian 8 with Oracle Java 8.
> {code}
> root@cassandra-p10:/var/lib/cassandra/data/mugenstorage/blocks-33167ef0447a11e68f3e5b42fc45b62f# dpkg -l | grep -E "java|cassandra"
> ii  cassandra                      2.2.5                        all          distributed storage system for structured data
> ii  cassandra-tools                2.2.5                        all          distributed storage system for structured data
> ii  java-common                    0.52                         all          Base of all Java packages
> ii  javascript-common              11                           all          Base support for JavaScript library packages
> ii  oracle-java8-installer         8u121-1~webupd8~0            all          Oracle Java(TM) Development Kit (JDK) 8
> ii  oracle-java8-set-default       8u121-1~webupd8~0            all          Set Oracle JDK 8 as default Java
> {code}


