Posted to user@cassandra.apache.org by Jai Bheemsen Rao Dhanwada <ja...@gmail.com> on 2018/10/23 16:48:05 UTC

Re: Bootstrap streaming issues

Did anyone run into similar issues?

On Thu, Sep 6, 2018 at 10:27 AM Jai Bheemsen Rao Dhanwada <
jaibheemsen@gmail.com> wrote:

> Here is the stack trace from the failure. It looks like it's trying to
> gather all the column family metrics and going OOM. Is this just for the JMX
> metrics?
>
>
> https://github.com/apache/cassandra/blob/cassandra-2.1.16/src/java/org/apache/cassandra/metrics/ColumnFamilyMetrics.java
>
> ERROR [MessagingService-Incoming-/10.133.33.57] 2018-09-06 15:43:19,280 CassandraDaemon.java:231 - Exception in thread Thread[MessagingService-Incoming-/x.x.x.x,5,main]
> java.lang.OutOfMemoryError: Java heap space
>         at java.io.DataInputStream.<init>(DataInputStream.java:58) ~[na:1.8.0_151]
>         at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:139) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.16.jar:2.1.16]
> ERROR [InternalResponseStage:1] 2018-09-06 15:43:19,281 CassandraDaemon.java:231 - Exception in thread Thread[InternalResponseStage:1,5,main]
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.cassandra.metrics.ColumnFamilyMetrics$AllColumnFamilyMetricNameFactory.createMetricName(ColumnFamilyMetrics.java:784) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.metrics.ColumnFamilyMetrics.createColumnFamilyHistogram(ColumnFamilyMetrics.java:716) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.metrics.ColumnFamilyMetrics.<init>(ColumnFamilyMetrics.java:597) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:361) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:527) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:498) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:335) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.DefsTables.addColumnFamily(DefsTables.java:385) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:293) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:194) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:166) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:75) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:54) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.16.jar:2.1.16]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
>         at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]
>
> On Thu, Aug 30, 2018 at 12:51 PM Jai Bheemsen Rao Dhanwada <
> jaibheemsen@gmail.com> wrote:
>
>> thank you
>>
>> On Thu, Aug 30, 2018 at 11:58 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>>> This is the closest JIRA that comes to mind (from memory, I didn't
>>> search, there may be others):
>>> https://issues.apache.org/jira/browse/CASSANDRA-8150
>>>
>>> The best all-in-one blog post on tuning GC in Cassandra is
>>> actually Amy's 2.1 tuning guide:
>>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html -
>>> it's somewhat out of date as it's for 2.1, but since that's what you're
>>> running, that works out in your favor.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 30, 2018 at 10:53 AM Jai Bheemsen Rao Dhanwada <
>>> jaibheemsen@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> Is there a JIRA that discusses whether increasing the heap will help?
>>>> Also, are there any alternatives to increasing the heap size? The last time
>>>> I tried increasing the heap, the longer GC pauses caused more damage in
>>>> terms of latency.
>>>>
>>>> On Wed, Aug 29, 2018 at 11:07 PM Jai Bheemsen Rao Dhanwada <
>>>> jaibheemsen@gmail.com> wrote:
>>>>
>>>>> okay, thank you
>>>>>
>>>>> On Wed, Aug 29, 2018 at 11:04 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>
>>>>>> You’re seeing an OOM, not a socket error / timeout.
>>>>>>
>>>>>> --
>>>>>> Jeff Jirsa
>>>>>>
>>>>>>
>>>>>> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada <
>>>>>> jaibheemsen@gmail.com> wrote:
>>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> Any idea whether this is somehow related to
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11840?
>>>>>> Would increasing streaming_socket_timeout_in_ms to a
>>>>>> higher value help?
>>>>>>
>>>>>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada <
>>>>>> jaibheemsen@gmail.com> wrote:
>>>>>>
>>>>>>> I have 72 nodes in the cluster, across 8 datacenters. The moment I
>>>>>>> try to increase the node count above 84 or so, the issue starts.
>>>>>>>
>>>>>>> I am still using CMS, and I assume that increasing the heap size
>>>>>>> beyond the recommended 8G will do more harm than good.
>>>>>>>
>>>>>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Given the size of your schema, you’re probably getting flooded with
>>>>>>>> a bunch of huge schema mutations as the new node hops into gossip and
>>>>>>>> tries to pull the schema from every host it sees. You say 8 DCs but you
>>>>>>>> don’t say how many nodes - I’m guessing it’s a lot?
>>>>>>>>
>>>>>>>> This is something that’s incrementally better in 3.0, but a real
>>>>>>>> proper fix has been talked about a few times -
>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11748 and
>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13569 for example.
>>>>>>>>
>>>>>>>> In the short term, you may be able to work around this by
>>>>>>>> increasing your heap size. If that doesn’t work, there’s an ugly, ugly hack
>>>>>>>> that’ll work on 2.1: limiting the number of schema blobs you can get at a
>>>>>>>> time. In this case, that means firewalling off all but a few nodes in your
>>>>>>>> cluster for 10-30 seconds, making sure the new node gets the schema (watch
>>>>>>>> the logs or file system for the tables to be created), then removing the
>>>>>>>> firewall so it can start the bootstrap process. It needs the schema to set
>>>>>>>> up the streaming plan, and it needs all the hosts up in gossip to stream
>>>>>>>> successfully, so this is an ugly hack to give you time to get the schema
>>>>>>>> and then heal the cluster so it can bootstrap.
>>>>>>>>
>>>>>>>> Yeah, that’s awful. Hopefully one of the two JIRAs above lands to
>>>>>>>> make this less awful.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Jirsa
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada <
>>>>>>>> jaibheemsen@gmail.com> wrote:
>>>>>>>>
>>>>>>>> It fails before bootstrap.
>>>>>>>>
>>>>>>>> Streaming throughput on the nodes is set to 400 Mb/s.
>>>>>>>>
>>>>>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Is the bootstrap plan succeeding (does streaming start or does it
>>>>>>>>> crash before it logs messages about streaming starting)?
>>>>>>>>>
>>>>>>>>> Have you capped the stream throughput on the existing hosts?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Jirsa
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada <
>>>>>>>>> jaibheemsen@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hello All,
>>>>>>>>>
>>>>>>>>> We are seeing an issue when we add more nodes to the cluster:
>>>>>>>>> the bootstrapping node is not able to stream the entire metadata and
>>>>>>>>> fails to bootstrap. Eventually the process dies with an OOM
>>>>>>>>> (java.lang.OutOfMemoryError: Java heap space).
>>>>>>>>>
>>>>>>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>>>>>>
>>>>>>>>> Cassandra version: 2.1.16
>>>>>>>>> # of KS and CF: 100, 3000 (approx)
>>>>>>>>> # of DC: 8
>>>>>>>>> # of vnodes per node: 256
>>>>>>>>>
>>>>>>>>> Not sure what is causing this behavior. Has anyone come across
>>>>>>>>> this scenario?
>>>>>>>>> Thanks in advance.
>>>>>>>>>
>>>>>>>>>

Re: Bootstrap streaming issues

Posted by Jai Bheemsen Rao Dhanwada <ja...@gmail.com>.

Also, I see this issue only when I have more column families. It looks like
it depends on the combination of the number of vnodes and the number of CFs.
Does anyone have any idea about this?
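
Jeff's explanation earlier in this thread (a joining node pulling the full schema from every host it sees) can be put into rough numbers. The following is a minimal Python sketch, not a measurement. The node and table counts come from the thread; the 4 KiB size of one serialized table definition is a made-up placeholder, and the function name is hypothetical. Real heap usage would also include deserialized objects and per-table metrics, which this does not model.

```python
# Back-of-envelope estimate only; not derived from Cassandra internals.
# ASSUMPTION: 4 KiB per serialized table definition is a placeholder value.
ASSUMED_BYTES_PER_TABLE = 4096


def estimate_schema_pull_bytes(peers: int, tables: int,
                               bytes_per_table: int = ASSUMED_BYTES_PER_TABLE) -> int:
    """Bytes in flight if every peer answers a schema pull with the full schema at once."""
    return peers * tables * bytes_per_table


if __name__ == "__main__":
    tables = 3000  # approximate CF count reported in the thread
    # 83 peers: an 84-node cluster minus the joining node.
    # 3 peers: the "firewall off all but a few nodes" workaround.
    for peers in (83, 3):
        mib = estimate_schema_pull_bytes(peers, tables) / 2**20
        print(f"{peers} peers: ~{mib:.0f} MiB of schema responses")
```

Under these assumptions the total grows linearly with the number of peers that answer, which is consistent with the suggestion to let the new node see only a few hosts while it fetches the schema.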