Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2019/06/28 09:08:26 UTC

[GitHub] [pulsar] massakam opened a new issue #4634: Delivery of messages to consumers stops until broker is restarted
URL: https://github.com/apache/pulsar/issues/4634
 
 
   In our Pulsar cluster, message delivery for a specific topic stopped twice in one day. The affected topic was different in each event, but the two topics have the following points in common:
   
   - They are topics under the same namespace. The namespace policy is as follows:
   ```json
   {
     "auth_policies" : {
       "namespace_auth" : {
         "xxx.xxx.xxx" : [ "consume", "produce" ],
         "xxx.xxx.xxx" : [ "consume", "produce" ]
       },
       "destination_auth" : { },
       "subscription_auth_roles" : { }
     },
     "replication_clusters" : [ "cluster-a", "cluster-b" ],
     "bundles" : {
       "boundaries" : [ "0x00000000", "0x20000000", "0x30000000", "0x34000000", "0x36000000", "0x36400000", "0x36600000", "0x36608000", "0x3660c000", "0x3660e000", "0x3660e800", "0x3660ea00", "0x3660eb00", "0x3660ec00", "0x3660f000", "0x36610000", "0x36620000", "0x36640000", "0x36680000", "0x36700000", "0x36800000", "0x37000000", "0x38000000", "0x40000000", "0x44000000", "0x46000000", "0x46080000", "0x460a0000", "0x460a1000", "0x460a1800", "0x460a1c00", "0x460a1e00", "0x460a1e40", "0x460a1e80", "0x460a1f00", "0x460a2000", "0x460a4000", "0x460a8000", "0x460b0000", "0x460c0000", "0x46100000", "0x46200000", "0x46400000", "0x46800000", "0x47000000", "0x48000000", "0x50000000", "0x60000000", "0x80000000", "0xa0000000", "0xa4000000", "0xa6000000", "0xa6800000", "0xa6c00000", "0xa6c40000", "0xa6c48000", "0xa6c4c000", "0xa6c4d000", "0xa6c4d800", "0xa6c4dc00", "0xa6c4dc40", "0xa6c4dc41", "0xa6c4dc42", "0xa6c4dc44", "0xa6c4dc48", "0xa6c4dc50", "0xa6c4dc60", "0xa6c4dc80", "0xa6c4dd00", "0xa6c4de00", "0xa6c4e000", "0xa6c50000", "0xa6c60000", "0xa6c80000", "0xa6d00000", "0xa6e00000", "0xa7000000", "0xa8000000", "0xac000000", "0xae000000", "0xaf000000", "0xaf400000", "0xaf600000", "0xaf680000", "0xaf690000", "0xaf698000", "0xaf69a000", "0xaf69b000", "0xaf69b800", "0xaf69ba00", "0xaf69bb00", "0xaf69bb10", "0xaf69bb20", "0xaf69bb40", "0xaf69bb80", "0xaf69bc00", "0xaf69c000", "0xaf6a0000", "0xaf6c0000", "0xaf700000", "0xaf800000", "0xb0000000", "0xc0000000", "0xcfffffff", "0xd7ffffff", "0xd83fffff", "0xd85fffff", "0xd867ffff", "0xd86bffff", "0xd86dffff", "0xd86e7fff", "0xd86e87ff", "0xd86e89ff", "0xd86e8aff", "0xd86e8b7f", "0xd86e8b9f", "0xd86e8bbf", "0xd86e8bff", "0xd86e8fff", "0xd86e9fff", "0xd86ebfff", "0xd86effff", "0xd86fffff", "0xd87fffff", "0xd8ffffff", "0xd9ffffff", "0xdbffffff", "0xdfffffff", "0xffffffff" ],
       "numBundles" : 128
     },
     "backlog_quota_map" : {
       "destination_storage" : {
         "limit" : 107374182400,
         "policy" : "consumer_backlog_eviction"
       }
     },
     "clusterDispatchRate" : {
       "kks" : {
         "dispatchThrottlingRateInMsg" : 40000,
         "dispatchThrottlingRateInByte" : 40960000,
         "ratePeriodInSecond" : 1
       },
       "ssk" : {
         "dispatchThrottlingRateInMsg" : 40000,
         "dispatchThrottlingRateInByte" : 40960000,
         "ratePeriodInSecond" : 1
       }
     },
     "subscriptionDispatchRate" : {
       "kks" : {
         "dispatchThrottlingRateInMsg" : 1000,
         "dispatchThrottlingRateInByte" : -1,
         "ratePeriodInSecond" : 1
       },
       "ssk" : {
         "dispatchThrottlingRateInMsg" : 1000,
         "dispatchThrottlingRateInByte" : -1,
         "ratePeriodInSecond" : 1
       }
     },
     "clusterSubscribeRate" : { },
     "deduplicationEnabled" : false,
     "latency_stats_sample_rate" : { },
     "message_ttl_in_seconds" : 0,
     "retention_policies" : {
       "retentionTimeInMinutes" : 10080,
       "retentionSizeInMB" : 102401
     },
     "deleted" : false,
     "encryption_required" : false,
     "subscription_auth_mode" : "None",
     "max_producers_per_topic" : 0,
     "max_consumers_per_topic" : 0,
     "max_consumers_per_subscription" : 0,
     "compaction_threshold" : 0,
     "offload_threshold" : -1,
     "schema_auto_update_compatibility_strategy" : "Full"
   }
   ```
   - They are both partitioned topics that have 16 partitions.
   - Each has a large number of exclusive subscriptions, each with a single connected consumer.
   
   We tried to reproduce the problem but have not succeeded.
   
   # 1st event
   
   Users of the topic in question did the following:
   
   1. Created subscriptions via the REST API
   2. Reset the cursors of the created subscriptions to a past position
   3. Started consumers connecting to the topic
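   
   The first two steps can be sketched against the admin REST API as below. This is only an illustration: the broker address, tenant, namespace, topic, and subscription names are placeholders, not the actual ones from our cluster.
   
   ```sh
   # 1. Create a subscription (hypothetical names).
   curl -X PUT \
     http://broker:8080/admin/v2/persistent/my-tenant/my-ns/my-topic/subscription/my-sub
   
   # 2. Reset the subscription's cursor to a past position, here one hour
   #    ago; the path parameter is a timestamp in epoch milliseconds.
   curl -X POST \
     http://broker:8080/admin/v2/persistent/my-tenant/my-ns/my-topic/subscription/my-sub/resetcursor/$(( $(date +%s%3N) - 3600000 ))
   ```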
   
   Afterwards, in one of the internal topics, messages were not delivered to the newly added consumers, and the backlog did not decrease. The stats for that internal topic at the time are here: [stats1.txt](https://github.com/apache/pulsar/files/3338390/stats1.txt)
   
   We ran the jstack command on the broker that owns the internal topic in question, but it failed.
   ```sh
   $ sudo jstack 56617
   
   56617: Unable to open socket file: target process not responding or HotSpot VM not loaded
   The -F option can be used when the target process is not responding
   ```
   
   However, we later noticed that the result had been written to `/var/log`. Here is that thread dump: [threaddump1.txt](https://github.com/apache/pulsar/files/3338409/threaddump1.txt)
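   
   For reference, when jstack reports "Unable to open socket file", common causes are a UID mismatch (jstack must run as the same user as the target JVM) or an unresponsive VM. A sketch of the usual workarounds, reusing pid 56617 from the output above and assuming the broker runs as a `pulsar` user:
   
   ```sh
   # Run jstack as the same user as the broker JVM (UID mismatch is a common cause).
   sudo -u pulsar jstack 56617
   
   # Or force a dump with -F, as the error message suggests; this uses the
   # serviceability agent and can work even when the VM is not responding.
   sudo jstack -F 56617
   ```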
   
   The problem was resolved by restarting the broker.
   
   # 2nd event
   
   The second event occurred several hours after the first. Backlog messages continued to increase in a different topic under the same namespace. We confirmed that delivery of messages had stopped for all subscriptions in one of the internal topics.
   
   Below are the stats and metadata of the internal topic. As stats-internal shows, all cursors stopped at the same position except the geo-replication cursor.
   
   - [stats2.txt](https://github.com/apache/pulsar/files/3338411/stats2.txt)
   - [stats-internal2.txt](https://github.com/apache/pulsar/files/3338412/stats-internal2.txt)
   - [info-internal2.txt](https://github.com/apache/pulsar/files/3338419/info-internal2.txt)
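   
   The attached files correspond to the pulsar-admin commands below; the topic name is a placeholder for the actual internal (partition) topic:
   
   ```sh
   # Per-topic stats, e.g. rates, backlog, subscriptions (stats2.txt)
   pulsar-admin topics stats persistent://my-tenant/my-ns/my-topic-partition-0
   
   # Internal stats, including per-cursor positions (stats-internal2.txt)
   pulsar-admin topics stats-internal persistent://my-tenant/my-ns/my-topic-partition-0
   
   # Managed ledger metadata (info-internal2.txt)
   pulsar-admin topics info-internal persistent://my-tenant/my-ns/my-topic-partition-0
   ```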
   
   Running jstack on this broker failed in the same way, but the result was again written to `/var/log`.
   
   [threaddump2.txt](https://github.com/apache/pulsar/files/3338420/threaddump2.txt)
   
   As with the first event, the problem was resolved by restarting the broker.
   
   **Pulsar version**
   broker: 2.3.2
   client: 2.1.1-incubating
