Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2019/10/23 09:11:03 UTC

Slack digest for #general - 2019-10-23

2019-10-22 10:51:26 UTC - sunliuchang: @sunliuchang has joined the channel
----
2019-10-22 12:17:25 UTC - Retardust: The consumer gets stuck after a restart. Only after restarting both Pulsar and the consumer does it continue to process the backlog.
Any ideas? Nothing in the logs.
----
2019-10-22 12:37:34 UTC - Sijie Guo: Can you get the topic stats using `pulsar-admin topics stats`?
----
2019-10-22 12:41:53 UTC - Alexandre DUVAL: Hi, I have an issue with a function: it starts fine, and after consuming 3xxx messages it gets stuck. There is no error, the function is reported as running, and the message it gets stuck on is never the same. Each time I restart the function, it again processes 3xxx messages and then gets stuck. There is RAM available. It gets stuck in context.publish(). Do you have any idea?
----
2019-10-22 12:44:10 UTC - Retardust: {
  "msgRateIn" : 0.0,
  "msgThroughputIn" : 0.0,
  "msgRateOut" : 10.00003482795463,
  "msgThroughputOut" : 3106588.20290695,
  "averageMsgSize" : 0.0,
  "storageSize" : 10668635541,
  "publishers" : [ ],
  "subscriptions" : {
    "skdf4k" : {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "msgBacklog" : 383636,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "msgDelayed" : 0,
      "unackedMessages" : 0,
      "msgRateExpired" : 0.0,
      "consumers" : [ ],
      "isReplicated" : false
    },
    "journal_consumer" : {
      "msgRateOut" : 10.00003482795463,
      "msgThroughputOut" : 3106588.20290695,
      "msgRateRedeliver" : 0.0,
      "msgBacklog" : 383736,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "msgDelayed" : 0,
      "unackedMessages" : 0,
      "type" : "Failover",
      "activeConsumerName" : "91149",
      "msgRateExpired" : 0.0,
      "consumers" : [ {
        "msgRateOut" : 10.00003482795463,
        "msgThroughputOut" : 3106588.20290695,
        "msgRateRedeliver" : 0.0,
        "consumerName" : "91149",
        "availablePermits" : 0,
        "unackedMessages" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "metadata" : { },
        "connectedSince" : "2019-10-22T12:14:02.351Z",
        "clientVersion" : "2.4.1",
        "address" : "/172.28.117.8:60366"
      } ],
      "isReplicated" : true
    }
  },
  "replication" : { },
  "deduplicationStatus" : "Disabled"
}


seems ok
----
2019-10-22 12:44:41 UTC - Retardust: but that's after the restart
----
2019-10-22 12:45:06 UTC - Retardust: I will try to get the stats while the problem is happening
----
2019-10-22 12:48:29 UTC - Raph: @Raph has joined the channel
----
2019-10-22 12:57:03 UTC - Alexandre DUVAL: 
----
2019-10-22 13:08:23 UTC - Sijie Guo: OK :ok_hand:
----
2019-10-22 13:23:41 UTC - Alexandre DUVAL: I bumped my function worker from 2.4.0 to 2.4.1 and now get this error:
----
2019-10-22 13:23:43 UTC - Alexandre DUVAL: ```13:10:51.038 [clevercloud/functions/accessLogsCleverCloudADCHaproxy-0] ERROR org.apache.pulsar.functions.instance.JavaInstanceRunnable - [clevercloud/functions/accessLogsCleverCloudADCHaproxy:0] Uncaught exception in Java Instance
java.lang.RuntimeException: User class constructor throws exception
        at org.apache.pulsar.functions.utils.Reflections.createInstance(Reflections.java:126) ~[org.apache.pulsar-pulsar-functions-utils-2.4.1.jar:2.4.1]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.setupJavaInstance(JavaInstanceRunnable.java:189) ~[org.apache.pulsar-pulsar-functions-instance-2.4.1.jar:?]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:234) [org.apache.pulsar-pulsar-functions-instance-2.4.1.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_192]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_192]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_192]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_192]
        at org.apache.pulsar.functions.utils.Reflections.createInstance(Reflections.java:118) ~[org.apache.pulsar-pulsar-functions-utils-2.4.1.jar:2.4.1]
        ... 3 more
Caused by: java.lang.LinkageError: ClassCastException: attempting to castjar:file:/pulsar/lib/javax.ws.rs-javax.ws.rs-api-2.1.jar!/javax/ws/rs/client/ClientBuilder.class to file:/tmp/pulsar-nar/pulsar-functions-0.1.0-SNAPSHOT.jar-unpacked/javax/ws/rs/client/ClientBuilder.class
        at javax.ws.rs.client.ClientBuilder.newBuilder(ClientBuilder.java:81) ~[javax.ws.rs-javax.ws.rs-api-2.1.jar:2.1]
        at javax.ws.rs.client.ClientBuilder.newClient(ClientBuilder.java:97) ~[javax.ws.rs-javax.ws.rs-api-2.1.jar:2.1]
        at com.clevercloud.pulsar.util.GeoIPAPI.updateDatabase(GeoIPAPI.java:101) ~[?:?]
        at com.clevercloud.pulsar.util.GeoIPAPI.<init>(GeoIPAPI.java:45) ~[?:?]
        at com.clevercloud.pulsar.function.ApplicationsAddonsHaproxyAccessLogs.<init>(ApplicationsAddonsHaproxyAccessLogs.java:28) ~[?:?]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_192]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_192]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_192]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_192]
        at org.apache.pulsar.functions.utils.Reflections.createInstance(Reflections.java:118) ~[org.apache.pulsar-pulsar-functions-utils-2.4.1.jar:2.4.1]
        ... 3 more
13:10:51.047 [clevercloud/functions/accessLogsCleverCloudADCHaproxy-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Closing instance```
----
2019-10-22 14:12:24 UTC - Tim Howard: ASF Jenkins still sideways? it looks like it from the build statuses...
----
2019-10-22 14:18:17 UTC - Matteo Merli: We’re still working and getting closer to a solution 
----
2019-10-22 14:18:51 UTC - Tim Howard: thanks for the update
----
2019-10-22 14:49:12 UTC - dbartz: @dbartz has joined the channel
----
2019-10-22 15:04:53 UTC - Alexandre DUVAL: @Matteo Merli After bump to 2.4.1 it's not shaded anymore?
----
2019-10-22 15:05:43 UTC - Matteo Merli: No, the change was meant for 2.5 though got backported to 2.4.1 as well
----
2019-10-22 15:06:26 UTC - Matteo Merli: the function framework is not shaded anymore; rather, it’s using different classloaders for framework and user code
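
A minimal sketch of what separate classloaders imply, assuming nothing about Pulsar's internals: a class with the same name loaded by two unrelated loaders yields two distinct types, so casting between them fails. That is consistent with the `LinkageError` above, where `ClientBuilder` was found both in `/pulsar/lib` and in the unpacked function jar. The names `Payload` and `ClassLoaderIsolationDemo` are hypothetical, and the demo needs a JDK because it compiles `Payload` at run time.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class ClassLoaderIsolationDemo {

    /** Compiles a trivial class named Payload into a fresh temp directory. */
    static Path compilePayload() {
        try {
            Path dir = Files.createTempDirectory("classloader-demo");
            Path source = dir.resolve("Payload.java");
            Files.write(source, "public class Payload {}".getBytes(StandardCharsets.UTF_8));
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler; run on a JDK");
            }
            int exitCode = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
            if (exitCode != 0) {
                throw new IllegalStateException("Compilation failed with exit code " + exitCode);
            }
            return dir;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Loads Payload through a new loader whose only parent is the bootstrap loader. */
    static Class<?> loadIsolated(Path classesDir) {
        try {
            URLClassLoader loader = new URLClassLoader(new URL[] {classesDir.toUri().toURL()}, null);
            return loader.loadClass("Payload");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true when an instance from one loader cannot be cast to the other loader's class. */
    static boolean castAcrossLoadersFails() {
        Path classesDir = compilePayload();
        Class<?> frameworkCopy = loadIsolated(classesDir); // stands in for the framework's copy
        Class<?> userCodeCopy = loadIsolated(classesDir);  // stands in for the copy in the user jar
        try {
            Object instance = frameworkCopy.getDeclaredConstructor().newInstance();
            userCodeCopy.cast(instance); // same class name, different defining loader
            return false;
        } catch (ClassCastException expected) {
            return true;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cast across loaders fails: " + castAcrossLoadersFails());
    }
}
```

The two loaders read the very same `.class` file, yet the cast is rejected, because a Java type is identified by its name together with its defining classloader.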
----
2019-10-22 15:16:28 UTC - Alexandre DUVAL: So how should I use this?
----
2019-10-22 15:19:05 UTC - Alexandre DUVAL: Do you have an example?
----
2019-10-22 15:33:19 UTC - Retardust: ```
public class Bridge implements MessageListener<byte[]> {

    private final Producer<JournalBatch> batchProducer;
    private final JournalBatchParser parser;

    @Override
    @SneakyThrows
    public void received(Consumer<byte[]> consumer, Message<byte[]> msg) {
        JournalBatch batch = parse(msg);
        batchProducer.sendAsync(batch)
                .thenAccept(messageId -> ack(consumer, msg));
    }

    private JournalBatch parse(Message<byte[]> msg) {
        return parser.parse(msg.getData());
    }

    @SneakyThrows
    private void ack(Consumer<byte[]> consumer, Message<byte[]> msg) {
        consumer.acknowledgeCumulativeAsync(msg)
                .thenAccept(d -> log.debug("Message ack"));
    }
}

```

Is this OK for connecting two topics while preserving order and at-least-once guarantees?
Which settings should I pay attention to?
----
2019-10-22 15:35:48 UTC - Matteo Merli: make sure to set `blockIfQueueFull(true)` when creating the `batchProducer`
heavy_check_mark : Retardust
----
2019-10-22 15:36:27 UTC - Matteo Merli: to get backpressure (instead of error) when publishing on the downstream topic
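
The difference between backpressure and an error can be sketched with a toy model of a bounded pending-message queue. This is an illustration under stated assumptions, not Pulsar's actual producer code: `PendingQueueModel`, `reserveSlot` and `sendCompleted` are hypothetical names, and only the two option names `blockIfQueueFull` and `maxPendingMessages` come from the discussion.

```java
import java.util.concurrent.Semaphore;

/** Toy model of a producer's bounded pending-message queue (not Pulsar's actual code). */
class PendingQueueModel {
    private final Semaphore slots;
    private final boolean blockIfQueueFull;

    PendingQueueModel(int maxPendingMessages, boolean blockIfQueueFull) {
        this.slots = new Semaphore(maxPendingMessages);
        this.blockIfQueueFull = blockIfQueueFull;
    }

    /** Reserves a slot before a send; waits or fails when no slot is free. */
    void reserveSlot() {
        if (blockIfQueueFull) {
            slots.acquireUninterruptibly(); // backpressure: the caller waits for a free slot
        } else if (!slots.tryAcquire()) {
            throw new IllegalStateException("pending queue is full"); // error path
        }
    }

    /** Frees a slot once the (simulated) broker has confirmed the send. */
    void sendCompleted() {
        slots.release();
    }

    int freeSlots() {
        return slots.availablePermits();
    }

    public static void main(String[] args) {
        PendingQueueModel failing = new PendingQueueModel(2, false);
        failing.reserveSlot();
        failing.reserveSlot();
        try {
            failing.reserveSlot();
        } catch (IllegalStateException e) {
            System.out.println("without blocking: " + e.getMessage());
        }

        PendingQueueModel blocking = new PendingQueueModel(1, true);
        blocking.reserveSlot();
        Thread broker = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            blocking.sendCompleted(); // simulated broker confirmation frees the slot
        });
        broker.start();
        blocking.reserveSlot(); // waits for the slot instead of failing
        System.out.println("with blocking: second send proceeded");
    }
}
```

In the bridge above, the blocking variant slows the listener thread down to the pace of the downstream topic, which in turn slows consumption from the upstream topic.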
----
2019-10-22 15:39:11 UTC - Matteo Merli: also, you’d need to handle send failures. There are 2 possible ways:
 1. Set `sendTimeout` to 0, so that the producer retries forever
 2. Send a negative ack when the publish fails:

```
batchProducer.sendAsync(batch)
                .thenAccept(messageId -> ack(consumer, msg))
                .exceptionally(ex -> {
                     consumer.negativeAck(msg);
                     return null;
                });
```
----
2019-10-22 15:45:14 UTC - Retardust: ok
----
2019-10-22 15:48:06 UTC - Retardust: But for throughput and latency, does everything seem OK?
It's not fast right now :( I have only one consumer, because I need to parse a single ordered stream without partitioning.

I see 200 Mbit/s throughput,
20% CPU usage,
not a lot of GC pauses.
Where could the bottleneck be?

Should I check settings like direct buffers, for example?
----
2019-10-22 15:50:08 UTC - Retardust: Overrides of the defaults:

Producer:
batching enabled, 50 ms window, up to 500 messages
maxPendingMessages 1000
LZ4 compression

Consumer:
receiver queue size 1000
----
2019-10-22 15:50:33 UTC - Retardust: Messages are somewhere between 5 KB and 1 MB
----
2019-10-22 15:50:52 UTC - Matteo Merli: are the messages batched in the upstream topic?
----
2019-10-22 15:51:19 UTC - Matteo Merli: also, check the topic stats for the upstream topic
----
2019-10-22 15:51:35 UTC - Matteo Merli: `pulsar-admin topics stats $TOPIC`
----
2019-10-22 15:52:31 UTC - Matteo Merli: and check for :

```
  "availablePermits" : 766, // Number of flow-control permits that Pulsar
                                  // has currently from a consumer. When > 0, it
                                  // means Pulsar can push more messages. When it's
                                  // <= 0, the broker will pause the delivery to
                                  // adjust to consumer processing speed
```
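
The permit mechanism described in that comment can be sketched as a small counter model. This is a simplified illustration, not Pulsar's implementation: `PermitModel` and its methods are hypothetical names, and it ignores details such as batched messages.

```java
/** Toy model of consumer flow-control permits (an illustration, not Pulsar's implementation). */
class PermitModel {
    private int availablePermits;
    private int backlog;

    PermitModel(int backlog) {
        this.backlog = backlog;
    }

    /** The consumer tells the broker it can accept n more messages. */
    void grantPermits(int n) {
        availablePermits += n;
    }

    /** The broker pushes as many messages as permits and backlog allow; returns the count sent. */
    int dispatch() {
        int sent = Math.min(Math.max(availablePermits, 0), backlog);
        availablePermits -= sent;
        backlog -= sent;
        return sent;
    }

    int availablePermits() {
        return availablePermits;
    }

    int backlog() {
        return backlog;
    }

    public static void main(String[] args) {
        PermitModel topic = new PermitModel(2500);
        topic.grantPermits(1000);                // e.g. a receiver queue of 1000 messages
        System.out.println(topic.dispatch());    // broker pushes while permits last
        System.out.println(topic.dispatch());    // no permits left: delivery pauses
        topic.grantPermits(500);                 // consumer processed 500 messages
        System.out.println(topic.dispatch());    // delivery resumes
    }
}
```

In this model, a permit count that stays near the receiver queue size while traffic is flowing means the broker is not waiting on the consumer, whereas a count pinned at 0 means the consumer is the limiting side.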
----
2019-10-22 15:53:23 UTC - Retardust: availablePermits = 1000 in stats at least
----
2019-10-22 15:54:03 UTC - Matteo Merli: when traffic is ongoing?
----
2019-10-22 15:54:11 UTC - Matteo Merli: then consumer is fast enough
----
2019-10-22 15:54:53 UTC - Retardust: Is there a Prometheus metric to check this? I don't see one.
----
2019-10-22 15:57:55 UTC - Matteo Merli: no, it’s not reported on Prometheus
----
2019-10-22 16:08:41 UTC - Retardust: "availablePermits" : 600,
under load.
But there is a huge lag (I reset the offset and am waiting for it to reprocess).
CPU is ok, GC is ok :) network is ok :)

But the rate is still low:

```
      "msgRateOut" : 4.999984407631958,
      "msgThroughputOut" : 2559363.268668303,

```
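
To put those two numbers in the same units as the 200 Mbit/s figure mentioned earlier, here is a small conversion sketch. It assumes `msgThroughputOut` is reported in bytes per second and `msgRateOut` in messages per second; `ThroughputMath` is a hypothetical helper name.

```java
/** Converts topic-stats rates into Mbit/s and an average message size. */
class ThroughputMath {

    /** Converts a byte rate to megabits per second (1 Mbit = 1,000,000 bits). */
    static double mbitPerSecond(double bytesPerSecond) {
        return bytesPerSecond * 8 / 1_000_000.0;
    }

    /** Average message size in bytes implied by a byte rate and a message rate. */
    static double averageMessageBytes(double bytesPerSecond, double messagesPerSecond) {
        return bytesPerSecond / messagesPerSecond;
    }

    public static void main(String[] args) {
        double msgRateOut = 4.999984407631958;        // value from the stats above
        double msgThroughputOut = 2559363.268668303;  // value from the stats above

        System.out.printf("throughput: %.2f Mbit/s%n", mbitPerSecond(msgThroughputOut));
        System.out.printf("average message: %.0f bytes%n",
                averageMessageBytes(msgThroughputOut, msgRateOut));
    }
}
```

Under those assumptions the stats correspond to roughly 20.5 Mbit/s and an average message of about 0.5 MB, which falls within the stated 5 KB to 1 MB message size range.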
----
2019-10-22 16:24:56 UTC - Retardust: and what stats should I check on the upstream topic? permits are on upstream topic consumer
rates are from upstream topic too
----
2019-10-22 16:27:34 UTC - Retardust: downstream topic stats are weird
```
{
  "msgRateIn" : 0.0,
  "msgThroughputIn" : 0.0,
  "msgRateOut" : 0.0,
  "msgThroughputOut" : 0.0,
  "averageMsgSize" : 0.0,
  "storageSize" : 64107851814,
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "producerId" : 0,
    "metadata" : { },
    "producerName" : "kappa-1295-20",
    "connectedSince" : "2019-10-22T15:44:42.119Z",
    "clientVersion" : "2.4.1",
    "address" : "/172.28.117.36:58302"
  } ],
```
----
2019-10-22 16:39:03 UTC - Retardust: ```

786837 2019-10-22 18:57:38,992 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ClientCnx                                     | [corpint5.moscow.alfaintra.net/172.28.117.19:9022] Broker notification of Closed consumer: 0
786838 2019-10-22 18:57:38,993 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConnectionHandler                             | [persistent://t1/n1/queue_journal] [journal_consumer] Closed connection [id: 0x6bb31ad1, L:/172.17.0.50:58302 - R:corpint5.moscow.alfaintra.net/172.28.117.19:9022] -- Will try again in 0.1 s
786940 2019-10-22 18:57:39,095 INFO  [ pulsar-timer-6-1 ] o.a.p.c.i.ConnectionHandler                             | [persistent://t1/n1/queue_journal] [journal_consumer] Reconnecting after timeout
787007 2019-10-22 18:57:39,162 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConsumerImpl                                  | [persistent://t1/n1/queue_journal][journal_consumer] Subscribing to topic on cnx [id: 0x6bb31ad1, L:/172.17.0.50:58302 - R:corpint5.moscow.alfaintra.net/172.28.117.19:9022]
787010 2019-10-22 18:57:39,165 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConsumerImpl                                  | [persistent://t1/n1/queue_journal][journal_consumer] Subscribed to topic on corpint5.moscow.alfaintra.net/172.28.117.19:9022 -- consumer: 0
1170871 2019-10-22 19:04:03,026 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ClientCnx                                     | [corpint5.moscow.alfaintra.net/172.28.117.19:9022] Broker notification of Closed consumer: 0
1170872 2019-10-22 19:04:03,027 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConnectionHandler                             | [persistent://t1/n1/queue_journal] [journal_consumer] Closed connection [id: 0x6bb31ad1, L:/172.17.0.50:58302 - R:corpint5.moscow.alfaintra.net/172.28.117.19:9022] -- Will try again in 0.1 s
1170974 2019-10-22 19:04:03,129 INFO  [ pulsar-timer-6-1 ] o.a.p.c.i.ConnectionHandler                             | [persistent://t1/n1/queue_journal] [journal_consumer] Reconnecting after timeout
1171101 2019-10-22 19:04:03,256 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConsumerImpl                                  | [persistent://t1/n1/queue_journal][journal_consumer] Subscribing to topic on cnx [id: 0x6bb31ad1, L:/172.17.0.50:58302 - R:corpint5.moscow.alfaintra.net/172.28.117.19:9022]
1171104 2019-10-22 19:04:03,259 INFO  [ pulsar-client-io-1-1 ] o.a.p.c.i.ConsumerImpl                                  | [persistent://t1/n1/queue_journal][journal_consumer] Subscribed to topic on corpint5.moscow.alfaintra.net/172.28.117.19:9022 -- consumer: 0
```

what could be the reason?
----
2019-10-22 16:50:55 UTC - Alexandre DUVAL: @xiaolong.ran hi, @Sijie Guo told me to tag you on this. It's about a function getting stuck on context.publish after processing ~3000 messages. It is the same after multiple restarts.
----
2019-10-22 16:51:12 UTC - Alexandre DUVAL: This function had been running for a few weeks, and today this happened.
----
2019-10-22 17:36:49 UTC - Retardust: I'm wondering why
msgRateIn per topic is 3mb
but Publish throughput for producer is 12.87 msg/s --- 816.96 Mbit/s
:slightly_smiling_face:
----
2019-10-22 17:37:32 UTC - Sergey Zhemzhitsky: What do you guys think about recent announcement of Streamlio acquisition by Splunk?
Splunk has already had Kafka and Flink internally, so I’m worried about Pulsar’s destiny.
```
Streamlio's experience with Pulsar, combined with Splunk's existing expertise in Apache Flink and Apache Kafka will result in the world's best real-time stream processing solution.
...
Splunk intends to continue to maintain Apache Pulsar and other projects through our acquisition of Streamlio. We're eager to find new ways to support the Apache Software Foundation, and the Pulsar project.
```

<https://www.splunk.com/blog/2019/10/21/splunk-to-expand-streaming-expertise-announces-intent-to-acquire-streamlio-open-source-distributed-messaging-leader.html>
----
2019-10-22 17:41:50 UTC - Endre Karlson: @Sijie Guo ^??
----
2019-10-22 17:42:25 UTC - Matteo Merli: Splunk has committed to ensuring the ongoing growth and success of Apache Pulsar through contributions and continuing support of the open source community (see Splunk blog).
----
2019-10-22 17:47:36 UTC - Sergey Zhemzhitsky: Well, Splunk will be fully committed in case it decides to replace its internal dataflows going through Kafka with Pulsar )
----
2019-10-22 17:59:11 UTC - Retardust: and
pulsar_rate_in : 80msg/s
but
rate(pulsar_storage_backlog_size[1m]) for same topic is
3 000 000 msg/s

what?))
----
2019-10-22 17:59:19 UTC - Matteo Merli: For now, we can only comment that Splunk plans to use Apache Pulsar in a number of its internal services and products.

I’m the least worried about Pulsar’s destiny :slightly_smiling_face: We’ve been working on maturing the technology for many years now and we’ll continue on the same path, to take it to the next level.

At the same time you might have noticed that the community has considerably expanded, with many companies invested in it for critical systems and contributing back.
----
2019-10-22 18:07:55 UTC - Retardust: pulsar_storage_backlog_size seems to be in bytes, not messages
?
documentation says it's messages
----
2019-10-22 18:46:47 UTC - Vladimir Shchur: Can you please comment regarding streamlio cloud? Is it discontinued?
----
2019-10-22 18:50:43 UTC - David Kjerrumgaard: @Vladimir Shchur The free trial period for Streamlio cloud has concluded. Any existing trials will continue until they conclude, but we will not be accepting new trial applications at this time.
----
2019-10-22 18:55:01 UTC - Vladimir Shchur: @David Kjerrumgaard what about the non-trial offering? We regarded Streamlio as a Pulsar-as-a-service platform on AWS and planned to do some business with it. Is it gone?
----
2019-10-22 18:59:13 UTC - Chris Bartholomew: FYI, we offer Pulsar as a service on AWS and GCP. Azure coming soon. <https://kafkaesque.io/>
----
2019-10-22 19:02:36 UTC - Sijie Guo: @Sergey Zhemzhitsky Apache Pulsar is a 100% open source project, hosted at the vendor-independent Apache Software Foundation. The PMC is the group of people who lead the direction and development of Pulsar. Pulsar PMC members come from many different companies: Yahoo, Yahoo! JAPAN, Zhaopin and others. It will not fall apart due to one vendor acquisition. That’s also the whole point of running Pulsar in the Apache way. The community lives much, much longer than vendors.

Also the Pulsar community is growing really fast. Many large companies have already invested heavily in using Pulsar in their mission critical services. For example, Tencent (one of the largest and most valuable internet companies) has adopted and run Pulsar at a very large scale (<https://streamnative.io/blog/tech/2019-10-22-powering-tencent-billing-platform-with-apache-pulsar/>). It uses Pulsar to power its billing platform for processing tens of billions of transactions every day for its total escrowed accounts of 30 billion dollars.

We, StreamNative, a company also founded by a group of Pulsar/BookKeeper PMC members, will continue our commitment to provide commercial support for Pulsar and work with the broader Pulsar community, including Splunk, to push the project to the next level. We are really positive about the project and the community, as we have helped a lot of companies adopt Pulsar and have seen a fast-growing adoption pace. It just takes some time for those adopters to share their stories publicly. We have published some of the success stories (<https://streamnative.io/success-stories/>) and more will be coming.

Hope this gives you some ideas.
heart_eyes : Pierre Zemb
----
2019-10-22 19:06:34 UTC - Vladimir Shchur: Thank you! I've evaluated your project as well, but looks like pulsar functions support is missing, which is crucial for us
----
2019-10-22 19:07:45 UTC - Chris Bartholomew: Pulsar functions is definitely on the roadmap.
----
2019-10-22 19:08:09 UTC - Vladimir Shchur: Any time commitments available?
----
2019-10-22 19:09:23 UTC - Chris Bartholomew: Definitely before the end of Dec, if not earlier.
----
2019-10-22 19:09:55 UTC - Vladimir Shchur: Thank you, will keep that in mind
----
2019-10-22 19:10:46 UTC - Chris Bartholomew: We will be supporting all Pulsar core features (Functions, IO Connectors,  Schema registry).
----
2019-10-22 19:12:02 UTC - Luke Lu: Congrats on closing the seed round :slightly_smiling_face:
----
2019-10-22 22:46:58 UTC - Nicolas Ha: The same Kubernetes issue as last time has popped up again: 2 of the 3 brokers are in `CrashLoopBackOff` state. Here is what I see in the logs:
```
22:41:56.520 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
org.apache.pulsar.broker.PulsarServerException: java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:472) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
Caused by: java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
	at org.apache.pulsar.functions.worker.WorkerService.start(WorkerService.java:206) ~[org.apache.pulsar-pulsar-functions-worker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1046) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:459) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	... 2 more
Caused by: org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
	at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:271) ~[org.apache.pulsar-pulsar-client-api-2.4.1.jar:2.4.1]
	at org.apache.pulsar.client.impl.ProducerBuilderImpl.create(ProducerBuilderImpl.java:88) ~[org.apache.pulsar-pulsar-client-original-2.4.1.jar:2.4.1]
	at org.apache.pulsar.functions.worker.FunctionMetaDataManager.getServiceRequestManager(FunctionMetaDataManager.java:484) ~[org.apache.pulsar-pulsar-functions-worker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.functions.worker.FunctionMetaDataManager.<init>(FunctionMetaDataManager.java:74) ~[org.apache.pulsar-pulsar-functions-worker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.functions.worker.WorkerService.start(WorkerService.java:156) ~[org.apache.pulsar-pulsar-functions-worker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.broker.PulsarService.startWorkerService(PulsarService.java:1046) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:459) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
	... 2 more
```
Not being a Kubernetes expert, what would you check first?
Note: there is plenty of disk space.
----
2019-10-22 22:48:58 UTC - Nicolas Ha: kubectl describe deployments.apps broker
```
Name:                   broker
Namespace:              default
CreationTimestamp:      Mon, 02 Sep 2019 01:54:28 +0100
Labels:                 app=pulsar
                        component=broker
Annotations:            deployment.kubernetes.io/revision: 2
                        kubectl.kubernetes.io/last-applied-configuration:
                          {"apiVersion":"apps/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"broker","namespace":"default"},"spec":{"replicas":3,...
Selector:               app=pulsar,component=broker
Replicas:               3 desired | 3 updated | 3 total | 1 available | 2 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:       app=pulsar
                component=broker
  Annotations:  prometheus.io/port: 8080
                prometheus.io/scrape: true
  Containers:
   broker:
    Image:       apachepulsar/pulsar-all:2.4.1
    Ports:       8080/TCP, 6650/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      sh
      -c
    Args:
      bin/apply-config-from-env.py conf/broker.conf && bin/apply-config-from-env.py conf/pulsar_env.sh && bin/gen-yml-from-env.py conf/functions_worker.yml && bin/pulsar broker

    Limits:
      memory:  2Gi
    Requests:
      memory:  2Gi
    Environment Variables from:
      broker-config  ConfigMap  Optional: false
    Environment:
      advertisedAddress:   (v1:status.podIP)
    Mounts:               <none>
  Volumes:                <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   broker-84678846d6 (3/3 replicas created)
Events:          <none>
```
----
2019-10-22 22:58:08 UTC - Ambud Sharma: congratulations @Matteo Merli
----
2019-10-22 23:10:00 UTC - Matteo Merli: thanks @Ambud Sharma
----
2019-10-23 03:39:50 UTC - xiaolong.ran: Hello, is there any log information about this error in the broker?
----
2019-10-23 05:39:46 UTC - Sijie Guo: it seems that it failed to recover a ledger
----
2019-10-23 05:40:02 UTC - Sijie Guo: did you replace any disk or erase disks before?
----
2019-10-23 05:55:25 UTC - Sijie Guo: <https://medium.com/streamnative/how-to-use-apache-pulsar-manager-with-herddb-dd265c955ca4>
----
2019-10-23 05:56:29 UTC - Sijie Guo: The new blog post from @Enrico Olivelli about using HerdDB in Pulsar Manager.
+1 : Retardust
----
2019-10-23 06:31:01 UTC - Retardust: Also, there is no info about the pulsar_msg_backlog metric in the documentation; I will open a PR.
----
2019-10-23 07:58:52 UTC - Nicolas Ha: Nothing happened between it functioning fine and the errors, which is why I am puzzled
----
2019-10-23 07:59:56 UTC - Nicolas Ha: Although I did go from 5 nodes to 3 nodes. Could that be it?
----
2019-10-23 08:00:21 UTC - Nicolas Ha: And more importantly, if that's the case, how can I recover?
----