Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/01/26 22:19:55 UTC

Slack digest for #general - 2018-01-26

2018-01-26 00:54:38 UTC - Jaebin Yoon: While doing the load tests, all consumer connections got closed and I see the following exceptions on the broker. 
```2018-01-26 00:38:08,489 - ERROR - [pulsar-web-61-28:PulsarWebResource@381] - [null] Failed to validate namespace bundle netflix/prod/ns1/0x84000000_0x86000000
java.lang.IllegalArgumentException: Invalid upper boundary for bundle
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
	at org.apache.pulsar.common.naming.NamespaceBundles.validateBundle(NamespaceBundles.java:110)
	at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleRange(PulsarWebResource.java:378)
	at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleOwnership(PulsarWebResource.java:404)
	at org.apache.pulsar.broker.admin.Namespaces.splitNamespaceBundle(Namespaces.java:876)
	at sun.reflect.GeneratedMethodAccessor127.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:143)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)```
----
2018-01-26 00:56:48 UTC - Jaebin Yoon: It seems it tried to split the bundles from 4 into a bigger number when the traffic got heavy.
----
2018-01-26 00:58:02 UTC - Matteo Merli: > all consumer connections got closed 

That is expected, since the default configuration is to immediately “unload” the newly split bundles, to give them a chance to be reassigned to a different broker right away
----
2018-01-26 01:00:32 UTC - Matteo Merli: I think the exception that gets printed is related to the fact that the bundles list gets updated by all the splits happening in parallel. I don’t think that should pose a big problem; the split for that particular bundle should have been re-attempted at the next iteration
----
2018-01-26 01:01:57 UTC - Matteo Merli: (note that it should be avoided.. )
----
2018-01-26 01:01:57 UTC - Jaebin Yoon: Somehow the consumers stopped consuming right after that happened.. So in this case, the bundle count should be 8 then.. I'll check if that is the case.
----
2018-01-26 01:02:44 UTC - Matteo Merli: did they reconnect?
----
2018-01-26 01:07:20 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: <https://apache-pulsar.slack.com/files/U8CM86831/F8YV113EE/-.txt|Untitled>
----
2018-01-26 01:07:23 UTC - Matteo Merli: the exception itself is thrown in the validation phase, so it shouldn’t affect the running state in any way; the split for that particular bundle has simply failed for now
----
2018-01-26 01:07:56 UTC - Jaebin Yoon: so it seems it failed to reconnect
----
2018-01-26 01:08:01 UTC - Matteo Merli: Oh I see, you got an exception in the `consumer.acknowledge()`
----
2018-01-26 01:09:00 UTC - Matteo Merli: which is to be expected from an API perspective
----
2018-01-26 01:09:28 UTC - Jaebin Yoon: I see..
----
2018-01-26 01:10:01 UTC - Jaebin Yoon: so I need to catch that and retry the message?
----
2018-01-26 01:10:44 UTC - Matteo Merli: `acknowledge()` will fail when not connected to the broker. There’s not much else to do
----
2018-01-26 01:11:09 UTC - Matteo Merli: Retrying the acknowledgement is not needed either, because the message will be replayed by the broker anyway
----
2018-01-26 01:11:38 UTC - Matteo Merli: the easiest way to deal with it is to just use `consumer.acknowledgeAsync()` and not bother to track the future
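For illustration, a minimal sketch of that pattern (assuming an already-subscribed Java `Consumer`; not code from this thread):
```
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClientException;

public class AckExample {
    // Sketch: acknowledge asynchronously and deliberately ignore the returned future.
    // If the ack is lost because the connection dropped (e.g. during a bundle unload),
    // the broker will redeliver the message anyway, so there is nothing useful to retry.
    static void consumeLoop(Consumer consumer) throws PulsarClientException {
        while (true) {
            Message msg = consumer.receive();   // blocks until a message is available
            // ... process the message ...
            consumer.acknowledgeAsync(msg);     // fire-and-forget acknowledgement
        }
    }
}
```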
----
2018-01-26 01:12:18 UTC - Jaebin Yoon: ok. let me try that. thanks!
----
2018-01-26 01:13:33 UTC - Jaebin Yoon: Regarding the split failure, it failed to split because of the race condition, and that is OK?
----
2018-01-26 01:14:23 UTC - Jaebin Yoon: it should be split .. if it fails, will it retry?
----
2018-01-26 01:14:57 UTC - Matteo Merli: yes, the same logic that leads to the split is checked periodically (traffic, # of topics, …)
----
2018-01-26 01:15:34 UTC - Jaebin Yoon: alright. cool. thanks!
----
2018-01-26 01:26:48 UTC - Matteo Merli: btw: I’ve been merging most of the fixes to issues you encountered:

Default limit on number of lookups with 10K partitions:
<https://github.com/apache/incubator-pulsar/pull/1116>

Fix for Kafka consumer wrapper slow consumption (this will be merged shortly):
<https://github.com/apache/incubator-pulsar/pull/1115>

Fix for the race condition in producer code:
<https://github.com/apache/incubator-pulsar/pull/1108>

Reducing the exception printed in logs after disconnections:
<https://github.com/apache/incubator-pulsar/pull/1107>

Load balancer not being enabled by default in embedded broker:
<https://github.com/apache/incubator-pulsar/pull/1104>


My next step is to test with your VM configuration to verify under the same conditions. So far my test VMs were the ones that I deploy from the messaging benchmark deployment scripts, and I can’t see any more issues at this point
----
2018-01-26 02:25:41 UTC - Jaebin Yoon: Oh that's great! Thanks @Matteo Merli for quickly addressing the issues. Let me give you the exact test setup and configuration I have. Most of the problems I've seen were under the stressful condition where there were many partitions and consumers kept asking for old data because they didn't ack messages. Once the consumers were following the producers closely, everything got smoother. Apparently that reduces the traffic between brokers and bookies. I'm still using one mount point over multiple HDDs for both journals and ledgers, which makes reading old data worse.
----
2018-01-26 03:36:48 UTC - Jaebin Yoon: I'm seeing lots of this exception in the broker log. It seems the load balancing doesn't kick in because of this.
```2018-01-26 03:33:38,823 - WARN  - [pulsar-1-2:SimpleResourceAllocationPolicies@56] - GetIsolationPolicies: Unable to get the namespaceIsolationPolicies [{}]
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
	at com.google.common.base.Joiner.toString(Joiner.java:454)
	at com.google.common.base.Joiner.appendTo(Joiner.java:109)
	at com.google.common.base.Joiner.appendTo(Joiner.java:154)
	at com.google.common.base.Joiner.appendTo(Joiner.java:141)
	at com.google.common.base.Joiner.appendTo(Joiner.java:168)
	at org.apache.pulsar.broker.web.PulsarWebResource.path(PulsarWebResource.java:101)
	at org.apache.pulsar.broker.loadbalance.impl.SimpleResourceAllocationPolicies.getIsolationPolicies(SimpleResourceAllocationPolicies.java:54)
	at org.apache.pulsar.broker.loadbalance.impl.SimpleResourceAllocationPolicies.isSharedBroker(SimpleResourceAllocationPolicies.java:97)
	at org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared.applyPolicies(LoadManagerShared.java:129)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.selectBrokerForAssignment(ModularLoadManagerImpl.java:659)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.getLeastLoaded(ModularLoadManagerWrapper.java:67)
	at org.apache.pulsar.broker.namespace.NamespaceService.getLeastLoadedFromLoadManager(NamespaceService.java:463)
	at org.apache.pulsar.broker.namespace.NamespaceService.searchForCandidateBroker(NamespaceService.java:338)
	at org.apache.pulsar.broker.namespace.NamespaceService.lambda$15(NamespaceService.java:301)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)```
----
2018-01-26 03:39:48 UTC - Jaebin Yoon: And the split failures keep repeating.. resulting in consumers failing to connect to the broker.
----
2018-01-26 03:43:27 UTC - Jaebin Yoon: Producers and consumers eventually reconnected, but the brokers were unstable for 4~5 min because of this
----
2018-01-26 03:47:01 UTC - Jaebin Yoon: After that I got these exceptions on the brokers:
```2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x12c80000_0x12d00000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0xf21b9fff_0xf21bafff because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x1b780000_0x1b782000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,927 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x65c80000_0x65cc0000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,927 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x1c100000_0x1c200000 because namespace netflix/prod/ns1 has too many bundles: 128```
----
2018-01-26 03:48:01 UTC - Jaebin Yoon: Somehow it split but kept failing in validation, so it seems it kept doing that again and again without actually distributing topics to bundles
----
2018-01-26 03:50:15 UTC - Jaebin Yoon: Maybe my namespace setup is missing some policies to make this load balancing work?
----
2018-01-26 03:52:39 UTC - Matteo Merli: From the log message it looks like it already split that into 128 bundles, which is the default limit for auto split. Will look into the previous exception
----
2018-01-26 03:53:42 UTC - Jaebin Yoon: How do I check if the topic partitions got assigned to different bundles? I have only 10 partitions for my topic now.
----
2018-01-26 03:54:55 UTC - Matteo Merli: The assignment is based on the hash of the topic name 
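For illustration only: roughly what “assignment by hash of the topic name” means. The real logic lives in the broker’s `NamespaceBundles` code; the CRC32 hash and the boundary array below are stand-ins, not Pulsar’s actual internals.
```
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class BundleHashSketch {
    // Sketch: hash the full topic name into the 32-bit keyspace and find the bundle
    // whose [lower, upper) range contains the hash. The broker owning that bundle
    // serves the topic, so partitions of one topic spread out by their hashed names.
    static int bundleIndexFor(String topicName, long[] boundaries) {
        CRC32 crc = new CRC32();                      // stand-in hash, not necessarily Pulsar's
        crc.update(topicName.getBytes(StandardCharsets.UTF_8));
        long hash = crc.getValue();                   // unsigned 32-bit value
        for (int i = 0; i < boundaries.length - 1; i++) {
            if (hash >= boundaries[i] && hash < boundaries[i + 1]) {
                return i;                             // topic lands in bundle i
            }
        }
        return boundaries.length - 2;                 // last bundle includes the upper bound
    }
}
```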
----
2018-01-26 03:55:42 UTC - Matteo Merli: You can get the load report for the broker, which contains all the per-bundle stats as well
----
2018-01-26 03:56:10 UTC - Matteo Merli: E.g.: `pulsar-admin broker-stats load-report`
----
2018-01-26 04:10:27 UTC - Jaebin Yoon: OK. I see the 10 topic partitions got distributed over 8 brokers out of 10. Not sure why it ended up split up to 128.
----
2018-01-26 04:55:57 UTC - Jaebin Yoon: So 128 is the auto-split limit, but you can set any higher number manually, right? The bundle is the broker distribution unit, so it definitely needs to be able to go above 128.
----
2018-01-26 04:56:59 UTC - Matteo Merli: Correct, that's the default for the auto split limit. 
----
2018-01-26 04:57:48 UTC - Matteo Merli: You can increase the default, and you can also pre-create a namespace with a larger number of bundles to begin with
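For reference, a sketch of both options (flag and property names as I recall them; double-check against `pulsar-admin namespaces create --help` and your broker.conf, and the namespace name below is just a placeholder):
```
# Pre-create a namespace with a larger bundle count instead of relying on auto-split
bin/pulsar-admin namespaces create my-prop/my-cluster/my-ns --bundles 128

# Or raise the auto-split ceiling in broker.conf (default is 128)
loadBalancerNamespaceMaximumBundles=256
```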
----
2018-01-26 04:58:50 UTC - Jaebin Yoon: I see. It seems better to start with a bigger number. Auto-split can make the data pipeline unstable for a while.
----
2018-01-26 04:59:08 UTC - Jaebin Yoon: Regarding the separate disk for the bookie journal, can I use a loopback device for the test?
----
2018-01-26 05:00:06 UTC - Jaebin Yoon: Do you think it would work for tests? It's hard to attach a separate hard disk in our env now.. and we don't keep a machine alive when it's down anyway
----
2018-01-26 05:00:35 UTC - Jaebin Yoon: When the machine is gone, it's gone. We cannot reuse it, not even reboot it.
----
2018-01-26 05:01:23 UTC - Jaebin Yoon: Eventually, I think we should have a separate disk for the journal, but for testing, do you see any issue with using a loopback device?
----
2018-01-26 05:02:04 UTC - Jaebin Yoon: I just want to separate read and write IO to make sure that's not the cause of the system instability.
----
2018-01-26 05:03:37 UTC - Matteo Merli: What do you mean by loopback device? The disk that comes by default with the vm?
----
2018-01-26 05:04:06 UTC - Jaebin Yoon: like /tmp .. memory based device.
----
2018-01-26 05:04:19 UTC - Jaebin Yoon: memory based block device.
----
2018-01-26 05:04:34 UTC - Matteo Merli: Oh I see, /dev/shm?
----
2018-01-26 05:04:38 UTC - Jaebin Yoon: yeah
----
2018-01-26 05:04:46 UTC - Matteo Merli: That should be good 
----
2018-01-26 05:05:11 UTC - Jaebin Yoon: OK, I'll use that for testing. That'll help us isolate IO for testing.
----
2018-01-26 05:05:27 UTC - Matteo Merli: You can reduce the number of journal files retained from 5 to 1 as well
----
2018-01-26 05:06:02 UTC - Matteo Merli: Let me search for the conf name
----
2018-01-26 05:07:26 UTC - Matteo Merli: `journalMaxBackups=1` in `bookkeeper.conf`
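For example, a sketch of the relevant `bookkeeper.conf` entries for a ramdisk-backed journal (the paths are placeholders; the property names are standard BookKeeper ones):
```
# Journal on the memory-backed device, ledgers on the regular disk
journalDirectory=/dev/shm/bookkeeper/journal
ledgerDirectories=/data/bookkeeper/ledgers

# Keep only one old journal file around (default is 5) to save memory
journalMaxBackups=1
```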
----
2018-01-26 05:08:07 UTC - Jaebin Yoon: i see. ok.
----
2018-01-26 05:08:21 UTC - Jaebin Yoon: i'll change that. thanks!
----
2018-01-26 05:08:56 UTC - Matteo Merli: Just to use less memory 
----
2018-01-26 05:09:08 UTC - Jaebin Yoon: yup
----
2018-01-26 06:20:33 UTC - Jaebin Yoon: I meant "ramdisk" when I said "loopback" device or disk. I haven't used those terms for a long time, so I got confused. Anyway, I think it might not be a terrible idea to use that even for a production env where you cannot recover machines by rebooting, as long as the machines have enough memory to hold the journal. I'll think about this setup more for our environment.
----
2018-01-26 06:27:03 UTC - Jaebin Yoon: To upgrade bookies, I think we should do a rolling upgrade of the live cluster. After upgrading one bookie, what metrics should I look for to know that I can upgrade the next one, so that I don't lose any data?
----
2018-01-26 07:13:47 UTC - Matteo Merli: In general we were doing rolling upgrades with these steps (consolidated into a sketch after the list): 
 * PRE: verify there are no under-replicated ledgers: `bin/bookkeeper shell listunderreplicated` should return an empty list
 * PRE: disable auto-recovery, since we don’t want to trigger a data copy when the bookie will come back shortly 
    (the setting here is cluster-wide): `bin/bookkeeper shell autorecovery -disable` 
 * Do the upgrade.. 
 * POST: run the bookie sanity check: `bin/bookkeeper shell bookiesanity` (this should also be run every X minutes on each bookie for alerting)
 * POST: re-enable auto-recovery, to leave the cluster setting in its initial state: `bin/bookkeeper shell autorecovery -enable`
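Consolidated as a per-bookie sketch (same commands as above; adapt it to your automation):
```
# PRE: no under-replicated ledgers expected (output should be empty)
bin/bookkeeper shell listunderreplicated

# PRE: disable auto-recovery (note: this flag is cluster-wide)
bin/bookkeeper shell autorecovery -disable

# ... upgrade and restart this bookie ...

# POST: sanity-check the upgraded bookie
bin/bookkeeper shell bookiesanity

# POST: re-enable auto-recovery for the cluster
bin/bookkeeper shell autorecovery -enable
```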
----
2018-01-26 08:05:30 UTC - Jaebin Yoon: thanks a lot. So these steps are for each bookie, right? Let me try.
----
2018-01-26 08:15:30 UTC - Jaebin Yoon: Ah, this is a cluster setting. So basically "do the upgrade" means rolling-restarting all bookies, I guess.
----
2018-01-26 08:17:54 UTC - Jaebin Yoon: What if we need to provision new machines replacing the existing ones? In this case, you shouldn't disable auto-recovery, but kill one bookie, provision a new one, and check that "listunderreplicated" is empty before doing the next one.
----
2018-01-26 09:21:42 UTC - Jaebin Yoon: Do I need to run a separate auto-recovery service? (I'm seeing there is `bin/bookkeeper autorecovery`.) 
I brought up new bookies and killed the existing bookies one by one, but I don't see any 'underreplicated' entries from the bookkeeper shell command, and I'm not sure if the data got copied to the others.
----
2018-01-26 16:43:22 UTC - Matteo Merli: > Ah, this is a cluster setting. So basically “do the upgrade” means rolling-restarting all bookies, I guess.

Sorry, I didn’t specify: the steps are per-bookie; the flag to enable/disable auto-recovery is per-cluster. 

Since all the automation we had was per-host, we flip the flag off and on for each host. It’s also possible to turn it off before the whole upgrade and back on at the very end
----
2018-01-26 16:46:18 UTC - Matteo Merli: > What if we need to provision new machines replacing the existing ones? In this case, you shouldn’t disable auto-recovery, but kill one bookie, provision a new one, and check that “listunderreplicated” is empty before doing the next one.

That’s correct. Auto-recovery should always be “on”, except when there’s planned maintenance and you want to avoid the data getting copied around. (BTW: for that, there’s also an over-replication check that happens periodically, by default every 24h, to check whether we have extra copies of some ledgers and get rid of them.)
----
2018-01-26 16:56:19 UTC - Matteo Merli: > Do I need to run a separate auto-recovery service? (I’m seeing there is `bin/bookkeeper autorecovery`.) 
> I brought up new bookies and killed the existing bookies one by one, but I don’t see any ‘underreplicated’ entries from the bookkeeper shell command, and I’m not sure if the data got copied to the others.

Yes, I think I forgot to mention this. The auto-recovery service is a logically separate component, completely stateless, and it can be started in a few different ways (see the sketch after the list): 
 * In the same JVM process as the bookies (`autoRecoveryDaemonEnabled=true` in bookkeeper.conf). It’s easy to start, but in general you want to avoid having the recovery process interfere with the bookie process and pollute the bookie logs, since it’s kind of noisy.
 * In a separate process running alongside each bookie process (`bin/bookkeeper autorecovery`). This process needs little memory (-Xmx512M) and no other special tuning.
 * As a pool of auto-recovery workers running independently from the bookies. This is generally a good option when running with an orchestration framework, where it’s easy to manage stateless components.
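A small sketch of the first two options (the flag is the one mentioned above; give the standalone process a small heap, e.g. -Xmx512M, through your usual JVM options mechanism):
```
# Option 1: run recovery inside the bookie JVM -- in bookkeeper.conf:
autoRecoveryDaemonEnabled=true

# Option 2: run it as a separate, stateless process alongside each bookie:
bin/bookkeeper autorecovery
```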
----
2018-01-26 17:06:39 UTC - Matteo Merli: regarding auto-recovery, there’s more documentation here: <http://bookkeeper.apache.org/docs/latest/admin/autorecovery/>
----
2018-01-26 17:33:35 UTC - Jaebin Yoon: Thanks @Matteo Merli. I don't see that option (autoRecoveryDaemonEnabled) specified in bookkeeper.conf (maybe we should add the default value in bookkeeper.conf so that people become aware of this option?). Without that service running, it seems no data copying is done when a bookie goes down (nobody updates the under-replicated ledgers, so checking listunderreplicated doesn't do anything). I think I lost all my test data. ^^ It's OK since it was just test data.
Can you explain how this auto-recovery works at a high level? I would like to confirm how the data flows. 
Does the auto-recovery worker just talk to ZooKeeper, find any missing bookies, and update the under-replicated list in a znode? Or is it involved in moving ledgers from one bookie to another? Who moves the actual ledgers?
----
2018-01-26 17:43:53 UTC - Matteo Merli: > I don’t see that option (autoRecoveryDaemonEnabled) specified in bookkeeper.conf. 

Yes, for some reason the auto-recovery options with their default values are not in the `bookkeeper.conf` included with Pulsar. We should fix that.

> Can you explain how this auto-recovery works at a high level? I would like to confirm how the data flows.
> Does the auto-recovery worker just talk to ZooKeeper, find any missing bookies, and update the under-replicated list in a znode? Or is it involved in moving ledgers from one bookie to another? Who moves the actual ledgers?

There are 2 parts: the auditor and the workers.
 * Auditor monitors bookies to check for missing ones
 * When it finds one, it will mark all ledgers that have a copy on that bookie as “under-replicated”
 * Workers will pick up tasks to do, in the form of ledgers to replicate
 * When a worker is done replicating a particular ledger, it will update the metadata (replacing the mention of the failed bookie with the new bookie that contains the copy of the data now) and then it will clear the under-replication status for that ledger
 * The workers keep waiting for the next ledger to replicate

<http://bookkeeper.apache.org/docs/latest/admin/autorecovery/#autorecovery-architecture>
----
2018-01-26 17:48:04 UTC - Jaebin Yoon: Oh, I should've read the doc first. It seems the process is well documented there. So the bookies themselves have no ability to replicate; these auto-recovery workers read from and write to the bookies. So potentially the network between the auto-recovery nodes and the bookies can get saturated when a bookie goes down.
----
2018-01-26 17:48:36 UTC - Matteo Merli: there’s throttling at the worker level
----
2018-01-26 17:48:57 UTC - Matteo Merli: the max number of entries to replicate at a given time
----
2018-01-26 17:49:41 UTC - Matteo Merli: These are the options: <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/conf/bk_server.conf#L213>
----
2018-01-26 17:49:52 UTC - Jaebin Yoon: I see. I'll dig more into this recovery, since it would be an important part for an environment where machines can go away easily.
----
2018-01-26 17:50:26 UTC - Matteo Merli: This conf file is from master; the only option that is not in the current version is `lostBookieRecoveryDelay`. We’ll have that once we switch to BK 4.7
----
2018-01-26 18:02:02 UTC - Jaebin Yoon: Since I lost all the ledgers for the existing topic, I need to clean them up. Currently nobody produces to or consumes from the topic. Can I just delete the topic that lost data and recreate it to clean up the previous metadata?
----
2018-01-26 18:22:15 UTC - Jaebin Yoon: Another question. It seems that this auto-recovery takes care of the case where a bookie is gone. What about the under-replicated case where the broker fails to replicate because of a temporary connection issue? 
For example, if write quorum = 2 and ack quorum = 1, then a message may fail to be replicated to two bookies. Will this lead to inconsistency in the replicated ledgers? Or is there any mechanism to handle this kind of failure?
----
2018-01-26 19:20:43 UTC - Matteo Merli: > For example, if write quorum = 2 and ack quorum = 1, then a message may fail to be replicated to two bookies. Will this lead to inconsistency in the replicated ledgers? Or is there any mechanism to handle this kind of failure?

If the write operation eventually fails on the second bookie, the BK client will take care of it by replacing the failed bookie with a different one from the cluster and re-sending all the entries that were not acknowledged by that particular bookie
----
2018-01-26 19:30:04 UTC - Matteo Merli: > Since I lost all the ledgers for the existing topic, I need to clean them up. Currently nobody produces to or consumes from the topic. Can I just delete the topic that lost data and recreate it to clean up the previous metadata?

Yes, using `bin/pulsar-admin persistent delete-partitioned-topic $MY_TOPIC` should do the trick
----