Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/01/26 01:37:10 UTC

Slack digest for #general - 2018-01-25

2018-01-25 04:50:44 UTC - Jaebin Yoon: I'm seeing a bunch of these messages in the broker logs, and producers and consumers get disconnected and reconnect. ```2018-01-25 04:48:34,147 - WARN  - [BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 100.82.113.61:3181, bookieIndex 0, but we have already fixed it.
2018-01-25 04:48:34,147 - WARN  - [BookKeeperClientWorker-24-1:PendingAddOp@238] - Write did not succeed: L211502 E56322 on 100.82.113.61:3181, rc = -23
2018-01-25 04:48:34,147 - WARN  - [BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 100.82.113.61:3181, bookieIndex 0, but we have already fixed it.
2018-01-25 04:48:34,148 - WARN  - [BookKeeperClientWorker-24-1:PendingAddOp@238] - Write did not succeed: L211502 E52420 on 100.82.113.61:3181, rc = -23
2018-01-25 04:48:34,148 - WARN  - [BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 100.82.113.61:3181, bookieIndex 0, but we have already fixed it.```
What could the issue be?
----
2018-01-25 04:55:34 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: <https://apache-pulsar.slack.com/files/U8CM86831/F8YAFQN0L/-.txt|Untitled>
----
2018-01-25 06:52:24 UTC - Sijie Guo: @Jaebin Yoon are the bookie logs from `100.82.113.61`?
----
2018-01-25 06:59:40 UTC - Jaebin Yoon: @Sijie Guo I'm not sure. Some time after I saw a bunch of "Write did not succeed.." exceptions in the brokers, I saw the above exceptions on multiple bookies.
@Matteo Merli actually I found one issue with our consumer code. It didn't acknowledge messages.. (I think we forgot to add that when we switched from the kafka wrapper to the pulsar client because of the high-partition issues). This would explain why the consumer queue was almost always full, leading to many different issues. I think this definitely stressed the brokers and bookies since it kept asking for the old messages again and again.
----
2018-01-25 07:15:18 UTC - Jaebin Yoon: But the traffic was not huge (1k ~ 10k qps from one producer), so it was strange that it could make things so unstable. I'll see whether those issues show up again with the usual setup.
----
2018-01-25 07:28:44 UTC - Sijie Guo: it sounds like the refcnt issue causes “Write did not succeed”. @Matteo Merli is that similar to the issue you fixed before?
----
2018-01-25 15:08:55 UTC - Matteo Merli: @Jaebin Yoon for the general instability, in particular when replaying old data, I'm starting to suspect some non-optimal config/env setting. Can you verify that, in the bookies, journal and storage are set to different disk devices? I saw earlier you mentioned the 12 HDDs in RAID 0. The normal config there would be to dedicate 1 or 2 disks to the journal and the rest to storage, with 2 RAID groups.
----
2018-01-25 17:13:58 UTC - Jaebin Yoon: The current setup is that both journal and storage are on a RAID 0 volume over 12 HDDs. I'll check if we can separate one of the 12 out for the journal; that configuration is hard to change. Usually how big should the journal disk be?
----
2018-01-25 17:31:12 UTC - Matteo Merli: The journal doesn’t have to be big, though the key is for it to be a dedicated disk. That’s what allows reads and writes to be decoupled completely in the bookies.

The journal is composed of 2 GB files, and it gets “trimmed” whenever the data is flushed to the storage device. (The default config is to keep the last 5 journal files around, so most of the time the space required is ~10 GB.) Of course the IO bandwidth of the journal is what’s typically critical, but without fsync, even a single HDD should give very decent throughput.
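For illustration, in `conf/bookkeeper.conf` that would look something like this (the paths are just placeholders for your own mount points, and the values mirror the stock defaults):
```
# Dedicated device (or small RAID group) for the write-ahead journal
journalDirectory=/mnt/journal/bookkeeper/journal

# Remaining disks (e.g. the second RAID group) for ledger storage
ledgerDirectories=/mnt/storage/bookkeeper/ledgers

# Journal files are rolled at 2 GB and only the last few are kept around
journalMaxSizeMB=2048
journalMaxBackups=5

# If your BookKeeper version supports it, skip fsync on every journal add
journalSyncData=false
```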
----
2018-01-25 17:31:25 UTC - Matteo Merli: (providing the HDD is dedicated)
----
2018-01-25 17:40:01 UTC - Jaebin Yoon: Ok. I'll look for a way to set up a separate disk for the journal before scaling up more.
----
2018-01-25 17:50:50 UTC - Jaebin Yoon: I got this error and I'm not sure why I'm getting it, since the consumers are now acknowledging messages.
```2018-01-25 17:21:13,504 WARN  org.apache.pulsar.client.impl.ClientCnx:359 [pulsar-client-io-1-1] [handleError] [id: 0x45292a46, L:/100.82.118.47:46366 - R:prod.pulsarbroker.us-east-1.dynprod.netflix.net/100.82.102.123:6650] Received error from server: Cannot create producer on topic with backlog quota exceeded
2018-01-25 17:21:13,504 WARN  org.apache.pulsar.client.impl.ClientCnx:362 [pulsar-client-io-1-1] [handleError] [id: 0x45292a46, L:/100.82.118.47:46366 - R:prod.pulsarbroker.us-east-1.dynprod.netflix.net/100.82.102.123:6650] Producer creation has been blocked because backlog quota exceeded for producer topic```
----
2018-01-25 17:51:34 UTC - Matteo Merli: You can check the topic stats to see the backlog and if there are any dangling subscriptions
----
2018-01-25 17:51:56 UTC - Matteo Merli: `bin/pulsar-admin persistent stats $TOPIC`
----
2018-01-25 17:52:13 UTC - Jaebin Yoon: ah.. i see. if there are subscriptions created for the topic but nothing reading from them, that would affect the producing side.
----
2018-01-25 17:52:33 UTC - Matteo Merli: that’s the default behavior; you can configure it to drop all data
----
2018-01-25 17:52:44 UTC - Matteo Merli: all data ---> “old data”
----
2018-01-25 17:54:04 UTC - Matteo Merli: eg: 
`bin/pulsar-admin namespaces set-backlog-quota $MY_NAMESPACE --limit 1T --policy consumer_backlog_eviction`
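The other available policies are `producer_request_hold` (the default, which is likely what blocked your producer here) and `producer_exception`. You can check what is currently applied on a namespace with (same namespace placeholder as above):
```
bin/pulsar-admin namespaces get-backlog-quotas $MY_NAMESPACE
```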
----
2018-01-25 17:54:05 UTC - Jaebin Yoon: ok. got it. thanks! we're so used to creating ad-hoc consumers and leaving them there.
----
2018-01-25 17:54:55 UTC - Jaebin Yoon: it makes sense.
----
2018-01-25 17:55:14 UTC - Matteo Merli: that will start “trimming” the backlog once the quota is reached. All the defaults we have are meant to never delete data unexpectedly
----
2018-01-26 00:54:38 UTC - Jaebin Yoon: While doing the load tests, all consumer connections got closed and I saw the following exceptions on the broker.
```2018-01-26 00:38:08,489 - ERROR - [pulsar-web-61-28:PulsarWebResource@381] - [null] Failed to validate namespace bundle netflix/prod/ns1/0x84000000_0x86000000
java.lang.IllegalArgumentException: Invalid upper boundary for bundle
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
	at org.apache.pulsar.common.naming.NamespaceBundles.validateBundle(NamespaceBundles.java:110)
	at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleRange(PulsarWebResource.java:378)
	at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleOwnership(PulsarWebResource.java:404)
	at org.apache.pulsar.broker.admin.Namespaces.splitNamespaceBundle(Namespaces.java:876)
	at sun.reflect.GeneratedMethodAccessor127.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:143)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)```
----
2018-01-26 00:56:48 UTC - Jaebin Yoon: It seems it tried to split the bundles from 4 to a bigger number when the traffic got heavy.
----
2018-01-26 00:58:02 UTC - Matteo Merli: > all consumer connections got closed

That is expected, since the default configuration is to “unload” the newly split bundles immediately, giving them a chance to be reassigned to a different broker right away
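This behavior is driven by a handful of broker.conf settings; roughly (key names from the stock broker.conf, values here are only illustrative, check your broker.conf for the actual defaults):
```
# Automatically split bundles that grow too hot
loadBalancerAutoBundleSplitEnabled=true

# Unload newly split bundles so they can be reassigned to other brokers
loadBalancerAutoUnloadSplitBundlesEnabled=true

# Thresholds that trigger a split
loadBalancerNamespaceBundleMaxTopics=1000
loadBalancerNamespaceBundleMaxMsgRate=30000
loadBalancerNamespaceBundleMaxBandwidthMbytes=100

# Upper bound on the number of bundles per namespace
loadBalancerNamespaceMaximumBundles=128
```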
----
2018-01-26 01:00:32 UTC - Matteo Merli: I think the exception that gets printed is related to the fact that the bundles list gets updated by all the splits happening in parallel. I don’t think that should pose a big problem; the split for that particular bundle should be re-attempted at the next iteration
----
2018-01-26 01:01:57 UTC - Matteo Merli: (note that it should be avoided.. )
----
2018-01-26 01:01:57 UTC - Jaebin Yoon: somehow the consumers stopped consuming right after that happened.. So in this case, the bundle count should be 8 then.. i'll check if that is the case.
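(I'll probably check it with something like the following, assuming the bundle boundaries show up in the namespace policies output:)
```
bin/pulsar-admin namespaces policies netflix/prod/ns1
```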
----
2018-01-26 01:02:44 UTC - Matteo Merli: did they reconnect?
----
2018-01-26 01:07:20 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: <https://apache-pulsar.slack.com/files/U8CM86831/F8YV113EE/-.txt|Untitled>
----
2018-01-26 01:07:23 UTC - Matteo Merli: the exception itself is thrown in the validation phase, so it shouldn’t affect the running state in any way; just the split for that particular bundle has failed for now
----
2018-01-26 01:07:56 UTC - Jaebin Yoon: so it seems it failed to reconnect
----
2018-01-26 01:08:01 UTC - Matteo Merli: Oh I see, you got an exception in the `consumer.acknowledge()`
----
2018-01-26 01:09:00 UTC - Matteo Merli: which is to be expected from an API perspective
----
2018-01-26 01:09:28 UTC - Jaebin Yoon: I see..
----
2018-01-26 01:10:01 UTC - Jaebin Yoon: so I need to catch that and retry the message?
----
2018-01-26 01:10:44 UTC - Matteo Merli: `acknowledge()` will fail when not connected to the broker. There’s not much else to do
----
2018-01-26 01:11:09 UTC - Matteo Merli: retrying the acknowledgement is not needed either, because the message will be replayed by the broker anyway
----
2018-01-26 01:11:38 UTC - Matteo Merli: the easiest way to deal with it is to just use `consumer.acknowledgeAsync()` and not bother to track the future
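e.g. a minimal sketch of that pattern with the Java client (the `handle()` method is just a placeholder for your own processing):
```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClientException;

void consumeLoop(Consumer consumer) throws PulsarClientException {
    while (true) {
        Message msg = consumer.receive();
        handle(msg); // application-specific processing (placeholder)

        // Fire-and-forget ack: if the connection to the broker is down the
        // ack is lost, but the broker will simply redeliver the message,
        // so there is no need to track or retry the returned future.
        consumer.acknowledgeAsync(msg);
    }
}
```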
----
2018-01-26 01:12:18 UTC - Jaebin Yoon: ok. let me try that. thanks!
----
2018-01-26 01:13:33 UTC - Jaebin Yoon: regarding the split failure, it failed to split because of the race condition and that is ok?
----
2018-01-26 01:14:23 UTC - Jaebin Yoon: it should be split .. if it fails, will it retry?
----
2018-01-26 01:14:57 UTC - Matteo Merli: yes, the same logic that leads to the split is checked periodically (traffic, # of topics, …)
----
2018-01-26 01:15:34 UTC - Jaebin Yoon: alright. cool. thanks!
----
2018-01-26 01:26:48 UTC - Matteo Merli: btw: I’ve been merging most of the fixes to issues you encountered:

Default limit on number of lookups with 10K partitions:
<https://github.com/apache/incubator-pulsar/pull/1116>

Fix for Kafka consumer wrapper slow consumption (this will be merged shortly):
<https://github.com/apache/incubator-pulsar/pull/1115>

Fix for the race condition in producer code:
<https://github.com/apache/incubator-pulsar/pull/1108>

Reducing the exception printed in logs after disconnections:
<https://github.com/apache/incubator-pulsar/pull/1107>

Load balancer not being enabled by default in embedded broker:
<https://github.com/apache/incubator-pulsar/pull/1104>


My next step is to test with your VM configuration to verify under the same conditions. So far my test VMs were the ones that I deploy from the messaging benchmark deployment scripts, and I cannot see any more issues there at this point
----