Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/11/09 09:11:04 UTC

Slack digest for #general - 2018-11-09

2018-11-08 09:12:03 UTC - David Tinker: I have created a PR for this <https://github.com/apache/pulsar/pull/2961>
----
2018-11-08 11:17:20 UTC - Christophe Bornet: @Christophe Bornet has joined the channel
----
2018-11-08 12:52:41 UTC - Chris Miller: We have a project with dependencies on 2.2.0 versions of pulsar-client, pulsar-client-admin, pulsar-client-flink, pulsar-functions-api. This results in classpath collisions related to shaded vs unshaded Netty code, e.g.:
```Exception in thread "PulsarListenerVerticle" java.lang.NoSuchMethodError: org.apache.pulsar.common.util.netty.EventLoopUtil.newEventLoopGroup(ILjava/util/concurrent/ThreadFactory;)Lorg/apache/pulsar/shade/io/netty/channel/EventLoopGroup;
        at org.apache.pulsar.client.impl.PulsarClientImpl.getEventLoopGroup(PulsarClientImpl.java:810)
        at org.apache.pulsar.client.impl.PulsarClientImpl.<init>(PulsarClientImpl.java:124)
        at org.apache.pulsar.client.impl.ClientBuilderImpl.build(ClientBuilderImpl.java:54)```
Seems like there's a fundamental problem with some Pulsar jars containing shaded classes and others containing unshaded ones. Any ideas?

[in our case the conflict seems to be from having both pulsar-common-2.2.0.jar and pulsar-client-2.2.0.jar on the classpath, where the client jar contains shaded versions of some of the contents of common]
----
2018-11-08 13:34:00 UTC - Ivan Kelly: have you tried with pulsar-client-original?
----
2018-11-08 13:34:10 UTC - Ivan Kelly: i.e. the non-shaded version?
----
2018-11-08 13:59:23 UTC - Chris Miller: Interesting, I didn't realise that's what it was for. Presumably that will be fine as long as I get everything shaded off the classpath (or vice versa). I just find it surprising that e.g. the pulsar-common and pulsar-client jars don't play well together.
----
2018-11-08 13:59:58 UTC - Ivan Kelly: ya, that is surprising
----
2018-11-08 14:00:09 UTC - Ivan Kelly: doesn't pulsar-client pull common though?
----
2018-11-08 14:03:43 UTC - Chris Miller: Haven't finished investigating this yet so I'm not 100% sure. Before I upgraded to 2.2.0, we weren't seeing pulsar-common-2.1.1.jar being pulled into our classpath. I updated our dependency to 2.2.0 and also added pulsar-flink, and now we're seeing pulsar-common being pulled in, so I still need to check what's pulling it in. Also, <https://github.com/apache/pulsar/pull/2783> might be related somehow?
----
2018-11-08 14:31:45 UTC - Ivan Kelly: ah, it's probably pulsar-flink pulling it in
----
2018-11-08 14:32:21 UTC - Ivan Kelly: hmm, no
----
2018-11-08 14:33:21 UTC - Ivan Kelly: ah, could be something to do with <https://github.com/apache/pulsar/commit/e72ed35cc2b07e2ec39b84a23ae1819d94b152f4>
----
2018-11-08 14:43:05 UTC - Chris Miller: Yes it is pulsar-flink. If I take a shell project with just that as a dependency, I end up with pulsar-client-2.2.0.jar, pulsar-client-original-2.2.0.jar, pulsar-common-2.2.0.jar, pulsar-flink-2.2.0.jar (and a whole bunch of other deps). And if I turn off transitive dependencies for pulsar-flink in the real project, the problem seems to be solved
----
2018-11-08 15:56:55 UTC - tuan nguyen anh: @tuan nguyen anh has joined the channel
----
2018-11-08 16:00:59 UTC - tuan nguyen anh: Hello all, I will be giving a presentation on Pulsar, so I am looking for a tool or framework to test Pulsar performance, but I haven't found one. If you know of one, please tell me. Thank you
----
2018-11-08 16:03:10 UTC - Matteo Merli: <https://www.slideshare.net/mobile/merlimat/high-performance-messaging-with-apache-pulsar>
----
2018-11-08 16:05:55 UTC - tuan nguyen anh: thank you, but I need to run the tests on my own PC and show the results to my mentor
----
2018-11-08 16:06:07 UTC - Matteo Merli: @Chris Miller it’s some problem with the shade plugin that is including the transitive dependencies of a sibling module which gets shaded and should not have these dependencies. 

The commit that Ivan pointed out should fix the issue, since shading was not really needed in the pulsar-flink module itself.

We'll release it as part of the 2.2.1 release soon
----
2018-11-08 16:09:14 UTC - Chris Miller: OK great to hear, thanks for the explanation and quick fix. In the meantime I'm happy disabling the transitive dependencies for 2.2.0 in my build script
----
2018-11-08 16:11:29 UTC - Matteo Merli: Then you can use the `pulsar-perf` tool to send and receive traffic. More complex scenarios can be modeled through the OpenMessaging benchmark.
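For example (topic name and publish rate are illustrative; run from the Pulsar installation directory against a running broker):
```
bin/pulsar-perf produce my-topic -r 10000   # publish at ~10k msg/s
bin/pulsar-perf consume my-topic            # subscribe and report receive rate / latency
```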
----
2018-11-08 16:20:15 UTC - tuan nguyen anh: I found it
Thank you and have a nice day
----
2018-11-08 20:25:22 UTC - Beast in Black: Hi Guys (and @Matteo Merli), I am using the pulsar cpp client in a C++ application on which I'm currently working, and I have a couple of questions regarding some odd behavior I've seen.

Some background: The application publishes a *lot* of data on multiple non-persistent topics (about 0.75KB to 1KB per second per topic), and I am seeing some OOM issues and node/instance restarts (My app is on AWS along with the pulsar infra - brokers, bookies, zookeeper etc, all as K8s pods in AWS). My current understanding is that non-persistent topics are in-memory only and do not persist anything to disk.

My questions are:
1. Would the vast amount of data published to the non-persistent topics cause an OOM condition on my server (the AWS node/instance in this case)?
2. If the answer to the above is "yes", is there any way to mitigate this issue? Maybe by setting memory constraints on the JVM?
----
2018-11-08 20:32:14 UTC - Beast in Black: Additionally,
3. If I set memory constraints on the JVM, what would happen to messages that are published once the JVM hits the memory limit?
4. If I delete my application's K8s pod (it is a K8s StatefulSet), then when the pod is restarted and the app comes back up, I see producer/consumer connect errors in the logs; these messages are from the pulsar client.

For (4) above I suspect that this is because when my app pod is restarted, it attempts to subscribe to the topics using the *same subscription name*. In this case, I'm wondering if the pulsar client errors are because from pulsar's perspective these subscriptions are still active, and so it errors out when I try to resubscribe using the same subscription name. When these errors happen, I've noticed that one way to fix it is to delete the broker pod first (which gets automatically restarted) and then delete my application pod (which also gets automatically restarted).

Doing this fixes the errors, which leads me to conclude that restarting the broker pod clears out existing subscriptions and so allows my application to come back up without the pulsar connect errors.

To fix this, my idea is to first unsubscribe the subscription name used by my app before I subscribe again to the topic. So if I do this, *will it be safe, i.e. will unsubscribing lead to loss of messages in the topic during the time between unsubscribing and then re-subscribing?*
----
2018-11-08 21:26:12 UTC - Emma Pollum: I'm having trouble restarting my pulsar nodes. When I do a service restart it no longer connects to zookeeper, with an exception saying that the znode is owned by another session. Anyone run into this before?
----
2018-11-08 21:32:11 UTC - Emma Pollum: 
----
2018-11-08 22:03:28 UTC - Han Tang: Hi Experts, I am a newbie to Pulsar. Ignore me if I am asking a stupid question! Do any of you have interesting stories to share when your team was making the choice between Pulsar and Kafka?
----
2018-11-08 22:19:42 UTC - Beast in Black: @Han Tang IMNSHO this is a question best asked in the `random` slack channel. That said, here is some info for you on this matter:
<https://streaml.io/blog/pulsar-streaming-queuing>
<https://jack-vanlightly.com/blog/2018/9/14/how-to-lose-messages-on-a-kafka-cluster-part1> *along with* <https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster>

If you are a newbie, you may find this very informative and explanatory blog post extremely helpful (as I did): <https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works>
+1 : Matteo Merli, Ali Ahmed
----
2018-11-08 22:22:15 UTC - Beast in Black: @Han Tang some more informative posts:
<https://www.businesswire.com/news/home/20180306005633/en/Apache-Pulsar-Outperforms-Apache-Kafka-2.5x-OpenMessaging>
<http://www.jesse-anderson.com/2018/08/creating-work-queues-with-apache-kafka-and-apache-pulsar/>
+1 : Matteo Merli, Ali Ahmed
----
2018-11-08 22:34:44 UTC - Matteo Merli: That can happen with an abrupt shutdown. The session will automatically expire (e.g. after 30 sec) and the process will eventually succeed in restarting.

In any case, it’s better to do a graceful shutdown of processes, by sending SIGTERM instead of SIGKILL. That will ensure that the ZK session is properly closed and also ensure a smoother topic failover. 
----
2018-11-08 23:06:23 UTC - Christophe Bornet: Hi, I'm trying to add a dead letter policy to the websocket proxy.
I've added the lines
```
if (queryParams.containsKey("maxRedeliverCount")) {
    builder.deadLetterPolicy(DeadLetterPolicy.builder()
            .maxRedeliverCount(Integer.parseInt(queryParams.get("maxRedeliverCount")))
            .deadLetterTopic("DLQ")
            .build());
}
```
to ConsumerHandler::getConsumerConfiguration(), but it doesn't seem to work. If I don't ack the messages I still get them in a loop. Does someone know what I'm doing wrong?
----
2018-11-08 23:13:55 UTC - Christophe Bornet: Oh, stupid me: I forgot to set the subscription as shared... I'll do a PR to add DLQ to websockets if that's fine for you
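For reference, a rough sketch of the equivalent setup with the Java client (service URL, topic, subscription name and counts are made up for illustration; the dead letter policy only kicks in for a shared subscription, with redelivery driven by the ack timeout):
```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class DlqConsumerSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")               // illustrative
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")       // illustrative
                .subscriptionName("my-sub")
                .subscriptionType(SubscriptionType.Shared)           // DLQ needs a shared subscription
                .ackTimeout(30, TimeUnit.SECONDS)                    // unacked messages get redelivered...
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)                        // ...and land on the DLQ after 3 redeliveries
                        .deadLetterTopic("persistent://public/default/my-topic-DLQ")
                        .build())
                .subscribe();

        // receive()/acknowledge() loop would go here; messages that are never acked
        // end up on the dead letter topic once maxRedeliverCount is exceeded.

        consumer.close();
        client.close();
    }
}
```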
----
2018-11-08 23:15:24 UTC - Matteo Merli: Nice! :+1:
----
2018-11-08 23:26:51 UTC - Beast in Black: @Matteo Merli any insights you could give me on the questions I posted earlier?
----
2018-11-08 23:28:22 UTC - Matteo Merli: Missed those, let me take a look
----
2018-11-08 23:34:32 UTC - Beast in Black: @Matteo Merli np, thank you!
----
2018-11-08 23:56:27 UTC - Matteo Merli: @Beast in Black

> 1. Would the vast amount of data published to the non-persistent topics cause an OOM condition on my server (the AWS node/instance in this case)?
> 2. If the answer to the above is “yes”, is there any way to mitigate this issue? Maybe by setting memory constraints on the JVM?

Data for non-persistent topics is not “stored” in the broker either: messages are immediately dispatched to consumers, or dropped if a consumer is not ready. The Pulsar broker acts like a proxy in this case; the only thing to pay attention to is that all these in-flight messages use memory (direct memory in this case).

There are ways to reduce the amount of memory used by Netty (through Netty-specific configs), and we have work in progress to make it simpler to configure that in Pulsar, and also to have a fallback strategy before giving up with an OOM (see: <https://github.com/apache/pulsar/wiki/PIP-24%3A-Simplify-memory-settings>)

For this specific problem, I’d suggest increasing the direct memory limit in the broker and monitoring the actual direct memory used.
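For example, with purely illustrative values (not a recommendation), the broker heap and direct memory limits can be raised via `PULSAR_MEM` in `conf/pulsar_env.sh`:
```
# conf/pulsar_env.sh -- illustrative sizes only, tune to your instance
PULSAR_MEM="-Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g"
```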

> 3. If I set memory constraints on the JVM, what would happen to messages that are published once the JVM hits the memory limit?

There would be no change to the semantics. The problem is that if you hit the JVM memory limit, errors will be triggered.

> 4. If I delete my application’s K8s pod (it is a K8s StatefulSet), then when the pod is restarted and the app comes back up, I see producer/consumer connect errors in the logs; these messages are from the pulsar client.

The client library internally retries to reconnect until it succeeds; if you have multiple broker pods, the reconnection time should be very quick.

> For (4) above I suspect that this is because when my app pod is restarted, it attempts to subscribe to the topics using the *same subscription name*. In this case, I’m wondering if the pulsar client errors are because from pulsar’s perspective these subscriptions are still active, and so it errors out when I try to resubscribe using the same subscription name. When these errors happen, I’ve noticed that one way to fix it is to delete the broker pod first (which gets automatically restarted) and then delete my application pod (which also gets automatically restarted).

I’m not sure I get what’s happening there, but I don’t think it is a problem of the subscription still being there. In any case the consumer will have disconnected, so the broker will let it re-subscribe to the same subscription again.

> To fix this, my idea is to first unsubscribe the subscription name used by my app before I subscribe again to the topic. So if I do this, *will it be safe, i.e. will unsubscribing lead to loss of messages in the topic during the time between unsubscribing and then re-subscribing?* (edited)

Keep in mind that, with non-persistent topics, data is never retained, so subscriptions are not persistent either.
----
2018-11-09 00:07:37 UTC - Beast in Black: @Matteo Merli that was very helpful, thank you very much.
----
2018-11-09 00:16:46 UTC - Beast in Black: @Matteo Merli a couple of follow-up questions:

> There would be no change to the semantics. The problem is that if you hit the JVM memory limit, errors will be triggered.
So would the messages (especially for non-persistent topics) still be published/received even if the JVM hits the limit? I imagine not, since that would - as you say - cause errors to be triggered.

> I’m not sure I get what’s happening there, but I don’t think it is a problem of the subscription still being there. In any case the consumer will have disconnected, so the broker will let it re-subscribe to the same subscription again.
What would happen if my app process which runs the consumer never cleanly disconnected, but experienced a crash or was terminated using `SIGKILL`? Would the re-subscription (to persistent topics) still work in this case using the same subscription name?
----
2018-11-09 00:22:33 UTC - Matteo Merli: 1. When getting OOMs the JVM process might be left in a bad state. That’s why I was recommending setting the flags with a decent amount of memory if you expect large throughput. The defaults we have are conservative, meant to be reasonable on many deployments, but should be increased for higher throughput.


2. If a consumer is abruptly killed, crashes, or is simply partitioned from a broker, the broker will evict it after a while. There is a health check in the Pulsar protocol to detect these cases (both for clients and brokers)
----
2018-11-09 00:25:28 UTC - Beast in Black: @Matteo Merli For (1) I will experiment with different values higher than the defaults, and see what happens.

For (2), what would happen in the case that the consumer comes back and tries to re-subscribe (using the same name) *before* the broker eviction kicks in (say within a minute of crashing/being killed)? Would the broker allow the re-subscription?
----
2018-11-09 00:32:34 UTC - Matteo Merli: It depends on the subscription type: for an exclusive subscription it will give an error. For failover and shared subscriptions it will allow the consumer to get attached
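For example, with the Java client (topic and subscription names are made up; the C++ client has an analogous consumer-type setting), a quickly restarted consumer can re-attach to the same subscription name when it is not exclusive:
```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ResubscribeSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")                // illustrative
                .build();

        // The default subscription type is Exclusive: while the broker still considers the
        // previous (crashed) consumer connected, a new consumer on the same subscription
        // name is rejected. Shared (or Failover) lets the new consumer attach right away.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("non-persistent://public/default/my-topic")    // illustrative
                .subscriptionName("my-app-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        consumer.close();
        client.close();
    }
}
```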
----
2018-11-09 00:33:56 UTC - Beast in Black: @Matteo Merli thanks!
+1 : Matteo Merli
----
2018-11-09 00:57:16 UTC - Beast in Black: @Matteo Merli just an FYI - I checked our application code, and it seems that on the app pod where I notice the consumer connect errors, the subscription type is *not* explicitly set to shared - it uses the default mode, which is exclusive mode as per `<https://pulsar.apache.org/docs/latest/getting-started/ConceptsAndArchitecture/#Exclusive-u84f1>` .

In light of the info you gave me above, it seems that this is most likely the cause of my woes: when that pod crashes, it recovers within a few minutes, so the broker does not have time to reap the previous subscription before the app pod tries to subscribe again with the same name. I will change the code to explicitly set the subscription mode to shared and see if my troubles melt away :smile:
----
2018-11-09 06:01:56 UTC - ytseng: @ytseng has joined the channel
----
2018-11-09 07:51:32 UTC - tuan nguyen anh: Hi all, I have a problem with Kafka and Pulsar. When I run the Kafka and Pulsar perf tools on my laptop with an HDD, Kafka gets ~60k msg/s and Pulsar 18k msg/s. Why are they so different?
----
2018-11-09 07:54:05 UTC - Ali Ahmed: how are you running the kafka test?
----
2018-11-09 07:54:30 UTC - tuan nguyen anh: kafka-producer-perf-test.sh --topic TEST --num-records 18000000 --producer-props bootstrap.servers=localhost:9092 --throughput 500000 --record-size 1024
----
2018-11-09 07:55:14 UTC - tuan nguyen anh: this is my command
----
2018-11-09 07:56:03 UTC - tuan nguyen anh: but if I run on an SSD, Pulsar gets 180k and Kafka 100k
----
2018-11-09 07:59:31 UTC - Ali Ahmed: I think the kafka producer is set up to batch
----
2018-11-09 08:04:46 UTC - Sijie Guo: @tuan nguyen anh how do you run pulsar perf?
----
2018-11-09 08:07:34 UTC - Karthik Ramasamy: @Sijie Guo - is this due to the non-separation of journal vs data?
----
2018-11-09 08:12:25 UTC - Sijie Guo: there are a couple of factors:

1) how many partitions does the kafka topic `TEST` have, and how many partitions does the pulsar topic have in the test?
2) fsync is on by default in pulsar, while kafka never does fsync. if you want a better apples-to-apples comparison, you should consider setting `journalSyncData` to `false` in conf/bookkeeper.conf
3) for pulsar, if using ssd, configure multiple journal directories to improve io parallelism; if using hdd, separate the journal from the ledgers directories, for example:
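A rough sketch of the relevant conf/bookkeeper.conf entries (paths are illustrative; check the exact key names against the bookkeeper.conf shipped with your Pulsar version):
```
# trade journal durability for throughput, for an apples-to-apples comparison with Kafka
journalSyncData=false
# on HDD, keep the journal and the ledger storage on separate disks
journalDirectory=/mnt/disk1/bookkeeper/journal
ledgerDirectories=/mnt/disk2/bookkeeper/ledgers
```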
----
2018-11-09 08:12:26 UTC - tuan nguyen anh: pulsar-perf produce -r 500000 -time 30 my-topic
----
2018-11-09 08:14:26 UTC - Karthik Ramasamy: If you are using HDD, use a separate disk for the journal and a separate disk for data - otherwise you will get random seeks on the same disk, which affects the publish throughput
----
2018-11-09 08:16:57 UTC - tuan nguyen anh: I think Pulsar stores data in files and an SSD is good for that, is that right?
----
2018-11-09 08:18:40 UTC - Sijie Guo: both pulsar and kafka store data in files, just with different storage formats. the difference is mainly in whether data is persisted to disk or not (pulsar does fsync to ensure no data loss, but kafka doesn’t).
----
2018-11-09 08:23:43 UTC - tuan nguyen anh: I tried editing the fsync setting in Pulsar, but nothing changed
----
2018-11-09 08:24:35 UTC - tuan nguyen anh: Throughput produced:  12861.0  msg/s ---    100.5 Mbit/s --- Latency: mean:  83.247 ms - med:   4.227 - 95pct:  11.610 - 99pct:  23.696 - 99.9pct: 10051.391 - 99.99pct: 10051.519 - Max: 10054.015
----
2018-11-09 08:25:14 UTC - Sijie Guo: did you restart pulsar after editing fsync?
----
2018-11-09 08:25:35 UTC - Sijie Guo: and this is hdd?
----
2018-11-09 08:25:47 UTC - tuan nguyen anh: yes, I set journalSyncData = false and restarted Pulsar, on an HDD
----
2018-11-09 08:30:08 UTC - Sijie Guo: are you running standalone or ? are all settings default?
----
2018-11-09 08:31:04 UTC - tuan nguyen anh: I'm running pulsar standalone
----
2018-11-09 08:36:28 UTC - Sijie Guo: that’s a bit weird. your latency doesn’t look like `fsync` is disabled. your mean latency is 83ms, and the 99.9pct latency is up to 10 seconds
----
2018-11-09 08:37:20 UTC - Sijie Guo: oh
----
2018-11-09 08:37:41 UTC - Sijie Guo: since you are running standalone, you should modify conf/standalone.conf
----
2018-11-09 08:40:04 UTC - Sijie Guo: nvm, I don’t think you should use standalone for perf
----
2018-11-09 08:40:28 UTC - tuan nguyen anh: I checked and fsync is set to false in standalone.conf
----
2018-11-09 08:56:50 UTC - Sijie Guo: yeah, standalone is setting fsync to false. so the question would be why it takes 10 seconds to write to your hdd.

what is your memory? if you are on linux, can you run `free` to see how your memory was used?
----
2018-11-09 09:09:43 UTC - tuan nguyen anh: Here:
              total        used        free      shared  buff/cache   available
Mem:        7860764     3417044     3336700      310032     1107020     3727636
Swap:       8075260      387584     7687676
----