Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2019/04/20 09:11:02 UTC

Slack digest for #general - 2019-04-20

2019-04-19 11:47:28 UTC - Mr BECHAMKI: @Mr BECHAMKI has joined the channel
----
2019-04-19 13:18:44 UTC - stefan: Hi. I am having trouble re-initializing the cluster metadata. I end up with: Exception in thread "main" org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /namespace
----
2019-04-19 13:31:06 UTC - Ruud Kamphuis: @Ruud Kamphuis has joined the channel
----
2019-04-19 13:33:55 UTC - stefan: Hi guys. When running locally on my laptop, I end up with a connection refused error: Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:6650. Any help appreciated.
----
2019-04-19 13:34:47 UTC - Ruud Kamphuis: There seems to be a typo in your address: it reads `localhost/127.0.0.1:6650`, that's not good
----
2019-04-19 13:35:03 UTC - Ruud Kamphuis: it should be `localhost:6650` or `127.0.0.1:6650`
----
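For anyone debugging the same error: `localhost/127.0.0.1:6650` is the form in which Java's `InetSocketAddress` prints a resolved address (hostname, then IP, then port), so the text alone does not necessarily mean the service URL is malformed. A quick way to tell whether anything is listening on the broker port is a plain TCP connect. The sketch below is only a diagnostic aid and assumes the default host and port from the error message; the function name `broker_port_open` is made up for illustration.

```python
import socket


def broker_port_open(host="localhost", port=6650, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name-resolution failures.
        return False
```

Calling `broker_port_open()` while `bin/pulsar standalone` is still starting up would be expected to return `False` until the broker has bound its port.
----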
2019-04-19 13:36:06 UTC - Ruud Kamphuis: Hello everyone. I read the whole FAQ (<https://github.com/apache/pulsar/blob/master/faq.md>) but couldn't find the answer to this question:

Is it possible to have multiple consumers listening to 1 topic that each have their own subscription type? For example, I have an ETL consumer that wants to make sure it gets all the messages. And I have a Stats consumer that keeps track of stats. I want to make sure there is only 1 ETL consumer, and only 1 Stats consumer.
----
2019-04-19 13:36:38 UTC - Ruud Kamphuis: As far as I know, 1 topic can only have 1 subscription? Or is there a way to somehow group consumers by consumerName?
----
2019-04-19 13:37:28 UTC - stefan: agreed. I just downloaded it with wget, launched standalone with `bin/pulsar standalone`, and called: `./bin/pulsar-client produce my-topic --messages "hello-pulsar"`
----
2019-04-19 13:37:35 UTC - stefan: i did not even touch the conf
----
2019-04-19 13:38:15 UTC - Sijie Guo: 1 topic can have as many subscriptions as you need
----
2019-04-19 13:38:25 UTC - Sijie Guo: each subscription can choose its own subscription type.
----
2019-04-19 13:38:38 UTC - Sijie Guo: consumers that use the same subscription name are in the same consumer group.
----
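The semantics described above can be illustrated with a toy in-memory model. This is not the Pulsar client API: the `Topic` class, its methods, and the `"exclusive"`/`"shared"` strings are all invented for this sketch, and the round-robin dispatch only loosely imitates a shared subscription. It shows that every subscription gets its own copy of each message, while consumers under the same subscription name split the messages between them.

```python
class Topic:
    """Toy stand-in for a topic; illustrative only, not the Pulsar API."""

    def __init__(self):
        # subscription name -> {"type": str, "inboxes": [list, ...], "next": int}
        self._subs = {}

    def subscribe(self, name, sub_type="exclusive"):
        sub = self._subs.setdefault(
            name, {"type": sub_type, "inboxes": [], "next": 0}
        )
        # An exclusive subscription admits a single consumer.
        if sub["type"] == "exclusive" and sub["inboxes"]:
            raise RuntimeError(f"subscription {name!r} already has a consumer")
        inbox = []
        sub["inboxes"].append(inbox)
        return inbox

    def publish(self, message):
        # Each subscription sees the message once; within a subscription
        # exactly one consumer receives it (round-robin in this model).
        for sub in self._subs.values():
            inboxes = sub["inboxes"]
            inboxes[sub["next"] % len(inboxes)].append(message)
            sub["next"] += 1


topic = Topic()
etl = topic.subscribe("etl")                   # one ETL consumer
stats = topic.subscribe("stats")               # one Stats consumer
worker_a = topic.subscribe("workers", "shared")
worker_b = topic.subscribe("workers", "shared")
for n in range(4):
    topic.publish(n)
# etl and stats each hold all four messages; worker_a and worker_b hold two each.
```

In this model, a second `topic.subscribe("etl")` raises, which corresponds to the "only 1 ETL consumer" requirement in the question.
----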
2019-04-19 13:44:34 UTC - Ruud Kamphuis: Is this documented somewhere? Because I read through the whole docs and FAQ but couldn't find it.
----
2019-04-19 13:44:40 UTC - Ruud Kamphuis: (thanks for your answer btw!)
----
2019-04-19 13:46:35 UTC - Ruud Kamphuis: Ah, I know what I was doing wrong.

I saw
`<ws://broker-service-url:8080/ws/v2/consumer/persistent/:tenant/:namespace/:topic/:subscription>`

And thought that `:subscription` was the type, so I entered `shared` there..

But that's just the name of the subscription, nice!
----
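A small sketch of how the consumer URL above could be assembled so that the subscription name and the subscription type are kept apart. It assumes the type is passed as the `subscriptionType` query parameter, as described in the WebSocket API documentation; the helper name `consumer_url` and the example tenant, namespace, and topic values are hypothetical.

```python
from urllib.parse import quote, urlencode


def consumer_url(host, tenant, namespace, topic, subscription,
                 subscription_type="Exclusive", port=8080):
    """Build a WebSocket consumer URL; `subscription` is a name, not a type."""
    # Percent-encode each path segment so unusual names stay in their slot.
    path = "/".join(
        quote(part, safe="") for part in (tenant, namespace, topic, subscription)
    )
    # Assumption: the subscription type travels as a query parameter.
    query = urlencode({"subscriptionType": subscription_type})
    return f"ws://{host}:{port}/ws/v2/consumer/persistent/{path}?{query}"


url = consumer_url("broker-service-url", "public", "default",
                   "my-topic", "stats", subscription_type="Shared")
```
----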
2019-04-19 13:49:39 UTC - Sijie Guo: :+1:
----
2019-04-19 13:57:35 UTC - Kai Levy: I understand ZK's general role, I am just hoping to get into the specifics. For example, does using pulsar's reader interface cause writes on ZK, like creating a subscription does?
----
2019-04-19 14:34:50 UTC - Ruud Kamphuis: Another question: when using WebSockets with a schema, is it still required to Base64-encode the `payload`? Or can you send a message like this:
```
{ "payload": { "id": 1, "event": "some-event" } }
```

Maybe I misunderstand the schemas thing.
----
2019-04-19 14:35:26 UTC - Ruud Kamphuis: So if I change my schema from `None` to `JSON`
----
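This question is not answered in the digest. The WebSocket API documentation describes `payload` as a Base64-encoded string, so a conservative sketch is to encode the JSON body regardless of the schema setting. Whether a `JSON` schema relaxes this is not confirmed here. The helper name `build_ws_message` is hypothetical.

```python
import base64
import json


def build_ws_message(event, properties=None):
    """Wrap a JSON-serializable event as a WebSocket producer frame."""
    raw = json.dumps(event).encode("utf-8")
    # The payload field carries the Base64 text of the serialized event.
    message = {"payload": base64.b64encode(raw).decode("ascii")}
    if properties:
        message["properties"] = properties
    return json.dumps(message)


frame = build_ws_message({"id": 1, "event": "some-event"})
```

On the consuming side, the original event is recovered by Base64-decoding `payload` and parsing the result as JSON.
----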
2019-04-19 15:17:38 UTC - Joe Francis: Readers have no persistent state, so no.
----
2019-04-19 15:29:43 UTC - Kai Levy: So generally speaking, is there a list of operations that do use zookeeper, and whether they are reads or writes?
----
2019-04-19 15:31:15 UTC - Kai Levy: Or a straightforward way I can analyze the source code to find operations that use zookeeper?
----
2019-04-19 15:44:06 UTC - Joe Francis: Topics and subscriptions have state and metadata, so they have ZK entries, and this metadata gets updated if you create/delete them or set properties on them. Then there are BookKeeper ledgers associated with the topics and cursors, which get updated when data files get rolled over. You can look at ManagedLedgerInfo.java to see what metadata is kept.
----
2019-04-19 16:05:54 UTC - Kai Levy: Does creating consumers on existing subscriptions ever write to zk? Or just read?
----
2019-04-19 16:14:42 UTC - Sébastien de Melo: Hi guys!
We are encountering a very weird error with our Pulsar function. It has 2 input topics, and when we run a load test on 1 topic, the function eventually stops listening to this topic and never recovers. The messages sent to the other topic are still processed though (confirmed by the stats subcommand). Then we have to delete the function and recreate it so that it works again.
----
2019-04-19 16:56:08 UTC - Sanjeev Kulkarni: @Sébastien de Melo huh, that's weird. Any errors in the function log? How long after the function starts do you see this happening?
----
2019-04-19 16:56:37 UTC - Sanjeev Kulkarni: and what's the message rate on each of the topics?
----
2019-04-19 17:01:38 UTC - Ruud Kamphuis: Why is the pulsar docker 1GB big? Isn't there a Docker image available that only contains Pulsar itself?
----
2019-04-19 17:13:55 UTC - Joe Francis: In general no.
----
2019-04-19 18:03:49 UTC - Sam Leung: I have a question about phased rollout of a service that is a consumer. Our current system’s paradigm allows us to specify a percentage of traffic to route to a new deployment, e.g. 99% of traffic goes to service A v1, 1% goes to service A v2. Eventually we tweak those until all requests go to v2
In Pulsar, messages are pushed to the clients according to the subscription, so that means v1 and v2 will both process messages as fast as they can. Has precise throttling of a certain group of consumers been considered?
I see some potential solutions as:
- use consumer priority and permits to get rough distribution, but that does not actually give me control
- have consumers nack a % of received messages, but a lot of busy work and again not very precise
- create pulsar function to route messages in a distribution into v1's topic and v2's topic, but that could end up with a lot of duplication
- add something to `AbstractDispatcherMultipleConsumers` to support groups of consumers with a % of messages routed to them
Any thoughts?
----
2019-04-19 18:13:49 UTC - David Kjerrumgaard: @Ruud Kamphuis The Pulsar docker image currently includes BookKeeper, ZooKeeper, and other components that contribute to the size of the image. We could create a standalone "pulsar"-only docker image, but it would be incumbent upon the user to also spin up ZK and BK images and configure the networking between them via docker-compose or similar. So far, nobody has elected to go down that route.
----
2019-04-19 18:16:59 UTC - David Kjerrumgaard: @Sam Leung If you are looking for a short term "hack" to simulate the behavior you described, you could write a simple pulsar function that processes the message, generates a random number between 1 and 100, and routes the message to service A v1 if the number is less than 100, otherwise to service A v2.
----
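The routing rule in that "hack" can be sketched on its own, independent of the Pulsar Functions API. The topic names and the `v2_percent` parameter are hypothetical; with the default of 1, rolls 1 to 99 go to v1 and only a roll of 100 goes to v2, which matches the 99%/1% split in the question.

```python
import random

# Hypothetical output topics, one per deployed version of service A.
V1_TOPIC = "service-a-v1-input"
V2_TOPIC = "service-a-v2-input"


def topic_for_roll(roll, v2_percent=1):
    """Map a roll in 1..100 to an output topic."""
    if not 1 <= roll <= 100:
        raise ValueError("roll must be between 1 and 100")
    # The top `v2_percent` rolls are diverted to v2; the rest stay on v1.
    return V2_TOPIC if roll > 100 - v2_percent else V1_TOPIC


def pick_topic(rng=random, v2_percent=1):
    """Draw a random roll and return the topic the message should go to."""
    return topic_for_roll(rng.randint(1, 100), v2_percent)
```

Keeping the decision in a pure function like `topic_for_roll` makes the split easy to check exhaustively, and `v2_percent` could be fed from the microservice Sam mentions rather than hard-coded.
----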
2019-04-19 18:18:14 UTC - Sam Leung: @David Kjerrumgaard I understand that “hack” could work. I am trying to figure out the long term solution
----
2019-04-19 18:19:21 UTC - David Kjerrumgaard: @Sam Leung Sure, I am curious as to how the long term solution would be different from a routing perspective, i.e., how would you determine which messages go to which consumers?
----
2019-04-19 18:20:57 UTC - David Kjerrumgaard: and how would you handle slow consumers, i.e. one consumer takes longer to process messages than others, would you adapt to the back-pressure, etc? What if one of the consumers fails? should the remaining one get 100% of the traffic?
----
2019-04-19 18:22:39 UTC - Sam Leung: Ah we have a microservice that could serve those percentage numbers. If we use pulsar functions to do the routing, I am thinking we would need to cache the numbers in redis or zookeeper.
We generally have a GA version, which all traffic is routed to by default, but divert 1% (or whatever) to the new deployments.
----
2019-04-19 18:22:54 UTC - David Kjerrumgaard: Just things to consider if you want to submit a PIP, etc.
----
2019-04-19 18:23:08 UTC - Sam Leung: Each service also has multiple instances, so it should be resilient enough that the GA has at least one consumer running.
----
2019-04-19 18:23:11 UTC - Matteo Merli: There are several optimizations that could be done on the Docker image
----
2019-04-19 18:23:41 UTC - Matteo Merli: Basically that image just needs the pulsar-bin.tar.gz plus JVM
----
2019-04-19 18:23:53 UTC - Sam Leung: Definitely good things to think about in a more general scenario though.
----
2019-04-19 18:24:13 UTC - Matteo Merli: There was some discussion here: <https://github.com/apache/pulsar/pull/3602>
----
2019-04-19 18:24:39 UTC - David Kjerrumgaard: Since this use case is geared towards A/B testing (in my mind anyway), I was thinking of the case where v2 of the service has a bug in it that causes ALL instances to fail.
----
2019-04-19 18:25:57 UTC - David Kjerrumgaard: users would think that some of the messages aren't getting processed by the system. A lot of messages would go un-acked which can cause issues, etc.
----
2019-04-19 18:26:57 UTC - Sam Leung: I see.. if v2 did not have an ack timeout and doesn’t disconnect, the messages would be stuck.
----
2019-04-19 18:27:14 UTC - David Kjerrumgaard: yep
----
2019-04-19 18:28:03 UTC - Sam Leung: Okay, alternatively, if we didn’t need the precision of exact percentages, what do you think would be a good canary test to ensure v2 works?
----
2019-04-19 18:30:44 UTC - David Kjerrumgaard: Assuming that v2 would in turn distribute messages to downstream services, etc?
----
2019-04-19 18:31:09 UTC - Sam Leung: sure
----
2019-04-19 18:31:24 UTC - Ruud Kamphuis: Thanks. I get that having a standalone image is great for everybody that just wants to test Pulsar out.

However, I do find the naming of the current docker files super confusing.

pulsar
pulsar-standalone
pulsar-all

They all seem to have ZK, BK and more installed.

I expected `pulsar` to be the single pulsar package. And `pulsar-standalone` to be P,ZK,BK,Dashboard etc

Why are they the same(ish)?

If you want to go to production, then you need / want to have these services split right?
----
2019-04-19 18:33:31 UTC - Ruud Kamphuis: Thanks I will subscribe to the issue.
----
2019-04-19 18:35:21 UTC - David Kjerrumgaard: That's a good question. I'd have to think about it a bit. Can your downstream services handle duplicate messages? If so, you can have v2 create its own subscription on the incoming topic
----
2019-04-19 18:36:49 UTC - David Kjerrumgaard: yes, in a production environment these services are typically spread out.
----
2019-04-19 18:37:40 UTC - Sam Leung: That would be nice for the cases where the downstream services can handle duplicates. It would add a bit of duplicated effort, but would be well worth it. But there are some services that cannot.
----
2019-04-19 18:37:48 UTC - David Kjerrumgaard: We deploy the services separately as pods in K8s and use the configs to control which services are running in each pod
----
2019-04-19 18:38:09 UTC - Sébastien de Melo: Approximately 120 000 messages in 1 minute. The function processes between 50k and 85k and stops working. It takes a few minutes. There are some 500 errors from the API we call in the logs.
We had 9 instances of the function distributed across 3 brokers. Interestingly the problem does not occur if we create 20 instances instead of 9
----
2019-04-19 18:38:48 UTC - David Kjerrumgaard: Yea, the answer is going to be very specific to your environment
----
2019-04-19 18:39:19 UTC - Ruud Kamphuis: I created an issue on Github, <https://github.com/apache/pulsar/issues/4086> . I think it's better to have it there as others can also search for it.
----
2019-04-19 18:40:08 UTC - David Kjerrumgaard: FWIW, my "hack" would segregate the messages into different topics, and if v2 is having issues, you will be able to see that in the topic backlog, ack count, etc.
----
2019-04-19 18:40:34 UTC - David Kjerrumgaard: and it wouldn't impact the v1 flow
----
2019-04-19 18:41:22 UTC - David Kjerrumgaard: topics are cheap in Pulsar as well :smiley:
----
2019-04-19 18:44:54 UTC - Sam Leung: Makes sense. Yeah they’re cheap, but I’m thinking about the scale where we’re running at say 50% capacity, if we have 3 services that consume from the same topic on different subscriptions, and they each run their own A/B test, we suddenly are duplicating the messages into 6 topics, with 3x the number of messages.
----
2019-04-19 18:45:42 UTC - Sam Leung: But that’s relatively unlikely :slightly_smiling_face:
----
2019-04-19 18:46:05 UTC - David Kjerrumgaard: From a design perspective, I think it is best NOT to embed this behavior into the core classes, and instead use functions or similar tools to implement this and other unique behaviors, such as filtering, replicating, etc. Adding this into the base class makes the topic configuration that much more complicated.
----
2019-04-19 18:46:43 UTC - Sam Leung: I agree
----
2019-04-19 18:46:52 UTC - David Kjerrumgaard: I wouldn't worry about the scalability of Pulsar too much :smiley:
----
2019-04-19 18:47:29 UTC - David Kjerrumgaard: with proper message retention and expiration policies in place you will be fine
----
2019-04-19 18:55:04 UTC - Sam Leung: Thanks for all your help!
----
2019-04-19 19:34:06 UTC - Matteo Merli: Having ZK and BK in same image is not the reason for the big size :slightly_smiling_face:
----