Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2019/10/05 09:11:03 UTC

Slack digest for #general - 2019-10-05

2019-10-04 10:44:17 UTC - mgorelikov: @mgorelikov has joined the channel
----
2019-10-04 11:45:40 UTC - Roman: Hey Guys

I'm struggling with BookKeeper and Presto, and I hope someone has a hint. I'm running a single BookKeeper node on Azure with 2 vCores, 4GB of RAM, and two dedicated SSDs for the BK ledgers and journal. At the moment the system produces about 550-600 messages/s, which are stored in Pulsar for 24 hours. This accumulates to roughly 60GB of data. Now I want to search through this data using Presto. However, a simple SELECT gives me very poor performance. For example, a query like `SELECT 1 FROM "topic-0" WHERE "session.id" = '1';` takes hours (even days), reading 300 messages/s with a throughput of about 400kB/s. Reading from the same topic through a broker gives me much better throughput. I observed 100% CPU usage on the BookKeeper server while a query is running. The query, of course, is executed on a different machine.

I guess I misconfigured something here... Or is this the performance I can expect from a single BK node? Any help is appreciated.
----
2019-10-04 12:49:09 UTC - Tim Howard: @Tim Howard has joined the channel
----
2019-10-04 12:51:14 UTC - Tim Howard: Greetings. We would like to use delayed message delivery in the go client. But, unless I'm mistaken, it does not appear to be implemented/exposed yet. Is that accurate and if it is, is anyone working on it?
----
2019-10-04 15:33:45 UTC - David Kjerrumgaard: @Roman Currently, the data stored in BK is only indexed by the message timestamp in Presto, This means that the query you ran above has to do an entire table scan. Since you are using a single BK, the opportunity for parallelism is very limited. Therefore your performance will be slow.
----
2019-10-04 15:35:20 UTC - Roman: So basically all I can do for the moment is to add more bookies?
----
2019-10-04 15:38:47 UTC - David Kjerrumgaard: Unfortunately for those queries, yes. I would suggest running a time-based query to see how fast it performs in comparison to the previous query.  We are also working on adding more indexes for Presto to use.
----
2019-10-04 15:44:31 UTC - Roman: Ok, I see. Guess I have to find another way to search through data as my use case requires to find all messages for a particular `session.id`. Maybe another job will do. Many thanks for the clarification.
----
2019-10-04 15:55:13 UTC - Matteo Merli: Correct, it’s not exposed yet, though it would be easy to add
----
2019-10-04 17:09:48 UTC - Tim Howard: Agreed. I have basically started and wanted to verify it didn't exist and then engage about doing the work.
----
2019-10-04 17:09:56 UTC - Tim Howard: Thanks for confirming.
----
2019-10-04 17:11:28 UTC - Tim Howard: I will try to email the list a bit later announcing that I am starting the work in case someone else comes along to start it?
----
2019-10-04 17:58:19 UTC - Matteo Merli: I’m creating a PR to add the support in C++ lib
----
2019-10-04 17:58:46 UTC - Tim Howard: oh, you are?
----
2019-10-04 17:59:00 UTC - Tim Howard: well- that exponentially reduces the work I have to do then.
----
2019-10-04 18:02:05 UTC - Tim Howard: do you have a branch I can use as a basis?
----
2019-10-04 18:10:35 UTC - Matteo Merli: Coming. Adding tests and I’ll push the PR
+1 : Tim Howard
----
2019-10-04 19:53:44 UTC - alexk777: @alexk777 has joined the channel
----
2019-10-04 21:15:27 UTC - Matteo Merli: <https://github.com/apache/pulsar/pull/5317>
----
2019-10-04 21:16:12 UTC - Tim Howard: outstanding, thanks
----
2019-10-04 21:33:56 UTC - Chris DiGiovanni: My team updated from Pulsar 2.3.0 to ~2.4.3~ 2.4.1.  Everything seemed to go well, though it looks like our brokers jetty server is randomly dying with the following error:

```
2019-10-04 13:58:12.292 [pulsar-web-29-5] WARN  org.eclipse.jetty.util.thread.strategy.EatWhatYouKill -
java.nio.channels.CancelledKeyException: null
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:71) ~[?:?]
        at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:130) ~[?:?]
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:473) ~[org.eclipse.jetty-jetty-io-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:351) ~[org.eclipse.jetty-jetty-io-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) [org.eclipse.jetty-jetty-util-9.4.12.v20180830.jar:9.4.12.v20180830]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
```
----
2019-10-04 21:34:39 UTC - Ali Ahmed: there is no “2.4.3” released
----
2019-10-04 21:34:44 UTC - Chris DiGiovanni: Sorry 2.4.1
----
2019-10-04 21:34:50 UTC - Chris DiGiovanni: I missed typed
----
2019-10-04 21:35:16 UTC - Ali Ahmed: can you create a github issue
----
2019-10-04 21:35:21 UTC - Chris DiGiovanni: Sure
----
2019-10-04 21:41:57 UTC - Chris Bartholomew: @Chris DiGiovanni I was seeing a similar problem. I think in 2.4.1 the broker can hit the jetty bug <https://github.com/eclipse/jetty.project/issues/3133>. The jetty version didn't change between 2.3.0 and 2.4.1, but this bug is only exposed in when debug logging is enabled. A change went into 2.4 that enabled debug logging by default: <https://github.com/apache/pulsar/pull/3661>. Since I manually reverted that change, I haven't see that broker failure.
----
2019-10-04 21:55:14 UTC - Chris DiGiovanni: Interesting... I'll look into those.
----
2019-10-04 21:55:48 UTC - Chris DiGiovanni: I created the issue: <https://github.com/apache/pulsar/issues/5318>
----
2019-10-04 21:59:25 UTC - Matteo Merli: Take a look at <https://github.com/apache/pulsar/pull/5079>
----
2019-10-04 22:05:20 UTC - Chris DiGiovanni: I just made the logging change on my brokers.  I'll comment on the issue if the broker survives over the weekend.
----
2019-10-04 22:20:53 UTC - Addison Higham: ```
16:08:22.204 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://restructure-test/cg-logs/iad-partition-1>] [consume-test] Closed Consumer (not connected)
16:08:22.206 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://restructure-test/cg-logs/iad-partition-0>] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://restructure-test/cg-logs/iad-partition-3>] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://restructure-test/cg-logs/iad-partition-2>] [consume-test] Closed Consumer (not connected)
16:08:22.207 [pulsar-external-listener-3-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [<persistent://restructure-test/cg-logs/iad-partition-4>] [consume-test] Closed Consumer (not connected)
16:08:22.208 [pulsar-external-listener-3-1] WARN  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - [<persistent://restructure-test/cg-logs/iad>] Failed to subscribe for topic [<persistent://restructure-test/cg-logs/iad>] in topics consumer, subscribe error: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
16:08:22.208 [pulsar-external-listener-3-1] WARN  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - Failed subscription for createPartitionedConsumer: <persistent://restructure-test/cg-logs/iad> 5, e:java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
16:08:22.209 [main] ERROR org.apache.pulsar.client.cli.PulsarClientTool - Error while consuming messages
16:08:22.210 [main] ERROR org.apache.pulsar.client.cli.PulsarClientTool - 7 lookup request timedout after ms 30000
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 7 lookup request timedout after ms 30000
	at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:247) ~[org.apache.pulsar-pulsar-client-api-2.4.1.jar:2.4.1]
	at org.apache.pulsar.client.impl.ConsumerBuilderImpl.subscribe(ConsumerBuilderImpl.java:91) ~[org.apache.pulsar-pulsar-client-original-2.4.1.jar:2.4.1]
...
```
not sure what is going on here... I can't create a subscription to a partitioned topic, and I also can't subscribe to a single partition. It appears to be a problem only in this namespace; it works fine in other namespaces. I confirmed my role has read/write permissions... the broker has log messages that say it is creating the subscription, it just never returns to the client
----
2019-10-04 23:04:04 UTC - Addison Higham: okay, a little more down the rabbit hole: I can't delete the topic, when I try and delete it it gives me a 500 error as it tries to delete all the topic partitions. I also tried manually deleting a a single partition of the topic and pointed directly at the assigned broker and it still failed. As best as I can tell, it is timing out somewhere and the exception is just swallowed so I get a 500
----
2019-10-04 23:10:02 UTC - Matteo Merli: can you take a jstack of the process?
----
2019-10-04 23:10:39 UTC - Matteo Merli: also, a heap dump will show the eventual exception, even if it gets eaten away
----
2019-10-04 23:13:53 UTC - Addison Higham: I have been periodically trying to see if I can figure out what is going on, as best as I can tell, it gets stuck here:
```
pulsar-web-33-3" #86 prio=5 os_prio=0 tid=0x00007f339ea36000 nid=0xcb waiting on condition [0x00007f333a921000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000ccfabf60> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.getTopicReference(PersistentTopicsBase.java:1346)
	at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopicForcefully(PersistentTopicsBase.java:312)
	at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteTopic(PersistentTopicsBase.java:533)
	at org.apache.pulsar.broker.admin.v2.PersistentTopics.deleteTopic(PersistentTopics.java:360)
	at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
	....
```
----
2019-10-04 23:14:05 UTC - Addison Higham: at least that is what consistently keeps showing up
----
2019-10-04 23:15:02 UTC - Addison Higham: 
----
2019-10-04 23:15:08 UTC - Addison Higham: ^^ full thread dump
----
2019-10-04 23:16:31 UTC - Addison Higham: so also... the reason I went all the down this path: I started a pulsar source (kinesis) and it output a few messages... and then stopped, it looked like the function was working just fine internally (i.e. filling up an internal buffer) but then read was just never getting called
----
2019-10-04 23:16:51 UTC - Addison Higham: then things got weirder with not being able to subscribe and now not being able to delete
----
2019-10-04 23:28:33 UTC - Addison Higham: a new topic in the same namespace has the same problem, but a new namespace has no issues. So I think there is something wrong the namespace. I created this namespace with some new terraform automation... and I am guessing something is weird there
----
2019-10-04 23:31:48 UTC - Addison Higham: okay yeah... all this is something weird with that namespace, I can't figure out what settings it doesn't like, we had used the `pulsar-admin` defaults of `0` for max-consumers/max-producers, which I am guess that default isn't correct and unset means something totally different
----
2019-10-04 23:33:27 UTC - Matteo Merli: That should mean unlimited 
----
2019-10-04 23:40:45 UTC - Addison Higham: :thinking_face: well I will try playing with settings and see if anything there can explain that behavior
----
2019-10-04 23:41:18 UTC - Addison Higham: but a new namespace created by the CLI (instead of by our terraform automation) has this totally working with the kinesis source, which is what sent me down this crazy path in the first place
----
2019-10-05 08:58:11 UTC - Poule: is the kubernetes version required to be 1.10.5 or it's just an example in the doc:
```
minikube start --memory=8192 --cpus=4 \
    --kubernetes-version=v1.10.5
```
----
2019-10-05 09:00:59 UTC - Ali Ahmed: it’s just the example
hugging_face : Poule
----