Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/05/18 09:11:02 UTC
Slack digest for #general - 2018-05-18

2018-05-17 09:15:45 UTC - Xiaolin Zhang: @Xiaolin Zhang has joined the channel
----
2018-05-17 15:01:19 UTC - Igor Zubchenok: Hello guys!

We're connecting pulsar client to a single broker by URL. If this broker goes down, pulsar client does not try to connect to other working brokers. I expected Pulsar client to do this.
What guidelines do you have to handle this case?
----
2018-05-17 15:03:18 UTC - Matteo Merli: A common way is to either use a VIP load balancer for service discovery (if available) or set up a DNS name that resolves to the list of IPs of the brokers
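
As a rough illustration of the DNS approach, the Python sketch below lists every address that a single broker name resolves to; a client given that one name then has several brokers it can reach. The hostname `pulsar-brokers.example.com` and the helper names are hypothetical, and 6650 is assumed to be the broker's binary port (the Pulsar default).

```python
import socket


def resolve_broker_addresses(hostname, port=6650, resolver=socket.getaddrinfo):
    """Return the distinct IP addresses that `hostname` resolves to, in order."""
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def service_url(hostname, port=6650):
    """Build the single service URL handed to the client (hypothetical helper)."""
    return f"pulsar://{hostname}:{port}"


if __name__ == "__main__":
    # "localhost" stands in for a name such as pulsar-brokers.example.com
    # that has one A record per broker.
    print(service_url("localhost"), resolve_broker_addresses("localhost"))
```

If the name returns only one address, the client is in the same position as when it is pointed at a single broker directly.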
----
2018-05-17 15:13:12 UTC - Igor Zubchenok: We'll try this to 'setup a DNS name that resolve to the list of IPs of the brokers'

However, we finally broke Pulsar and got this exception:
org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
----
2018-05-17 15:14:19 UTC - Igor Zubchenok: Bookkeeper has many exceptions in logs after restarting main broker.
----
2018-05-17 15:15:47 UTC - Vasily Yanov: exceptions:
```
2018-05-17 15:14:36,509 - WARN  [bookkeeper-ml-workers-38-1:ServerCnx@650] - [/1.1.1.1:32945][<persistent://server-ali-t2-1526567437134/prod-pulsar-cluster-1/session_queue/7092e9f8-80a1-4769-8a89-76c682a38a73>][sender] Failed to create consumer: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
```
----
2018-05-17 15:17:35 UTC - Vasily Yanov: ```
2018-05-17 15:14:51,507 - ERROR [BookKeeperClientWorker-23-1:PersistentDispatcherSingleActiveConsumer@323] - [<persistent://server-ali-t2-1526567437134/prod-pulsar-cluster-1/session_init/1ab5aea6-bebc-472d-b4b2-7492f3af44a4> / sender-Consumer{subscription=PersistentSubscription{topic=<persistent://server-ali-t2-1526567437134/prod-pulsar-cluster-1/session_init/1ab5aea6-bebc-472d-b4b2-7492f3af44a4>, name=sender}, consumerId=1746, consumerName=38bc2, address=/1.1.1.1:32945}] Error reading entries at 1092669:1 : Bookie operation timeout - Retrying to read in 15.0 seconds
2018-05-17 15:15:06,507 - WARN  [BookKeeperClientWorker-22-1:PendingAddOp@238] - Write did not succeed: L1101172 E0 on 1.1.1.1:3181, rc = -23
2018-05-17 15:15:06,507 - WARN  [BookKeeperClientWorker-22-1:RackawareEnsemblePlacementPolicy@553] - Failed to choose a bookie: excluded [<Bookie:1.1.1.1:3181>, <Bookie:3.3.3.3:3181>, <Bookie:2.2.2.2:3181>], fallback to choose bookie randomly from the cluster.
```
----
2018-05-17 15:18:35 UTC - Vasily Yanov: ```
2018-05-17 15:17:21,509 - WARN  [BookKeeperClientWorker-25-1:PendingAddOp@238] - Write did not succeed: L1100903 E418 on 2.2.2.2:3181, rc = -23
2018-05-17 15:17:21,509 - WARN  [BookKeeperClientWorker-25-1:LedgerHandle@919] - Write did not succeed to 2.2.2.2:3181, bookieIndex 0, but we have already fixed it.
2018-05-17 15:17:21,509 - WARN  [BookKeeperClientWorker-25-1:PendingAddOp@238] - Write did not succeed: L1100903 E421 on 2.2.2.2:3181, rc = -23
2018-05-17 15:17:21,509 - WARN  [BookKeeperClientWorker-25-1:LedgerHandle@919] - Write did not succeed to 2.2.2.2:3181, bookieIndex 0, but we have already fixed it.
```
----
2018-05-17 15:32:38 UTC - Igor Zubchenok: We're trying to reproduce it again and send you all logs (zk+bk+broker from all 3 nodes) with INFO level
----
2018-05-17 15:42:37 UTC - Sijie Guo: @Igor Zubchenok do you mind describing the sequence on how this happens?
----
2018-05-17 15:49:38 UTC - Igor Zubchenok: - we set up 3 nodes: pulsar-01, pulsar-02, pulsar-03
- we start pulsar-03, then pulsar-02, then pulsar-01
- then our pulsar client connects to pulsarbroker-03 directly (no VIP load balancing or multiple IP addresses in DNS)
- then we stop pulsarbroker-01 wait 30 seconds, start pulsarbroker-01
- then we wait 2 minutes, stop pulsarbroker-02 wait 30 seconds, start pulsarbroker-02
- then we wait 2 minutes, stop pulsarbroker-03
- here our instance has a lot of exceptions
- we wait 30 seconds, start pulsarbroker-03
- our instance cannot work even after restart until we create other topics/properties
----
2018-05-17 15:59:41 UTC - Sijie Guo: a couple of more questions:

- are you using default configuration, basically replication settings 2/2/2?
- the WARN logging seems to be normal when you kill a pulsar instance, because it basically tries to write the entries to the pulsar broker (bookie) you stopped, and it does ensemble changes.
- the last step seems a bit unusual. “instance cannot work until create another topics”? can you describe more about “instance cannot work”?
----
2018-05-17 16:05:02 UTC - Vasily Yanov: @Sijie Guo
1. it should be, because we didn't change anything in bookkeeper.conf except zk hosts and journalSyncData
2. ok
----
2018-05-17 16:05:37 UTC - Igor Zubchenok: 3. my instance cannot work cause I get `org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger`
note: we stop/start only broker during testing.
P.S. we've failed to reproduce the issue with the steps above, but we've deleted more than 40GB of bookkeeper and zookeeper data.
----
2018-05-17 16:05:41 UTC - Vasily Yanov: btw: what parameters should I check in order to be sure about replication values?
----
2018-05-17 16:15:31 UTC - Sijie Guo: @Vasily Yanov in the broker conf, managedLedgerDefaultEnsembleSize / managedLedgerDefaultWriteQuorum / managedLedgerDefaultAckQuorum
----
2018-05-17 16:16:03 UTC - Sijie Guo: > we stop/start only broker during testing.

so you run bookies and brokers as separate processes, or in same process?
----
2018-05-17 16:17:04 UTC - Vasily Yanov: ```
cat /opt/pulsar/conf/broker.conf | grep -E "managedLedgerDefaultEnsembleSize|managedLedgerDefaultWriteQuorum|managedLedgerDefaultAckQuorum"
managedLedgerDefaultEnsembleSize=2
managedLedgerDefaultWriteQuorum=2
managedLedgerDefaultAckQuorum=2
```
----
2018-05-17 16:17:11 UTC - Sijie Guo: > my instance cannot work cause I get `org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger`

interesting. I am wondering if that’s transient errors. does the client succeed after retries?
----
2018-05-17 16:17:24 UTC - Sijie Guo: @Vasily Yanov thank you
----
2018-05-17 16:18:18 UTC - Vasily Yanov: no
----
2018-05-17 16:19:07 UTC - Vasily Yanov: how can I check whether bookie and broker run as separate processes or not?
----
2018-05-17 16:25:27 UTC - Sijie Guo: how do you start the pulsar brokers?
----
2018-05-17 16:25:58 UTC - Igor Zubchenok: no, it does not succeed, we tried several times
----
2018-05-17 16:36:40 UTC - Vasily Yanov: as systemd unit:
----
2018-05-17 16:36:51 UTC - Vasily Yanov: ```
[Unit]

Description=Apache Pulsar
Documentation=https://pulsar.incubator.apache.org/
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
#ExecStart=/opt/pulsar/bin/pulsar-daemon start broker
ExecStart=/opt/pulsar/bin/pulsar broker
ExecStop=/opt/pulsar/bin/pulsar-daemon stop broker
Restart=on-failure
SyslogIdentifier=broker
LimitNOFILE=64536
LimitNPROC=8192

[Install]
WantedBy=multi-user.target
```
----
2018-05-17 16:42:28 UTC - Karthik Palanivelu: Hello There, I am trying to build pulsar in docker. I am trying to bring up 2 bookies on ports 3181 and 3182 respectively on the same host. It is failing with the exception below. Can you please help with how I can have multiple bookies on the same host:
----
2018-05-17 16:45:41 UTC - Karthik Palanivelu: @Karthik Palanivelu uploaded a file: <https://apache-pulsar.slack.com/files/U7VRE0Q1G/FASJYB0HL/-.m|Untitled>
----
2018-05-17 16:54:50 UTC - Karthik Palanivelu: Hi, Parameter does not work on RHEL, you need to wrap it in another script like below:

```
#!/bin/bash

export JAVA_HOME=/opt/jdk1.8
/opt/pulsar/bin/pulsar broker
```
----
2018-05-17 17:10:15 UTC - Vasily Yanov: Hi! I think it's not our case.
----
2018-05-17 17:40:44 UTC - Ali Ahmed: @Karthikeyan Palanivelu How are you configuring the bookies ? are you using docker compose ?
----
2018-05-17 17:45:25 UTC - Matteo Merli: @Karthikeyan Palanivelu Are they sharing the same disk paths?
----
2018-05-17 17:46:39 UTC - Matteo Merli: BookKeeper has a mechanism (called “cookies”) to ensure a bookie's advertised name matches the data it contains. If there are discrepancies, it refuses to start up
----
2018-05-17 17:47:23 UTC - Matteo Merli: in this case it looks like one bookie is trying to start with the data that is supposed to belong to the other bookie
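
The cookie idea can be pictured with a toy model. The Python sketch below is not BookKeeper's implementation; the `Cookie` fields and `verify_cookie` are assumptions made for illustration. It only shows the principle: the identity recorded alongside the data has to match the identity the bookie starts with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cookie:
    # Hypothetical, simplified fields for illustration only.
    bookie_id: str
    journal_dir: str
    ledger_dirs: tuple


def verify_cookie(stored, current):
    """Refuse to start if the data on disk was written under another identity."""
    if stored is None:
        return  # first start on empty directories: nothing to compare against
    if stored != current:
        raise ValueError(
            f"cookie mismatch: data belongs to {stored.bookie_id}, "
            f"but this bookie is {current.bookie_id}"
        )


if __name__ == "__main__":
    on_disk = Cookie("host-a:3181", "/data/journal", ("/data/ledgers",))
    starting = Cookie("host-a:3182", "/data/journal", ("/data/ledgers",))
    try:
        verify_cookie(on_disk, starting)
    except ValueError as err:
        print(err)
```

Two bookies pointed at the same directories hit exactly this kind of mismatch, because the second one finds data recorded under the first one's identity.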
----
2018-05-17 17:48:12 UTC - Sijie Guo: @Vasily Yanov it seems that this script only starts the broker. do you have a separate script to start the bookie?
----
2018-05-17 17:48:29 UTC - Vasily Yanov: yes
----
2018-05-17 17:49:03 UTC - Vasily Yanov: ```
[Unit]

Description=Bookkeeper
Documentation=something realy strange
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/opt/pulsar/bin/bookkeeper bookie
#ExecStart=/opt/pulsar/bin/pulsar-daemon start bookie
ExecStop=/opt/pulsar/bin/pulsar-daemon stop bookie
Restart=on-failure
SyslogIdentifier=bookkeeper
LimitNOFILE=64536
LimitNPROC=8192

[Install]
WantedBy=multi-user.target
```
----
2018-05-17 17:49:06 UTC - Sijie Guo: oh so you started bookie and broker separately, and during your tests, you only kill brokers?
----
2018-05-17 17:49:13 UTC - Vasily Yanov: yes
----
2018-05-17 17:49:17 UTC - Vasily Yanov: right
----
2018-05-17 17:49:24 UTC - Sijie Guo: that’s interesting.
----
2018-05-17 17:49:57 UTC - Vasily Yanov: exactly. Only brokers were affected with systemctl stop|start
----
2018-05-17 17:50:19 UTC - Sijie Guo: what is the hardware of these 3 nodes? like number of cpus, number of disks, memory size?
----
2018-05-17 17:50:58 UTC - Vasily Yanov: 8xCPU, 32Gb RAM, 2x2Tb HDD in RAID1
----
2018-05-17 17:51:57 UTC - Vasily Yanov: brokers and bookies start with:
-Xms4g -Xmx8g -XX:MaxDirectMemorySize=8g
----
2018-05-17 18:50:37 UTC - Sijie Guo: @Vasily Yanov interesting. the hardware settings and jvm settings seem to be good. and since you only start/stop brokers, reads from bookies shouldn’t time out, that’s a bit strange. unless starting/stopping brokers impacts the disks. is the RAID1 used only by pulsar?
----
2018-05-17 18:52:12 UTC - Vasily Yanov: Yes. It is used only by pulsar/bookkeeper/zookeeper
----
2018-05-17 18:53:34 UTC - Sijie Guo: do you have any monitoring mechanisms to see what’s happening around network/disks?
----
2018-05-17 18:56:03 UTC - Karthik Palanivelu: @Ali Ahmed @Matteo Merli I am not using docker compose. Yes, I am trying to assign two bookies on the same host to the same data dir. How could I give each bookie its own path? I prefer to keep the base path /prod/data/, inside which the bookies should write their data.
----
2018-05-17 19:00:43 UTC - Sijie Guo: @Karthikeyan Palanivelu: you can create a subdirectory /prod/data/bookie-x for each bookie, for example /prod/data/bookie-3181 and /prod/data/bookie-3182. then when you start the docker container, pass the environment variables journalDirectory=/prod/data/bookie-x/journal and ledgerDirectories=/prod/data/bookie-x/ledgers
----
2018-05-17 19:01:47 UTC - Sijie Guo: so each bookie will use its separated directory, those two environment variables will configure the bookie docker process to use different directories.
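
The per-bookie layout described here can be sketched in Python as follows. The helper names are made up for illustration; the base path and ports come from the conversation, and the `-e` pairs are what one would pass to `docker run` for each bookie container.

```python
import posixpath


def bookie_dirs(base_path, bookie_port):
    """Return separate journal and ledger directories for one bookie."""
    root = posixpath.join(base_path, f"bookie-{bookie_port}")
    return {
        "journalDirectory": posixpath.join(root, "journal"),
        "ledgerDirectories": posixpath.join(root, "ledgers"),
    }


def docker_env_flags(settings):
    """Render the settings as a flat list of `-e key=value` arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["-e", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    for port in (3181, 3182):
        print(port, " ".join(docker_env_flags(bookie_dirs("/prod/data/", port))))
```

Because each port maps to its own subdirectory, neither bookie ever sees the other's data.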
----
2018-05-17 19:01:57 UTC - Sijie Guo: does that address your requirement?
----
2018-05-17 19:02:19 UTC - Karthik Palanivelu: Yes Cool That works for me...
heavy_check_mark : Sijie Guo
----
2018-05-17 19:02:53 UTC - Sijie Guo: that’s interesting. after that it never succeeds, or it eventually succeeds?
----
2018-05-17 19:03:06 UTC - Karthik Palanivelu: One more question, how can I associate the data created by bookie-1 to bookie-2 when bookie-1 is dead?
----
2018-05-17 19:04:34 UTC - Karthik Palanivelu: Or is that even advisable?
----
2018-05-17 19:04:38 UTC - Ali Ahmed: I don’t think it’s advisable
----
2018-05-17 19:05:30 UTC - Ali Ahmed: you generally don’t re associate node data, you keep enough nodes with replicas to tolerate failures
----
2018-05-17 19:06:56 UTC - Karthik Palanivelu: Ok Cool got it.
----
2018-05-17 19:11:25 UTC - Karthik Palanivelu: Related to the above question/answer, if some residual data is left behind after a few containers die, do I need to clean up the disk eventually?
----
2018-05-17 19:21:15 UTC - Karthik Palanivelu: @Karthik Palanivelu uploaded a file: <https://apache-pulsar.slack.com/files/U7VRE0Q1G/FARJR4DNG/-.xml|Untitled> and commented: Do we have this feature built in for Pulsar instance of BookKeeper?
----
2018-05-17 19:29:32 UTC - Sijie Guo: @Karthikeyan Palanivelu 

- wondering where the text is from? it seems to be outdated. e.g. BookKeeperTools has been removed and the new command is `bin/bookkeeper shell recover`; bookies now support adding a new disk on-the-fly; and so on.

> Do we have this feature built in for Pulsar instance of BookKeeper?

It should be also available in the shell script shipped as part of pulsar. `$ bin/bookkeeper shell recover`

back to your original question:

> If I get a residue of data left after being few containers are dead, do I need to clean up the disk eventually?

do you need this data? if you need this data, you don’t need to do anything, just relaunch your docker process.

if you don’t need this data, you can just simply wipe out the data by removing the directory; or use the tool `bin/bookkeeper shell bookieformat`
----
2018-05-17 19:32:18 UTC - Ali Ahmed: @Karthikeyan Palanivelu I think you may be looking at twitter bookkeeper which is an old repo
----
2018-05-17 19:32:29 UTC - Ali Ahmed: the new location is here ```<https://github.com/apache/bookkeeper>```
----
2018-05-17 21:08:27 UTC - Karthik Palanivelu: Oh sure got it. Let me try this option. Reason is in case the docker IP changes on the host upon restart/start  we should have a means to associate the data with Bookie.
----
2018-05-17 21:08:53 UTC - Sijie Guo: oh i see
----
2018-05-17 21:09:33 UTC - Sijie Guo: you actually can configure advertisedAddress, and you can probably use the host IP as the advertisedAddress
----
2018-05-17 21:09:48 UTC - Sijie Guo: it is able to do it in k8s. so I assume it is doable using plain docker
----
2018-05-17 21:10:29 UTC - Sijie Guo: I am not sure if anyone in the slack channel knows how docker can use the host IP. if anyone has a quick answer please help, otherwise I can look around.
----
2018-05-17 21:12:20 UTC - Matteo Merli: It should be something like : 

```
docker run -e advertisedAddress=1.2.3.4 -p 3181:3181 apachepulsar/pulsar bash -c "bin/apply-config-from-env.py conf/bookkeeper.conf && bin/pulsar bookie"
```
+1 : Sijie Guo
----
2018-05-17 21:13:09 UTC - Matteo Merli: Haven’t really tried the precise command..
----
2018-05-17 21:14:07 UTC - Matteo Merli: point is that you just need to pass env variable inside the container and then you can use the `apply-config-from-env.py` to have the values replaced in the config files
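
The principle behind that step can be sketched in a few lines of Python. This is a simplified illustration and not the actual `apply-config-from-env.py` script: the function name is hypothetical, and it only overrides keys that already appear as uncommented `key=value` lines.

```python
def apply_env_overrides(conf_text, env):
    """Replace the value of each existing key=value line whose key is in env."""
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in env:
                line = f"{key}={env[key]}"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    import os

    sample = "# bookie settings\nadvertisedAddress=\nbookiePort=3181\n"
    # In a container, os.environ would carry values passed with `docker run -e`.
    print(apply_env_overrides(sample, dict(os.environ, advertisedAddress="1.2.3.4")))
```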
----
2018-05-17 22:34:16 UTC - Igor Zubchenok: We've just released our Pulsar based solution to production.
----
2018-05-17 22:34:21 UTC - Igor Zubchenok: :slightly_smiling_face:
passenger_ship : Matteo Merli, Ali Ahmed, Sijie Guo, Jerry Peng, Jon Bock, Guillaume LECROC
----
2018-05-17 22:35:39 UTC - Ali Ahmed: Would you like to add a logo to apache pulsar’s website ?
----
2018-05-17 22:38:23 UTC - Igor Zubchenok: Why not, could you send svg?
----
2018-05-17 22:41:28 UTC - Ali Ahmed: it’s in the pulsar repo “./site/img/pulsar.svg”
----
2018-05-17 22:41:45 UTC - Ali Ahmed: @Ali Ahmed uploaded a file: <https://apache-pulsar.slack.com/files/U6EHQ91KM/FARUUQSSW/pulsar.svg|pulsar.svg>
----
2018-05-18 01:05:17 UTC - Karthik Palanivelu: @Sijie Guo @Matteo Merli Will try this option and will get back to you. Thanks and Appreciate your time.
----
2018-05-18 04:24:23 UTC - Anand Ranganathan: @Anand Ranganathan has joined the channel
----
2018-05-18 08:00:46 UTC - Marco Didonna: @Marco Didonna has joined the channel
----
2018-05-18 08:18:44 UTC - Vasily Yanov: @Sijie Guo I have zabbix, but I saw nothing strange in the disk/CPU state at the mentioned moment. Ok. Will continue my investigation
----