Posted to users@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2018/05/15 09:11:02 UTC
Slack digest for #general - 2018-05-15

2018-05-14 13:29:22 UTC - Byron: Good morning folks. This is my first time running Bookkeeper.. and my 3-node test cluster ran out of space on two of the nodes (i believe the ledgers directory). So the two nodes are failing to startup as a result which makes the cluster basically inaccessible. I am curious how one can recover from this situation if I am unable to increase the volume size for those two nodes?
----
2018-05-14 14:16:30 UTC - Ivan Kelly: they aren't starting in readonly mode?
----
2018-05-14 14:17:23 UTC - Byron: Does not appear to be..
----
2018-05-14 14:17:29 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAQCADZQX/-.txt|Untitled>
----
2018-05-14 14:18:41 UTC - Ivan Kelly: it's the journal directory that's full
----
2018-05-14 14:18:53 UTC - Byron: correct
----
2018-05-14 14:18:56 UTC - Ivan Kelly: could you upload the whole log somewhere?
----
2018-05-14 14:19:09 UTC - Byron: error log?
----
2018-05-14 14:19:15 UTC - Ivan Kelly: bookie.log
----
2018-05-14 14:20:06 UTC - Ivan Kelly: this looks similar to something else we saw recently, and i recall the root cause was logged earlier in log
----
2018-05-14 14:20:55 UTC - Byron: hm. ok i am running in kubernetes.. with a `ReadWriteOnce` volume. i will see if i can remount the volume to get the log
----
2018-05-14 14:22:13 UTC - Ivan Kelly: how did you get that snippet? I'm not overly familiar with k8s
----
2018-05-14 14:22:26 UTC - Byron: that is the stderr log
----
2018-05-14 14:22:51 UTC - Byron: it's possible bookie.log is redirected there?
----
2018-05-14 14:23:36 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAPB1AHLJ/-.sh|Untitled>
----
2018-05-14 14:23:50 UTC - Byron: that is the full start to end output
----
2018-05-14 14:24:55 UTC - Byron: the bookie is being started from the `apachepulsar/pulsar` docker image in case that is relevant
----
2018-05-14 14:27:46 UTC - Byron: i see `readOnlyModeEnabled=true` in the default bookkeeper.conf file. maybe something weird is happening with the env variables overriding the config
----
2018-05-14 14:29:41 UTC - Ivan Kelly: looking
----
2018-05-14 14:33:02 UTC - Ivan Kelly: what version of pulsar is this?
----
2018-05-14 14:34:29 UTC - Ivan Kelly: <https://github.com/apache/bookkeeper/issues/1349> <- yup, there's an outstanding issue for this in bookkeeper. i guess you don't have access to the disk in question?
----
2018-05-14 16:02:56 UTC - Byron: 1.22
----
2018-05-14 16:11:12 UTC - Byron: i see bookkeeper on this image is 4.3.1.91
----
2018-05-14 16:14:06 UTC - Matteo Merli: @Byron the read-only mode only applies to the “storage” device. When that disk is full (actually, when it reaches 95%) the bookie turns itself into read-only mode.

For Journal device unfortunately there’s currently no such check. The main reason is that typically the storage amount on journal device is fixed (~10GB) and doesn’t grow above that.
----
2018-05-14 16:14:49 UTC - Matteo Merli: In your case, do you have both directories on the same disk ?
----
2018-05-14 16:15:43 UTC - Byron: I have a separate ledgers and journal volume
----
2018-05-14 16:16:07 UTC - Matteo Merli: Good, and how big is the journal volume?
----
2018-05-14 16:16:13 UTC - Matteo Merli: One thing to note is that by default bookkeeper keeps the last 5 journals, even though all the data was already flushed and indexed
----
2018-05-14 16:16:52 UTC - Byron: only 5 Gi for journal and 10 Gi for ledgers per node (3)
----
2018-05-14 16:16:58 UTC - Matteo Merli: that can be configured `journalMaxBackups=5`
----
2018-05-14 16:17:20 UTC - Matteo Merli: ok, if you set `journalMaxBackups=0` that 5Gb should not get filled up
----
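Editor's sketch: one way to apply the `journalMaxBackups=0` change to a properties-style file such as `bookkeeper.conf`. The helper name `set_property` is hypothetical and is not part of Pulsar or BookKeeper; the file is assumed to consist of simple `key=value` lines with `#` comments. How the Docker image maps environment variables onto the config is not covered here.

```python
def set_property(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key` set to `value`.

    An existing, uncommented `key=...` line is replaced in place;
    if none is found, the setting is appended at the end.
    """
    out = []
    replaced = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        # Compare only the part before the first '=' so values are ignored.
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "readOnlyModeEnabled=true\njournalMaxBackups=5\n"
    print(set_property(sample, "journalMaxBackups", "0"), end="")
```

As the rest of the thread shows, changing the setting alone does not free space on a volume that is already full; the bookie has to start before it can purge old journals.
----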
2018-05-14 16:17:23 UTC - Byron: again this is a test instance.. but i am more interested in figuring how to deal with these issues now before going to production
----
2018-05-14 16:17:39 UTC - Byron: ok
----
2018-05-14 16:17:44 UTC - Matteo Merli: to get out of the woods: you can delete a few of the old journal files
----
2018-05-14 16:17:55 UTC - Byron: and then add more bookies presumably?
----
2018-05-14 16:18:05 UTC - Byron: to distribute the data?
----
2018-05-14 16:18:34 UTC - Matteo Merli: you don’t necessarily need more bookies
----
2018-05-14 16:19:03 UTC - Byron: i am just saying if i wanted to support more storage in the future
----
2018-05-14 16:19:04 UTC - Matteo Merli: if you change the setting to `journalMaxBackups=0` and restart, the bookies should be fine
----
2018-05-14 16:19:15 UTC - Byron: not to fix the current problem
----
2018-05-14 16:19:27 UTC - Matteo Merli: oh, then sure
----
2018-05-14 16:19:44 UTC - Byron: alright going to set that config and restart the pods
----
2018-05-14 16:22:56 UTC - Byron: hm still failing to start up due to the out of space error
----
2018-05-14 16:23:16 UTC - Byron: i wonder if it is trying to do writes before checking that option and purging old data
----
2018-05-14 16:23:17 UTC - Matteo Merli: yes, you need to delete some of the old journals
----
2018-05-14 16:23:55 UTC - Matteo Merli: with the previous config, it was trying to keep up to 5 journal files (each is 2GB)
----
2018-05-14 16:24:13 UTC - Byron: ok, so changing the config will not autopurge existing ones
----
2018-05-14 16:24:14 UTC - Matteo Merli: that data is already flushed, so there’s no risk
----
2018-05-14 16:24:37 UTC - Matteo Merli: > ok, so changing the file will not autopurge existing ones

I think that only works once it’s up :slightly_smiling_face:
----
2018-05-14 16:24:41 UTC - Byron: right
----
2018-05-14 16:25:39 UTC - Byron: hm. i guess this is a challenge with persistent volumes.. how to access them outside of the main pod. i guess i can attach them to a different temp pod, delete the backups then spin up the existing pods
----
2018-05-14 16:26:55 UTC - Matteo Merli: ouch, good point. you can try to change the spec to add a sleep 300 before the actual command
----
2018-05-14 16:27:33 UTC - Byron: is there a bookie shell command to run?
----
2018-05-14 16:27:45 UTC - Byron: i can change the pod command to use that instead of starting the bookie server
----
2018-05-14 16:27:52 UTC - Byron: as a one-off
----
2018-05-14 16:28:05 UTC - Byron: rather.. container command
----
2018-05-14 16:30:28 UTC - Byron: or can i just delete the files in the `journal/` directory
----
2018-05-14 16:31:03 UTC - Matteo Merli: delete the files, you can just delete the oldest `1323213.txn` file
----
2018-05-14 16:31:23 UTC - Matteo Merli: since that’s just a backup file
----
2018-05-14 16:42:13 UTC - Byron: hm. now i am getting an exception that the journal file is missing and it can’t recover
----
2018-05-14 16:42:21 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAQB79DLN/-.txt|Untitled>
----
2018-05-14 16:42:30 UTC - Byron: but the one-off command worked at least
----
2018-05-14 16:42:37 UTC - Matteo Merli: Uhm, how many files were there?
----
2018-05-14 16:43:23 UTC - Byron: there were 4 or 5 `.txn` files
----
2018-05-14 16:43:40 UTC - Byron: i deleted all but the most recent. probably a bad idea. i assumed they were independent
----
2018-05-14 16:44:06 UTC - Matteo Merli: probably the last 2 were the ones still used
----
2018-05-14 16:45:13 UTC - Matteo Merli: it’s based on a marker file `lastMark` in the storage directory
----
2018-05-14 16:46:10 UTC - Byron: ah and there is a `shell lastmark` command
----
2018-05-14 16:46:30 UTC - Matteo Merli: yes, I forgot about that one
----
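Editor's sketch: how the `lastMark` file could be used to decide which journal files are old enough to delete by hand, instead of guessing. It rests on assumptions that are not stated in the thread and should be checked against your BookKeeper version: that `lastMark` holds two big-endian 64-bit integers (journal id, then offset), and that journal files are named with the hexadecimal journal id plus `.txn`. The function names are hypothetical. The `shell lastmark` command mentioned above is the supported way to read the mark.

```python
import struct
from pathlib import Path


def read_last_mark(last_mark_path):
    """Parse a lastMark file as (journal_id, offset).

    Assumption: the file starts with two big-endian signed 64-bit integers.
    """
    data = Path(last_mark_path).read_bytes()
    journal_id, offset = struct.unpack(">qq", data[:16])
    return journal_id, offset


def journals_older_than_mark(journal_dir, mark_journal_id):
    """List .txn files whose id is strictly below the marked journal id.

    Assumption: the file name (without .txn) is the journal id in hex.
    Files at or after the mark are never returned, since they may still
    be needed for recovery.
    """
    older = []
    for path in Path(journal_dir).glob("*.txn"):
        try:
            journal_id = int(path.stem, 16)
        except ValueError:
            continue  # not a journal file name we understand; leave it alone
        if journal_id < mark_journal_id:
            older.append(path)
    return sorted(older, key=lambda p: int(p.stem, 16))
```

If a bookie has several ledger directories, each may carry its own mark; the conservative choice is to use the smallest journal id among them. The sketch only lists candidates and deletes nothing.
----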
2018-05-14 16:46:49 UTC - Byron: ok good to know for the future. but the data is clearly borked now. is there a way to reset a bookie?
----
2018-05-14 16:47:14 UTC - Byron: bookieformat
----
2018-05-14 16:47:15 UTC - Byron: ?
----
2018-05-14 16:47:21 UTC - Matteo Merli: `bin/bookkeeper shell bookieformat -deleteCookie`
----
2018-05-14 16:47:25 UTC - Byron: cool
----
2018-05-14 16:47:26 UTC - Byron: thanks
----
2018-05-14 16:49:26 UTC - Matteo Merli: No problem. In any case I think we should refuse to start the bookie at the very beginning, if the disk size is < (2GB * 5)
----
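Editor's sketch: the kind of pre-flight check suggested above, written as a standalone script rather than as BookKeeper code. The figures come from the thread (up to 5 retained journals of about 2GB each); the function names and the decision to raise an error are hypothetical. Journals still being written need space on top of the retained backups, so this is a lower bound.

```python
import shutil

GB = 1000 ** 3


def min_journal_volume_bytes(max_backups, journal_file_bytes):
    """Space needed to hold the retained journal backups alone."""
    return max_backups * journal_file_bytes


def journal_volume_is_large_enough(total_bytes, max_backups=5,
                                   journal_file_bytes=2 * GB):
    """True if the volume can hold all retained journal backups."""
    return total_bytes >= min_journal_volume_bytes(max_backups,
                                                   journal_file_bytes)


def check_journal_dir(path, max_backups=5, journal_file_bytes=2 * GB):
    """Raise if the filesystem holding `path` is too small for the backups."""
    total = shutil.disk_usage(path).total
    if not journal_volume_is_large_enough(total, max_backups,
                                          journal_file_bytes):
        needed = min_journal_volume_bytes(max_backups, journal_file_bytes)
        raise RuntimeError(
            f"journal volume at {path} has {total} bytes, "
            f"but {needed} bytes are needed for {max_backups} backups"
        )
```

With the 5 Gi journal volume from this thread and the default of 5 backups, the check fails; with `journalMaxBackups=0` it passes.
----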
2018-05-14 16:58:57 UTC - Byron: back in business
----