
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/06/21 17:52:32 UTC

[GitHub] [pulsar] pushkar-engagio opened a new issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

pushkar-engagio opened a new issue #7328:
URL: https://github.com/apache/pulsar/issues/7328

**Describe the bug**
A prolonged failure of a single bookie eventually takes the entire cluster down.

**To Reproduce**
Steps to reproduce the behavior:
I had 6 bookies in the cluster (500 GB journal storage, 1 TB ledger storage). One of the bookies failed and could not be restarted. Once the cluster detected the downed bookie, it started the recovery process for under-replicated ledgers. The ledgers replicated fine for a few minutes, but during recovery another bookie went down (the service was still running on that node, but the bookie had been dropped from the cluster, i.e. it did not show up in either the read-only or the read-write bookie list). This caused additional ledgers to become under-replicated. The process continued until I was down to a single BookKeeper node, which took down the entire cluster.

**Expected behavior**
When a bookie fails, the cluster should re-replicate the under-replicated ledgers from the downed bookie, so that another BookKeeper node can be added to replace it.

**Desktop (please complete the following information):**
- Pulsar version: 2.3.0
- Operating system: Amazon Linux 2
- Java version: openjdk version "1.8.0_222"

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] wmccarley commented on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

wmccarley commented on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-649042142

@pushkar-engagio You mention that you are using Amazon Linux 2, but not which EC2 instance type. I have noticed that when running a BookKeeper cluster in AWS, if the instance type and config are *just barely* sufficient for the IO throughput of normal operation, sudden spikes can cause processes to crash. Once a bookie crashes, auto-recovery generates additional IO on its peers and you get cascading failures. Even if the bookie process is set up as a service and comes back up, the bouncing up and down will create tons of under-replicated ledgers and you will be hosed. The easiest solution is to make sure your bookies are powerful enough to handle whatever you can throw at them, and then some.
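
The cascade described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the function name, the even redistribution of load, the fixed per-bookie recovery overhead, and every number are hypothetical, not measurements from this cluster or behavior taken from BookKeeper itself.

```python
def simulate_cascade(num_bookies, capacity_iops, baseline_iops, recovery_iops):
    """Toy model of cascading bookie failures (hypothetical, not BookKeeper logic).

    Assumes the cluster's total baseline load is spread evenly over the live
    bookies, and that each live bookie pays a fixed extra recovery load while
    under-replicated ledgers are being copied. Returns how many bookies are
    still alive once no live bookie is over capacity.
    """
    total_load = num_bookies * baseline_iops
    alive = num_bookies - 1  # the initial hardware failure
    while alive > 0:
        per_bookie = total_load / alive + recovery_iops
        if per_bookie <= capacity_iops:
            break  # survivors can absorb the load; the cascade stops
        alive -= 1  # an overloaded bookie drops out, raising load on the rest
    return alive


# Bookies sized "just barely" for normal load: the cascade runs to the end.
print(simulate_cascade(6, capacity_iops=3000, baseline_iops=2400, recovery_iops=500))  # 0
# Bookies with ample headroom: only the initial failure is lost.
print(simulate_cascade(6, capacity_iops=6000, baseline_iops=2400, recovery_iops=500))  # 5
```

The point of the sketch is that headroom has to cover both the redistributed load and the recovery traffic, not just normal operation.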

You also mention that you are using 500 GB for your journal and 1 TB for your ledgers, I assume on EBS volumes. 500 GB is probably much more than you need for the journal: with the BookKeeper defaults of journalMaxSizeMB = 2048 and journalMaxBackups = 5, the journal directory tops out at about 12 GB (five 2 GB backup journals plus the current 2 GB journal).
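
The arithmetic behind that estimate, as a minimal sketch. The function name is made up for illustration; the two defaults are the values quoted above, and the 500 GB figure is the journal volume size from the original report.

```python
def max_journal_dir_size_gb(journal_max_size_mb=2048, journal_max_backups=5):
    """Upper bound on journal directory size: the backups plus the current journal."""
    journal_files = journal_max_backups + 1
    return journal_files * journal_max_size_mb / 1024


size_gb = max_journal_dir_size_gb()
print(size_gb)                  # 12.0
print(f"{size_gb / 500:.1%}")   # 2.4% of a 500 GB journal volume
```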

FWIW, i3.xlarge instances work well as bookies; they come with a 950 GB attached NVMe SSD. You can put the journal directory and the ledger directory on the same device (use two separate partitions). This config gets you _close_ to the 1 TB-per-bookie setup you are testing with, but with 99th-percentile write latencies of 3 ms or less.

Also, I have noticed that the default _compactionRate_ of 1000 seems way too low. If you run a moderately sized bookie cluster with that setting, your compactions will take a really long time and old data will hang around much longer than it needs to. Personally I run at 10x that (10000) and it works fine. Finally, if you do use an i3.xlarge, it comes with 30 GB of RAM, so you can tune the bookie JVM settings in bkenv.sh to make better use of those additional resources.

Hope that helps.


[GitHub] [pulsar] pushkar-engagio edited a comment on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

pushkar-engagio edited a comment on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-647891125


   @sijie Yes, I understand that aspect of it, but I am not losing two bookies.
   
   I lose the first bookie due to a hardware failure, and I cannot get that bookie to recover.
   The rest of the bookies then gradually fail. The BookKeeper process is still running on them, but they are no longer part of the cluster. What I don't understand is why the other bookies are removed from the cluster. This happened over a period of a few hours.



[GitHub] [pulsar] pushkar-engagio commented on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

pushkar-engagio commented on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-649461626


   @wmccarley Thank you for your guidance and recommendations.
   I will review the IOPS usage during that time. The CPU usually goes up and stays around 100% for the brokers which are replicating data, but I will check the journal and ledger IOPS.
   
   In a nutshell, the cluster can handle the IOPS of regular operations but may not have enough IOPS left over for the added load during cluster recovery.
   
   I am using c5.xlarge for both the BookKeeper and broker clusters. We barely use around 2.5% of the journal storage, but I have kept it at around 500 GB to keep the additional IOPS that come with that size. Roughly 3% makes sense with the backups (~12.5 GB).
   During the initial evaluation I had considered i3.xlarge instances with 1 TB gp2 volumes for ledgers, but we went with 500 GB gp2 for the journal and 1 TB st1 for ledgers on c5.xlarge instances.
   
   I will evaluate the i3.xlarge instance size again. I will also look into the compaction rate.
   
   



[GitHub] [pulsar] pushkar-engagio edited a comment on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

pushkar-engagio edited a comment on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-647891125


   @sijie Yes, I understand that aspect of it, but I am not losing two bookies. The second and subsequent bookies are lost due to some sort of failure on the Pulsar side.
   
   I lose the first bookie due to a hardware failure, and I cannot get that bookie to recover.
   The rest of the bookies then gradually fail. The BookKeeper process is still running on them, but they are no longer part of the cluster. What I don't understand is why the other bookies are removed from the cluster. This happened over a period of a few hours.



[GitHub] [pulsar] wmccarley commented on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

wmccarley commented on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-649620891


   @pushkar-engagio Also, if you are not already doing so, think about running [node exporter](https://github.com/prometheus/node_exporter) on your bookies, because the IO stats AWS exposes through CloudWatch are insufficient to get the whole picture. Node exporter will give you more granular IOPS broken down between journal and ledger devices, as well as filesystem utilization and aggregate IO time stats.
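
As a rough illustration of what "IOPS broken down between journal and ledger" means in practice: node exporter publishes cumulative per-device counters, and IOPS is the change in a counter divided by the sampling interval. The sketch below assumes two scrapes of a completed-operations counter have already been parsed into dictionaries. The function name, the device names, and the sample values are hypothetical, and counter resets are not handled.

```python
def per_device_iops(earlier, later, interval_seconds):
    """Average IOPS per device between two samples of a cumulative counter.

    `earlier` and `later` map device name -> completed operations so far.
    """
    return {
        device: (later[device] - earlier[device]) / interval_seconds
        for device in earlier
        if device in later
    }


# Hypothetical samples taken 15 seconds apart; which device holds the journal
# and which holds the ledgers depends on how the bookie's disks are laid out.
earlier = {"journal_dev": 1_000_000, "ledger_dev": 4_000_000}
later = {"journal_dev": 1_045_000, "ledger_dev": 4_006_000}
print(per_device_iops(earlier, later, 15))
# {'journal_dev': 3000.0, 'ledger_dev': 400.0}
```

In a Prometheus setup the same calculation is normally done server-side with a rate query rather than by hand.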



[GitHub] [pulsar] sijie commented on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

sijie commented on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-647868405


   @pushkar-engagio Based on your description, it seems that you lost two bookies around the same time. The default replication setting for ledgers is 2, so in this case some ledgers will be unavailable. To handle this case, you might consider increasing the number of replicas to 3 to get higher availability.
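
The effect of the replica count can be shown with a small model. It is a sketch under simplifying assumptions: a ledger counts as unavailable only when every bookie holding a copy is down, ledgers are placed on every possible combination of bookies, and the bookie names and function name are invented for illustration. Real ledger placement and striping are more involved.

```python
from itertools import combinations


def unavailable_ledgers(ledger_replicas, failed_bookies):
    """Return the IDs of ledgers whose replicas all sit on failed bookies."""
    failed = set(failed_bookies)
    return [
        ledger_id
        for ledger_id, replicas in ledger_replicas.items()
        if set(replicas) <= failed
    ]


bookies = ["bk1", "bk2", "bk3", "bk4", "bk5", "bk6"]
# One ledger per possible pair (2 replicas) or triple (3 replicas) of bookies.
two_replicas = dict(enumerate(combinations(bookies, 2)))
three_replicas = dict(enumerate(combinations(bookies, 3)))

failed = ["bk1", "bk2"]
print(len(unavailable_ledgers(two_replicas, failed)))    # 1
print(len(unavailable_ledgers(three_replicas, failed)))  # 0
```

With two replicas, any ledger stored on exactly the two failed bookies is lost until one of them returns; with three replicas, two simultaneous failures always leave at least one copy.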



[GitHub] [pulsar] pushkar-engagio commented on issue #7328: Bookie cluster gradually fails when one of the bookie node goes down

Posted by GitBox <gi...@apache.org>.

pushkar-engagio commented on issue #7328:
URL: https://github.com/apache/pulsar/issues/7328#issuecomment-647891125


   @sijie Yes, I understand that aspect of it, but I am not losing two bookies.
   
   I lose the first bookie due to a hardware failure, and I cannot get that bookie to recover.
   The rest of the bookies then gradually fail. The BookKeeper process is still running on them, but they are no longer part of the cluster. What I don't understand is why the other bookies are removed from the cluster. This happened over a period of a few hours.

