You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/09/02 10:42:07 UTC

[GitHub] [bookkeeper] RaulGracia opened a new issue, #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

RaulGracia opened a new issue, #3466:
URL: https://github.com/apache/bookkeeper/issues/3466

   **BUG REPORT**
   
   ***Describe the bug***
   
   During our tests of Pravega (https://cncf.pravega.io/), we have detected a regression after upgrading to Bookkeeper 4.15 (https://github.com/pravega/pravega/pull/6676). The regression is consistently reproducible when doing a fault injection test (30% network packet loss). Note that this kind of test consistently pass using Bookkeeper 4.14, for which we believe that is a regression or a change in the behavior of the new Bookkeeper release.
   
   A reproduction of this issue goes as follows:
   
   1. Pravega creates a ledger, in this case ledger `3246`:
   ```
   2022-07-20T14:45:25.706422066+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 127,1  replyHeader:: 127,38654707364,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,#426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,v{s{31,s{'world,'anyone}}},0  response:: '/pravega/pravega/bookkeeper/ledgers/00/0000
 /L3246
   2022-07-20T14:45:25.706687124+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [ZKC-connect-executor-0-EventThread] INFO  o.a.b.client.LedgerCreateOp - Ensemble: [bookkeeper-bookie-2-11580, bookkeeper-bookie-0-28816] for ledger: 3246
   2022-07-20T14:45:25.706725535+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [core-13] INFO  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[13]: Created Ledger 3246.
   2022-07-20T14:45:47.426112317+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 145,4  replyHeader:: 145,38654707423,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,s{38654707364,38654707364,1658328325290,1658328325290,0,0,0,0,172,0,386
 54707364}
   2022-07-20T14:45:47.426207948+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - Registered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246.
   ```
   
   2. Probably after some network failures induced by the test, we see that this ledger is unregistered (has it been actually removed from Zookeeper?): 
   ```
   2022-07-20T14:45:47.426475848+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 146,4  replyHeader:: 146,38654707423,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,T  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,s{38654707364,38654707364,1658328325290,1658328325290,0,0,0,0,172,0,386
 54707364}
   2022-07-20T14:45:47.426637438+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.client.PendingReadLacOp - Read LAC complete with enough validResponse for ledger: 3246 LAC: -1
   2022-07-20T14:45:47.426737033+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [core-13] DEBUG o.a.b.meta.AbstractZkLedgerManager - Unregistered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246.
   2022-07-20T14:46:16.612224510+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write Add request to bookie 192.169.1.60/192.169.1.60:3181 for ledger 3246, entry 0
   2022-07-20T14:46:16.612271701+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR o.a.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum failed: L3246 E0
   2022-07-20T14:46:16.616440402+00:00 stdout F 2022-07-20 14:46:16,616 7094900 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write ReadLAC request to bookie null for ledger 3246, entry -1
   ```
   
   3. After that, Pravega still assumes that the ledger is there (as Pravega itself has not deleted it), but any further operation on this ledger fails: 
   ```
   2022-07-20T14:46:16.630253410+00:00 stdout F 2022-07-20 14:46:16,629 7094913 [core-30] DEBUG o.a.bookkeeper.client.LedgerHandle - Failed to close ledger 3246 :
   2022-07-20T14:46:16.630795088+00:00 stdout F 2022-07-20 14:46:16,630 7094914 [core-30] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[13]: Unable to close LedgerHandle for Ledger 3246.
   2022-07-20T14:46:16.630795088+00:00 stdout F io.pravega.segmentstore.storage.DurableDataLogException: Unable to close ledger 3246.
   ```
   
   4. This makes that, when finding the ledger missing, that Segment Container in Pravega is stuck and cannot recover:
   ```
   2022-07-20T15:59:48.793802385+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.bookkeeper.client.PendingReadOp - No such entry found on bookie.  L3246 E0 bookie: bookkeeper-bookie-0-28816
   2022-07-20T15:59:48.793815178+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR o.a.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L3246 E0-E0, Sent to [bookkeeper-bookie-0-28816, bookkeeper-bookie-2-11580], Heard from [] : bitset = {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc = null)
   2022-07-20T15:59:48.793906376+00:00 stdout F 2022-07-20 15:59:48,793 11507077 [core-1] INFO  i.p.s.s.c.StreamSegmentContainerMetadata - SegmentContainer[13]: Exit RecoveryMode.
   ...
   2022-07-20T15:59:48.794030332+00:00 stdout F 2022-07-20 15:59:48,794 11507078 [core-8] INFO  i.p.s.s.r.StreamSegmentReadIndex - ReadIndex[13-123]: Closed.
   2022-07-20T15:59:48.794605613+00:00 stdout F 2022-07-20 15:59:48,794 11507078 [core-1] ERROR i.p.s.server.logs.DurableLog - DurableLog[13] Recovery FAILED.
   2022-07-20T15:59:48.794605613+00:00 stdout F org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists on Bookies
   2022-07-20T15:59:48.794605613+00:00 stdout F 	at org.apache.bookkeeper.client.BKException.create(BKException.java:74)
   2022-07-20T15:59:48.794605613+00:00 stdout F 	at org.apache.bookkeeper.client.PendingReadOp.submitCallback(PendingReadOp.java:654)
   2022-07-20T15:59:48.794605613+00:00 stdout F 	at org.apache.bookkeeper.client.PendingReadOp$LedgerEntryRequest.fail(PendingReadOp.java:176)
   2022-07-20T15:59:48.794605613+00:00 stdout F 	at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.sendNextRead(PendingReadOp.java:397)
   2022-07-20T15:59:48.794605613+00:00 stdout F 	at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.logErrorAndReattemptRead(PendingReadOp.java:440)
   ```
   
   5. At the Bookies side, we can see that the fence request is received by one Bookie (bookie-2) at `2022-07-20T14:48:11`:
   ```
   2-bookie-9c88965c-2022-07-20T14-49-58Z.log.gz:2022-07-20T14:48:11.157496389+00:00 stdout F 14:48:11,157 INFO  Ledger fence request received for ledger: 3246 from address: /192.168.0.228:57782
   2-bookie-9c88965c-2022-07-20T14-49-58Z.log.gz:2022-07-20T14:48:11.160872045+00:00 stdout F 14:48:11,160 INFO  Ledger fence request received for ledger: 3246 from address: /192.168.0.228:57782
   ```
   
   6. And that ledger seems to be present in the metadata of ledgers, at least for some time (look for `L3246`):
   ```
   2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.716729331+00:00 stdout F 14:51:48,716 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 36287,8  replyHeader:: 36287,38654709412,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F  response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L0377,'L3089,'L3087,'L0372,'L3052,'L3173,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3290,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3297,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L3310,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3328,'L3208,'L3329,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'
 L3093,'L3091,'L0388,'L0702,'L0704,'L0703,'L2327,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L2452,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L0710,'L3305,'L3115,'L3236,'L3116,'L3237,'L0762,'L3113,'L3234,'L3114,'L3233,'L3230,'L3110,'L3231,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L0771,'L3126,'L1983,'L3127,'L3248,'L3245,'L3125,'L3246,'L3122,'L3243,'L3123,'L3244,'L3120,'L3000,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3104,'L3105,'L3226,'L3102,'L3103,'L3224,'L3101,'L2130,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3107,'L3030,'L3151,'L3031,'L3152,'L3273,'L3150,'L3038,'L3159,'L2984,'L3278,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L2064,'L3032,'L3153,'L3274,'L3033,'L3154,'L3275,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3283,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L29
 99,'L2996,'L3130,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L2971,'L3147,'L3024,'L3145,'L3142,'L3143,'L2979,'L2735,'L2974,'L3029}
   2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.736027964+00:00 stdout F 14:51:48,735 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 36521,4  replyHeader:: 36521,38654709412,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
   
   2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.070507213+00:00 stdout F 14:55:21,070 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 36739,8  replyHeader:: 36739,38654709965,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F  response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L0377,'L3089,'L3087,'L0372,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3329,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'L3093,'L3091,'L0388,'L0702,'L0704,'
 L0703,'L2327,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L2452,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L0710,'L3305,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3031,'L3152,'L3273,'L3150,'L3390,'L3038,'L3159,'L2984,'L3278,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L2064,'L3032,'L3153,'L3274,'L3033,'L3154,'L3275,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L32
 80,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3370,'L3137,'L3379,'L3017,'L3014,'L3135,'L3377,'L3015,'L3136,'L3378,'L3012,'L3375,'L3134,'L3010,'L3131,'L3252,'L3373,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3383,'L3380,'L3381,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3387,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
   2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.090055084+00:00 stdout F 14:55:21,090 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 36970,4  replyHeader:: 36970,38654709965,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
   2-bookie-9c88965c-2022-07-20T14-55-52Z.log.gz:2022-07-20T14:55:21.095978533+00:00 stdout F 14:55:21,095 INFO  Following ledgers: [3329, 771, 3076, 3077, 3081, 3338, 2064, 3089, 3346, 3091, 3092, 3348, 1814, 3094, 3350, 2327, 3095, 3351, 3096, 3352, 2073, 3354, 3101, 3102, 3104, 3105, 3107, 3108, 3114, 3115, 3116, 3118, 3119, 2610, 3122, 3123, 3125, 3130, 3131, 3132, 3134, 3135, 3136, 3139, 3140, 3141, 3143, 3147, 3148, 3150, 3151, 3152, 3153, 3155, 3156, 3159, 3160, 3162, 3163, 3164, 3165, 3166, 3167, 3169, 3171, 3172, 3175, 3176, 3177, 3182, 3183, 3184, 372, 3199, 3202, 388, 3205, 390, 3208, 3209, 3211, 2957, 3214, 2452, 3224, 3226, 2971, 3230, 671, 3231, 3233, 3236, 1705, 3243, 2988, 3244, 3245, 3246, 2735, 692, 2485, 694, 2999, 3000, 702, 447, 703, 704, 707, 708, 710, 3014, 712, 713, 3273, 714, 3274, 3277, 3278, 463, 3025, 3281, 3029, 3031, 3032, 3033, 3291, 732, 3037, 734, 3038, 3295, 736, 3040, 737, 3041, 738, 3042, 739, 3043, 740, 741, 3047, 744, 745, 3305, 3050, 3306, 3307,
  3052, 3308, 3053, 3054, 3055, 3057, 3058, 756, 3061, 3317, 758, 759, 3063, 3064, 3065, 3066, 763, 3068, 765, 3069, 767, 3071, 3327] of bookie: [bookkeeper-bookie-0-28816] are identified as underreplicated
   
   2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:19.912265378+00:00 stdout F 14:56:19,912 DEBUG Reading reply session id: 0x1016bac9b630005, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 6,8  replyHeader:: 6,38654710371,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F  response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L3401,'L0377,'L3400,'L3089,'L3087,'L0372,'L3407,'L3408,'L3405,'L3406,'L3403,'L3404,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L
 3092,'L3093,'L3091,'L3412,'L3413,'L0388,'L3410,'L3411,'L0702,'L0704,'L3418,'L0703,'L3419,'L2327,'L3416,'L3417,'L3414,'L3415,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L3423,'L3424,'L3421,'L3422,'L2452,'L3420,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L3425,'L0710,'L3305,'L3426,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3393,'L3031,'L3152,'L3273,'L3391,'L3150,'L3392,'L3038,'L3159,'L2984,'L3278,'L3399,'L2982,'L3037,'L3279,'L315
 5,'L2980,'L3156,'L3277,'L3398,'L2064,'L3032,'L3153,'L3274,'L3395,'L3033,'L3154,'L3275,'L3396,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3378,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
   2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:19.932194546+00:00 stdout F 14:56:19,932 DEBUG Reading reply session id: 0x1016bac9b630005, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 12,8  replyHeader:: 12,38654710371,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F  response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L3401,'L0377,'L3400,'L3089,'L3087,'L0372,'L3407,'L3408,'L3405,'L3406,'L3403,'L3404,'L3052,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3208,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,
 'L3092,'L3093,'L3091,'L3412,'L3413,'L0388,'L3410,'L3411,'L0702,'L0704,'L3418,'L0703,'L3419,'L2327,'L3416,'L3417,'L3414,'L3415,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L3423,'L3424,'L3421,'L3422,'L2452,'L3420,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L3425,'L0710,'L3305,'L3426,'L3350,'L3115,'L3236,'L3116,'L3237,'L3358,'L0762,'L3113,'L3234,'L3114,'L3353,'L3233,'L3354,'L3230,'L3351,'L3110,'L3231,'L3352,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L3360,'L0771,'L3126,'L3368,'L1983,'L3248,'L3245,'L3366,'L3125,'L3246,'L3122,'L3243,'L3364,'L3123,'L3244,'L3365,'L3120,'L3000,'L3363,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3338,'L3104,'L3346,'L3105,'L3226,'L3347,'L3102,'L3103,'L3224,'L3345,'L3101,'L2130,'L3341,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3348,'L3107,'L3349,'L3030,'L3151,'L3393,'L3031,'L3152,'L3273,'L3391,'L3150,'L3392,'L3038,'L3159,'L2984,'L3278,'L3399,'L2982,'L3037,'L3279,'L3
 155,'L2980,'L3156,'L3277,'L3398,'L2064,'L3032,'L3153,'L3274,'L3395,'L3033,'L3154,'L3275,'L3396,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L2999,'L2996,'L3130,'L3372,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3378,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L3374,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3382,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L3388,'L2971,'L3147,'L3389,'L3024,'L3142,'L3384,'L3143,'L3385,'L2979,'L2735,'L2974,'L3029}
   
   2-bookie-9c88965c-2022-07-20T15-00-53Z.log.gz:2022-07-20T14:56:32.051382550+00:00 stdout F 14:56:32,051 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 40548,4  replyHeader:: 40548,38654710417,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
   ```
   
   7. But the ledger does not exist when we attempt to read from it:
   ```
   0-bookie-adfe5832-2022-07-20T18-23-58Z.log.gz:2022-07-20T15:59:48.625362817+00:00 stdout F 15:59:48,625 INFO  No ledger found while reading entry: 0 from ledger: 3246
   2-bookie-9c88965c-2022-07-20T17-26-08Z.log.gz:2022-07-20T15:59:48.624934442+00:00 stdout F 15:59:48,624 INFO  No ledger found while reading entry: 0 from ledger: 3246
   ...
   2-bookie-9c88965c-2022-07-20T17-26-08Z.log.gz:2022-07-20T16:39:42.753841016+00:00 stdout F 16:39:42,753 INFO  No ledger found while reading entry: 0 from ledger: 3246
   ```
   
   In my understanding, from the Pravega point of view ledger `3246` is created and not explicitly deleted by Pravega itself, which means that there has been a data loss, most likely induced by some corner case exposed via network packet loss. In normal conditions, where there are no induced packet loss, things work fine. But it does not mean that, in a real cluster, a similar network failure could happen and the same issue could be reproduced. Again, we have been doing fault injection tests with Bookkeeper 4.14 and probably previous versions as well (e.g., Bookkeeper 4.11), and this is the first time we encounter this behavior.
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Write to Bookkeeper using Pravega (https://github.com/pravega/pravega).
   2. Introduce 30% network packet loss with LitmusChaos (see https://www.youtube.com/watch?v=EU_g8cKa1G8 for a demo on how to reproduce this experiment)
   5. Consistently, Pravega gets a `No Such Ledger Exception` in some containers using Bookkeeper 4.15 (with Bookkeeper 4.14, the system is able to work under such conditions)
   
   ***Expected behavior***
   
   Bookkeeper 4.15 should offer a similar behavior in how ledgers are managed compared to Bookkeeper 4.14 (and previous versions).
   
   ***Screenshots***
   
   n/a
   
   ***Additional context***
   
   At the moment, there are 2 changes that could be related to this issue (but I cannot confirm that they actually cause the problem):
   1. Upgrade of Zookeeper library from 3.6.x to 3.8.x.
   2. This PR in 4.15 and not in 4.14 seems related to how ledgers are managed: https://github.com/apache/bookkeeper/pull/3361


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] hangc0276 commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1240163221

   We found the same issue on the Pulsar side. https://github.com/apache/pulsar/issues/17520
   We tested bk 4.15.0, and the unit test passed, but for bk 4.15.1, the unit test failed.
   I revert PR https://github.com/apache/bookkeeper/pull/3239 in my local env, the unit test still failed with another exception.
   ```
   2022-09-08T10:57:49,000 - ERROR - [ForkJoinPool.commonPool-worker-6:ServerCnx@1312] - Try add schema failed, remote address /127.0.0.1:65225, topic persistent://pulsar/ns/testReplicationWithSchema-54c4f655-5443-49d0-8ef2-c65eb8a70c74, producerId 1
   java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Error while recovering ledger -  ledger=3 - operation=Failed to open ledger
   	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
   	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
   	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1141) ~[?:?]
   	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
   	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
   	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$openLedger$40(BookkeeperSchemaStorage.java:596) ~[classes/:?]
   	at org.apache.bookkeeper.client.LedgerOpenOp.openComplete(LedgerOpenOp.java:248) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.LedgerOpenOp$1.safeOperationComplete(LedgerOpenOp.java:216) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.LedgerOpenOp$1.safeOperationComplete(LedgerOpenOp.java:206) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.util.OrderedGenericCallback.operationComplete(OrderedGenericCallback.java:62) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$TimedGenericCallback.operationComplete(BookkeeperInternalCallbacks.java:189) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.ReadOnlyLedgerHandle.lambda$recover$5(ReadOnlyLedgerHandle.java:295) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
   	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
   	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
   	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
   	at org.apache.bookkeeper.client.LedgerRecoveryOp.submitCallback(LedgerRecoveryOp.java:159) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.LedgerRecoveryOp.access$100(LedgerRecoveryOp.java:37) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.LedgerRecoveryOp$1.readLastConfirmedDataComplete(LedgerRecoveryOp.java:136) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.client.ReadLastConfirmedOp.readEntryComplete(ReadLastConfirmedOp.java:155) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$1.readEntryComplete(PerChannelBookieClient.java:1880) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.lambda$errorOut$0(PerChannelBookieClient.java:1896) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue$1.safeRun(PerChannelBookieClient.java:1689) ~[bookkeeper-server-4.15.1.jar:4.15.1]
   	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[bookkeeper-common-4.15.1.jar:4.15.1]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar:4.1.77.Final]
   	at java.lang.Thread.run(Thread.java:833) ~[?:?]
   Caused by: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Error while recovering ledger -  ledger=3 - operation=Failed to open ledger
   	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:702) ~[classes/:?]
   	... 23 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1251935747

   @zymap thanks a lot for your analysis. Yes, you are right, we are using 4.14.2 concretely, as the last commit we have from mainline is [](https://github.com/apache/bookkeeper/commit/fb15b07892ceaa0955756a4c0f34149e1a142e14). And we do not have that PR #2794 when Bookkeeper works well with our fault injection tests. Maybe we could try another test with latest 4.14 commit, to see if the same issue is reproduced. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] zymap commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
zymap commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1251816693

   @RaulGracia 
   I saw the logs in step 2:
   
   ```
   2022-07-20T14:45:47.426475848+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x2016bc13e0a001c, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 146,4  replyHeader:: 146,38654707423,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,T  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933affffff8c18210218020ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff12813238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f674964122313360ffffffe2ffffffe5ffffffcdffffff8bffffff81ffffff93ffffffd8ffffff9c10,s{38654707364,38654707364,1658328325290,1658328325290,0,0,0,0,172,0,386
 54707364}
   2022-07-20T14:45:47.426637438+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.client.PendingReadLacOp - Read LAC complete with enough validResponse for ledger: 3246 LAC: -1
   2022-07-20T14:45:47.426737033+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [core-13] DEBUG o.a.b.meta.AbstractZkLedgerManager - Unregistered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246.
   2022-07-20T14:46:16.612224510+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write Add request to bookie 192.169.1.60/192.169.1.60:3181 for ledger 3246, entry 0
   2022-07-20T14:46:16.612271701+00:00 stdout F 2022-07-20 14:46:16,612 7094896 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR o.a.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum failed: L3246 E0
   2022-07-20T14:46:16.616440402+00:00 stdout F 2022-07-20 14:46:16,616 7094900 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG o.a.b.proto.PerChannelBookieClient - Could not write ReadLAC request to bookie null for ledger 3246, entry -1
   ```
   
   There have two things I noticed:
   
   1. The ledger seems closed after it received the ReadLAC response.
   
   ```
   Read LAC complete with enough validResponse for ledger: 3246 LAC: -1
   2022-07-20T14:45:47.426737033+00:00 stdout F 2022-07-20 14:45:47,426 7065710 [core-13] DEBUG o.a.b.meta.AbstractZkLedgerManager - Unregistered ledger metadata listener ReadOnlyLedgerHandle(lid = 3246, id = 672259987) on ledger 3246
   ```
   
   2. Bookie channel is null in the callback. That looks like when the callback is triggered, the channel is close.
   
   ```
   Could not write ReadLAC request to bookie null for ledger 3246, entry -1
   ```
   
   And looks like in the Pravega, the ledger handler closes again so the error happens again. (step 2 closed and step 3 closed again)
   
   ```
   2022-07-20T14:46:16.630253410+00:00 stdout F 2022-07-20 14:46:16,629 7094913 [core-30] DEBUG o.a.bookkeeper.client.LedgerHandle - Failed to close ledger 3246 :
   2022-07-20T14:46:16.630795088+00:00 stdout F 2022-07-20 14:46:16,630 7094914 [core-30] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[13]: Unable to close LedgerHandle for Ledger 3246.
   2022-07-20T14:46:16.630795088+00:00 stdout F io.pravega.segmentstore.storage.DurableDataLogException: Unable to close ledger 3246.
   ```
   
   -----
   
   I saw in the previous, you are using 4.14.1, I found there has [a change](https://github.com/apache/bookkeeper/pull/2794/files) related to this which was released in 4.14.3. 
   
   Could you please also run the test against with the 4.14.1 and 4.14.3?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241941899

   it is also interesting the the Writer tried to write entry 0
   > Could not write Add request to bookie 192.169.1.60/192.169.1.60:3181 for ledger 3246, entry 0
   
   but it wasn't able to even send the write request.
   That should lead to a "ensemble change", and to choose a new bookie.
   If you have 4 bookies then we should see something more in the logs and the operation should move forward and the BK client should have tried to write the same entry to another set of bookies.
   In the LedgerMetadata we should have multiple ensembles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241932411

   @RaulGracia do you have "auto-recovery" running into your cluster?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1280485726

   I am working on this.
   I am preparing a set of Pravega docker images to pass to Dell-EMC Quality Eng team who can run the tests.
   
   The. docker images contain a build of Pravega compatible with BK 4.15 (thanks to @RaulGracia ) but with the BK client compiled at different points in the GIT history of 4.15.0.
   
   This way we will see which commit broke the compatibility.
   
   It will take much time because those system tests take at least 24h before failing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1261911099

   FYI @zymap @eolivelli @dlg99 we have performed the same test again with Bookkeeper 4.14.3 client as suggested by @zymap, but we have not been able to reproduce the issue this time. So, at the moment, with the evidences we have we cannot say that #2794 is the root cause of the problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1259066353

   @dlg99 thanks Andrey. While this situation can indeed happen (specially inducing failure injection at the network level), my understanding is that that would happen to any version of Bookkeeper, right? However, our observation is that when executing this kind of tests, only Pravega using Bookkeeper 4.15 client gets into this situation where ledgers supposed to be there are not; when using Bookkeeper 4.14.2, the issue does not reproduce. Note that we have done this test many times with Bookkeeper 4.14.2, while the few times we executed that with Bookkeeper 4.15 client we always got into that situation. 
   
   We are currently executing the same test using Bookkeeper client 4.14.3, to validate the hypothesis of @zymap https://github.com/apache/bookkeeper/issues/3466#issuecomment-1251816693


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1324993830

   To wrap up this issue: We have done fault injection tests of Pravega and several Bookkeeper client versions and we have not been able to reproduce the original issue ([see details here](https://pravega-io.slack.com/archives/C0151LSM46L/p1663902150322149)). We have also upgraded Pravega `master` (future 0.14 release) to Bookkeeper 4.15.3. If the issue is reproduced again, we will reopen the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241930295

   I have done some testing on master branch.
   I have created some tests that simulate a "timeout" while performing recovery operations. And in every case I tests the BK client returns `BKLedgerRecoveryException`
   
   
   After re-reading again the logs in this issue it seems that:
   - the LAC on the ledger 3246 is at least 0 but no bookie has the entry
   - in the same process (assuming that the logs come from the same JVM) we have a ReadOnlyLedgerHandle on ledger 3246 but we are also writing to the ledger, so we have a WriteHandle
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1235540195

   Thanks @eolivelli, reposting the answer in Slack for better visibility: https://pravega-io.slack.com/archives/C0151LSM46L/p1662115749173509 
   
   > is the ledger znode present on ZK ?
   
   The cluster is not in state anymore, but if the Bookkeeper Client gets a `BKException$BKNoSuchLedgerExistsException`, my understanding is that it should not be there. However, I see that the ledger `3246 `is created at `2022-07-20 14:45:25,706` and then we can see in Bookie logs some queries on these ledger metadata ZNodes:
   ```
   2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.716729331+00:00 stdout F 14:51:48,716 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000 finished:false header:: 36287,8  replyHeader:: 36287,38654709412,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000,F  response:: v{'L3195,'L3070,'L3071,'L3192,'L2784,'L3078,'L3199,'L3079,'L3076,'L3197,'L3077,'L3198,'L3081,'L0377,'L3089,'L3087,'L0372,'L3052,'L3173,'L3053,'L3295,'L3050,'L3171,'L3172,'L3293,'L3290,'L3170,'L3291,'L0463,'L3058,'L0464,'L3059,'L3056,'L3177,'L3057,'L3178,'L0461,'L3054,'L3175,'L3055,'L3176,'L3297,'L3063,'L3184,'L3064,'L3061,'L3182,'L3062,'L3183,'L3181,'L3069,'L3067,'L3188,'L3068,'L3065,'L3066,'L3310,'L1814,'L3317,'L3203,'L3324,'L3322,'L3202,'L3323,'L3200,'L0734,'L1705,'L0737,'L3209,'L0736,'L3207,'L3328,'L3208,'L3329,'L3205,'L0732,'L3327,'L0739,'L0738,'L3096,'L3094,'L3095,'L3092,'
 L3093,'L3091,'L0388,'L0702,'L0704,'L0703,'L2327,'L0709,'L0706,'L0705,'L0708,'L0707,'L0390,'L3302,'L2452,'L0713,'L0712,'L3308,'L0714,'L3306,'L3307,'L0711,'L0710,'L3305,'L3115,'L3236,'L3116,'L3237,'L0762,'L3113,'L3234,'L3114,'L3233,'L3230,'L3110,'L3231,'L0768,'L0404,'L0767,'L0769,'L0764,'L3119,'L0763,'L3238,'L0765,'L3118,'L0771,'L3126,'L1983,'L3127,'L3248,'L3245,'L3125,'L3246,'L3122,'L3243,'L3123,'L3244,'L3120,'L3000,'L0418,'L2957,'L3009,'L2955,'L3007,'L3214,'L0740,'L2485,'L3211,'L3330,'L0745,'L3218,'L0741,'L0744,'L3104,'L3105,'L3226,'L3102,'L3103,'L3224,'L3101,'L2130,'L0757,'L0756,'L2816,'L0759,'L0758,'L3108,'L3107,'L3030,'L3151,'L3031,'L3152,'L3273,'L3150,'L3038,'L3159,'L2984,'L3278,'L2982,'L3037,'L3279,'L3155,'L2980,'L3156,'L3277,'L2064,'L3032,'L3153,'L3274,'L3033,'L3154,'L3275,'L2989,'L0444,'L2988,'L0447,'L2985,'L2986,'L2073,'L3041,'L3162,'L3283,'L3042,'L3163,'L3160,'L3281,'L3040,'L3161,'L3280,'L0694,'L0696,'L2992,'L3047,'L3048,'L3169,'L3166,'L3167,'L0692,'L3043,'L3164,'L3165,'L29
 99,'L2996,'L3130,'L3137,'L3017,'L3014,'L3135,'L3015,'L3136,'L3012,'L3134,'L3010,'L3131,'L3252,'L3132,'L2969,'L2966,'L3018,'L3139,'L3019,'L3140,'L3020,'L3141,'L3148,'L3269,'L0671,'L2610,'L2973,'L3149,'L3025,'L2971,'L3147,'L3024,'L3145,'L3142,'L3143,'L2979,'L2735,'L2974,'L3029}
   2-bookie-9c88965c-2022-07-20T14-55-24Z.log.gz:2022-07-20T14:51:48.736027964+00:00 stdout F 14:51:48,735 DEBUG Reading reply session id: 0x3016baeab7f000f, packet:: clientPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 serverPath:/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 finished:false header:: 36521,4  replyHeader:: 36521,38654709412,0  request:: '/pravega/pravega/bookkeeper/ledgers/00/0000/L3246,F  response:: #426f6f6b69654d65746164617461466f726d617456657273696f6e933a7c8210218ffffff9f92002833238a19626f6f6b6b65657065722d626f6f6b69652d322d3131353830a19626f6f6b6b65657065722d626f6f6b69652d302d32383831361003834204825a16ab6170706c69636174696f6e127507261766567615a15af426f6f6b4b65657065724c6f6749641223133600,s{38654707364,38654708391,1658328325290,1658328491164,2,0,0,0,155,0,38654707364}
   ```
   I don't know if the fact that this request queries `/pravega/pravega/bookkeeper/ledgers/00/0000/L3246 `and gets some actual payload in response means that it actually exists in Zookeeper.
   
   > which are your replication parameters: ES, WQ, AQ ?
   
   In this experiment, the configuration is `ensembleSize=2, writeQuorumSize=2, ackQuorumSize=2`.
   
   > how many bookies do you have ?
   
   The Bookkeeper service is configured to keep 4 Bookies.
   
   > how many entries are supposed to be in the ledger ? only 1 ?
   
   Pravega rolls over ledger when they reach their max size (1GB default) or when there is some container recovery (due to fencing). So, this particular ledger was created but contained 0 entries at the moment when the issue happened. It would have contained many more in case the issue wouldn't had appeared.
   
   > at a first glance it looks to me that N bookies are answering that they do not know the ledger
   
   Could be, according to this log, the ledger should be created using 2 Bookies:
   `2022-07-20T14:45:25.706687124+00:00 stdout F 2022-07-20 14:45:25,706 7043990 [ZKC-connect-executor-0-EventThread] INFO  o.a.b.client.LedgerCreateOp - Ensemble: [bookkeeper-bookie-2-11580, bookkeeper-bookie-0-28816] for ledger: 3246`
   But in the Bookie logs, I mainly see `2-bookie-9c88965c` with logs related to ledger `3246`. The test inducing network drops may be exposing this potential issue that is hard to discover otherwise.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1235456382

   another question:
   - how many entries are supposed to be in the ledger ? only 1 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] dlg99 commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1259720850

   @RaulGracia My thought process was along the lines of updated ZK client making the problem easier to repro if that's ZK related.
   I experimented locally and tried to inject errors in unit tests but had no luck.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia closed issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia closed issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15
URL: https://github.com/apache/bookkeeper/issues/3466


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1265261133

   Thanks a lot @dlg99, I can certainly review  #3513. But after the latest experiments we did with Bookkeeper 4.14.3, which includes #2794, we couldn't reproduce the issue, so I'm not sure that that PR is contributing to the issue described here. In any case, it is independent from the value that #3513 brings by itself. 
   
   > Can the lost ledger handle might be caused by the way Pravega handles errors in ledger open?
   
   The point is that the very same Pravega code has no issues using Bookkeeper 4.14.2 client and injecting network failures, and we consistently get issues like the one describe here when upgrading to Bookkeeper 4.15 client. For this reason, I thought that the problem may be related to some change in the Bookkeeper client within `branch-4.15`.
   
   I think that to discar that the issue is present in the whole `branch-4.14`,  we will reproduce fault injection tests with latest 4.14 release; if the issue is cannot be reproduced, we know that `branch-4.14` is fine in this regard (and we do not need to inspect cherry picked PRs). If the issue is reproduced, the number of commits to analyze would be much smaller.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241931206

   I did a "git diff" between `branch-4.14` and `master` branch on the `bookkeeper-server/src/main/java/org/apache/bookkeeper/client` directory and I can't find any change that can be related to this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241692383

   @eolivelli @zhaohaidao @hangc0276 We have done another set of failure injection tests to narrow down the issue. These are the results:
   - Bookkeeper 4.14 server and Bookkeeper 4.14 client: Tests pass
   - Bookkeeper 4.15 server and Bookkeeper 4.14 client: Tests pass
   - Bookkeeper 4.15 server and Bookkeeper 4.15 client: Issue reproduced
   - Bookkeeper 4.14 server and Bookkeeper 4.15 client: Issue reproduced
   
   With this, I think we can narrow down the regression to be in the Bookkeeper client only in Bookkeeper 4.15. Hope this helps us to shortlist the set of PRs that can be related to this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] dlg99 commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1263923092

   @RaulGracia I am trying to wrap my head around how https://github.com/apache/bookkeeper/pull/2794 could result in a missing ledger handle.
   
   One thing off with the PR is that it calls blocking lh.close() from the callback running on the main threadpool. AFAICT there are no deadlocks because whatever else happens runs on the ZK. Zk can be quite slow because of injected network errors.
   So here is the proposed change: https://github.com/apache/bookkeeper/pull/3513 (waiting for the tests to finish up)
   
   Can the lost ledger handle might be caused by the way Pravega handles errors in ledger open?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1341079895

   Hi all, I think we could have another reproduction of this problem, this time using Bookkeeper 4.14.2 client/server (therefore, the issue reported here may be present in other versions of Bookkeeper, not only in Bookkeeper 4.15).
   
   This is the information I have been able to recollect so far (the missing ledger this time is `3538`):
   
   1. Ledger is created correctly: 
   ```
   2022-12-05 15:21:11,541 8491908 [core-23] INFO  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Created Ledger 3538.
   ``` 
   
   2. Pravega can write and process properly up to entry `52`:
   ```
   2022-12-05 15:26:53,197 8833564 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG o.a.bookkeeper.client.PendingAddOp - Submit callback (lid:3538, eid: 52). rc:0
   2022-12-05 15:26:53,197 8833564 [core-17] DEBUG i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId = 51, LastFullySerializedSN = 505834, LastStartedSN = 505834, Address = Sequence = 17179869235, LedgerId = 3538, EntryId = 51, Length = 35244).
   2022-12-05 15:26:53,197 8833564 [core-28] DEBUG i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId = 52, LastFullySerializedSN = 506120, LastStartedSN = 506120, Address = Sequence = 17179869236, LedgerId = 3538, EntryId = 52, Length = 783197).
   ```
   
   3. Post that, we see that, probably due to the network fault-injection test induced, there is a ensamble change that impacts entry `53` onwards for that ledger:
   ```
   2022-12-05 15:27:07,801 8848168 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, change-id:0000000001)][attempt:1] changing ensemble from: [bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445] to: [bookkeeper-bookie-2-13057, bookkeeper-bookie-0-27781] starting at entry: 53
   ```
   
   4. We keep getting some errors and acknowledgements writing entries from `53` onwards, until we get this error:
   ```
   2022-12-05 15:27:15,431 8855798 [ZKC-connect-executor-0-EventThread] ERROR o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=1ec0500f) Error writing metadata to store
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,432 8855799 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN  o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, change-id:0000000001)][attempt:1] Exception changing ensemble
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,433 8855800 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR o.a.bookkeeper.client.LedgerHandle - Closing ledger 3538 due to ZKException: Error while using ZooKeeper
   ```
   
   5. After that, the zNode for that very same ledger `3538` seems to be deleted:
   ```
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Got WatchedEvent state:SyncConnected type:NodeDeleted path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 for session id 0x3004cfb266c0003
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - Received watched event WatchedEvent state:SyncConnected type:NodeDeleted path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 from zookeeper based ledger manager.
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - No ledger metadata listeners to remove from ledger 3538 after it's deleted.
   ```
   
   6. And it seems deleted because, after that, we just see `NoNodeException` when we try to access the Zookeeper metadata of this ledger:
   ```
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - No such ledger: 3538
   segmentstore_2022-12-05_15.640.log.gz:org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] ERROR o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=4e9e5246) Failed to read metadata from store
   2022-12-05 15:27:23,219 8863586 [core-19] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Unable to close LedgerHandle for Ledger 3538.
      io.pravega.segmentstore.storage.DurableDataLogException: Unable to close ledger 3538.
   ```
   
   Once the issue was reported, I tried to dig a little more with the Pravega and Bookkeeper CLIs:
   
   - From Pravega side, if we inspect the `BookkeeperLog[29]`, we clearly see that ledger `3538` is not accessible:
   ```
   > bk details 29
   {
     "key": "log_summary",
     "value": {
       "logId": 29,
       "epoch": 9,
       "version": 511,
       "enabled": true,
       "ledgers": 4,
       "truncation": "Sequence = 12884902323, LedgerId = 3491, EntryId = 435"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3491,
       "seq": 3,
       "status": "NotEmpty",
       "lac": 971,
       "length": 0,
       "numBookies": 3,
       "numFragments": 2,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-0-27781,bookkeeper-bookie-1-21445],972: [bookkeeper-bookie-3-9007,bookkeeper-bookie-1-21445]"
     }
   }
   Exception executing BK details command: Unable to open-read ledger 3538.
   	Ledger 3538: Seq = 4, Status = NotEmpty. BK: Unable to open-read ledger 3538.
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3720,
       "seq": 8,
       "status": "Empty",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-2-13057,bookkeeper-bookie-3-9007]"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3738,
       "seq": 9,
       "status": "Unknown",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-0-27781,bookkeeper-bookie-3-9007]"
     }
   }
   ```
   
   - Using the Bookkeeper CLI, if we list all the ledgers in the system, we can see ledger `3538`:
   ```
   bin/bookkeeper shell listledgers
   ...
   17:39:21,474 INFO  ledgerID: 3537
   17:39:21,474 INFO  ledgerID: 3538     <<<<<< HERE IT IS
   17:39:21,474 INFO  ledgerID: 3539
   ...
   ```
   - However, it is not possible to get its contents or metadata:
   
   ```
   ===== LEDGER: 3538 =====
   master key  : 850bf1071c5e3d8c24235676f8816ae0cbe2f14f
   size         : 0
   entries      : 0
   isFenced     : true
   ===== LEDGER: 3538 =====
   ```
   
   ```
   bin/bookkeeper shell readledger -ledgerid 3538
   
   17:37:18,747 INFO  Successfully connected to bookie: bookkeeper-bookie-1-21445 [id: 0x45c77aa8, L:/10.128.0.116:33870 - R:10.129.0.171/10.129.0.171:3181]
   17:37:18,748 INFO  connection [id: 0x45c77aa8, L:/10.128.0.116:33870 - R:10.129.0.171/10.129.0.171:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
   17:37:18,755 ERROR Read of ledger entry failed: L3538 E0-E0, Sent to [bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445], Heard from [] : bitset = {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc = null)
   17:37:18,761 ERROR Error reading entry 0 from ledger 3538
   org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists on Bookies
   	at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.14.3-build-223.jar:4.14.3-build-223]
   ```
   
   ```
   bin/bookkeeper shell ledgermetadata -ledgerid 3538
   
    17:42:00,323 WARN  An exception was thrown while closing send thread for session 0x2004d6096890377.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x2004d6096890377, likely server has closed socket
   	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) [org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: java.util.concurrent.ExecutionException: org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsOnMetadataServerException: No such ledger exists on Metadata Server
   ```
   
   How inducing delays at the network level in the pods leads very rarely to this situation is something that I still don't know. I'm also attaching the relevant log lines from Pravega Segment Store for ledger `3538` ([ledger-3538.log](https://github.com/apache/bookkeeper/files/10177104/ledger-3538.log)).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1235391427

   A couple of questions:
   - is the ledger znode present on ZK ?
   - which are your replication parameters: ES, WQ, AQ ?
   - how many bookies do you have ?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] zhaohaidao commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
zhaohaidao commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1238939421

   Not sure if it has something to do with this pr: ISSUE 3220-Autorecovery does not process underreplicated empty ledgers
   https://github.com/apache/bookkeeper/pull/3239
   If network packet loss causes a bookie to be determined to be unavailable. the ledgers on this unavailable bookie will become underreplicated and this ledgers will be closed by the autorecovery service.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] dlg99 commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1258858021

   @RaulGracia is there any chance that lossy network caused some kind of split brain or data inconsistency on ZK side?
   e.g. I recently dealt with the situation when out of 5 ZK nodes one had missed specific node, so the clients that were lucky enough to read form that node didn't see the data. 
   Can you browse the BK metadata using zkCli on all zk nodes to confirm that metadata for the missing ledger is either present (and the same) or missing on all nodes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] eolivelli commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1241938958

   > Not sure if it has something to do with this pr: ISSUE 3220-Autorecovery does not process underreplicated empty ledgers
   > #3239
   > If network packet loss causes a bookie to be determined to be unavailable. the ledgers on this unavailable bookie will become underreplicated and this ledgers will be closed by the autorecovery service.
   
   @zhaohaidao your idea is interesting.
   
   The fact that the client tries to read from entry 0 even if the ledger is empty means that something CLOSED the ledger and set the LAC to 0 (and not -1).
   
   I am saying that LAC should be at least 0 because of these logs:
   `2022-07-20T15:59:48.794605613+00:00 stdout F 2022-07-20 15:59:48,794 11507078 [core-1] ERROR i.p.s.server.logs.DurableLog - DurableLog[13] Recovery FAILED.`
   It seems (please @RaulGracia confirm) that DurableLog is scanning the ledger, and it receives an error during a "readEntries" operation.
   `readEntries` fails fast if you try to read after the local value of LAC, but here we see that the BK client is trying to perform the PendingReadOp and so locally LAC is >= 0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org