Posted to oak-dev@jackrabbit.apache.org by John Logan <Jo...@texture.com> on 2016/11/08 21:42:49 UTC

Read cache contents being continuously sent from primary to cold standby?

Hi,

I'm trying to track down what seems to be a problem in my
Sling cold standby configuration.  I sent my original email
to the Sling users mailing list a week ago, and I've included it at
the end because it contains my OSGi configuration for
Oak and cold standby.

Since the time of the original questions, it appears that my
cold standby has made progress, but now I'm seeing a behavior
that I didn't expect.

The tarmk.log (below) for my cold standby shows that the head is being
updated every 5-10 hours, with the longer intervals corresponding
to 3pm to midnight local time.

The primary's JMX standby metrics show that the number of transferred
segments is not changing, while the number of transferred binaries
continues to increase.

I did some other checks on the read caches and the shared S3 bucket,
and it looks like read cache contents are being transferred from
the primary to the standby over and over again.  Is this expected
behavior?  If not, how do I troubleshoot this?

Thanks!  
John Logan

tarmk.log:

2016-11-05 23:30:05,045 sending head request
2016-11-05 23:30:05,045 did send head request
2016-11-05 23:30:05,081 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 10:09:25,044 sending head request
2016-11-06 10:09:25,044 did send head request
2016-11-06 10:09:25,092 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 15:07:50,045 sending head request
2016-11-06 15:07:50,045 did send head request
2016-11-06 15:07:50,082 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 20:10:20,044 sending head request
2016-11-06 20:10:20,045 did send head request
2016-11-06 20:10:20,120 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 06:26:50,044 sending head request
2016-11-07 06:26:50,044 did send head request
2016-11-07 06:26:50,085 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 11:23:15,045 sending head request
2016-11-07 11:23:15,045 did send head request
2016-11-07 11:23:15,140 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 17:07:00,045 sending head request
2016-11-07 17:07:00,045 did send head request
2016-11-07 17:07:00,141 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-07 22:01:40,052 sending head request
2016-11-07 22:01:40,052 did send head request
2016-11-07 22:01:40,099 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-08 08:26:35,044 sending head request
2016-11-08 08:26:35,044 did send head request
2016-11-08 08:26:35,183 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-08 14:05:50,046 sending head request
2016-11-08 14:05:50,046 did send head request
2016-11-08 14:05:50,159 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
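
One way to check the pattern in the log above is to extract every "updating current head to" line and look at the gaps between them and at the set of head IDs reported. The following is only a rough sketch: it assumes each line starts with a 23-character timestamp exactly as in the excerpt (a full tarmk.log line may carry more fields), and the names head_updates and summarize are made up for illustration. The timestamps are treated as naive local time, so a gap that spans a daylight saving change will be off by an hour.

```python
from datetime import datetime

MARKER = "updating current head to "

def head_updates(lines):
    """Yield (timestamp, head_id) for each head-update line.

    Assumes the line begins with 'YYYY-MM-DD HH:MM:SS,mmm' as in the excerpt.
    """
    for line in lines:
        line = line.strip()
        if MARKER not in line:
            continue
        stamp = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        head_id = line.split(MARKER, 1)[1].strip()
        yield stamp, head_id

def summarize(lines):
    """Return (gaps between consecutive updates in hours, set of distinct heads)."""
    updates = list(head_updates(lines))
    gaps = [
        (later[0] - earlier[0]).total_seconds() / 3600
        for earlier, later in zip(updates, updates[1:])
    ]
    heads = {head_id for _, head_id in updates}
    return gaps, heads

# Three sync cycles copied from the excerpt above.
SAMPLE = """\
2016-11-05 23:30:05,045 sending head request
2016-11-05 23:30:05,081 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 10:09:25,044 sending head request
2016-11-06 10:09:25,092 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
2016-11-06 15:07:50,045 sending head request
2016-11-06 15:07:50,082 updating current head to 4cbe6d89-284c-4c4b-ac34-498068a9bcca.fe09
""".splitlines()

if __name__ == "__main__":
    gaps, heads = summarize(SAMPLE)
    print("gaps (hours):", [round(g, 2) for g in gaps])
    print("distinct heads:", heads)
```

If the set of distinct heads has only one entry, as it does for the excerpt, every cycle ended on the same head ID.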


On Wed, 2016-11-02 at 17:28 +0000, John Logan wrote:
> Hi,
> 
> I'm setting up a TarMK cold standby for a repository for the first time, and
> have a couple of questions regarding synchronization and administration.
> I've included the configuration and current dump of the primary and standby
> MBeans below.  The primary and standby are in peered VPCs in AWS, using a
> shared S3 bucket for blob storage.
> 
> 1.) I'm curious as to how long I should expect to wait for the standby
> to establish synchronization.  How much data gets moved over the wire? 
> I'm seeing a steady stream of read cache invalidations on the standby -
> does this mean that all of the blob data must be transferred, even
> though the two repositories use shared storage?
> 
> 2.) I see in the logs a period where there are read cache invalidations,
> and then there is a 12-hour period in which nothing is logged, followed
> by a "org.apache.jackrabbit.oak.plugins.segment.standby.client.SegmentLoaderHandler timeout"
> message.  The quiet period is consistent with my setting
> standby.readtimeout=I"43200000".  Would it make sense to choose a
> shorter timeout to lessen the impact of occasional network issues?
> At what point might the timeout value be "too short"?
> 
> 3.) Is there a definitive way to know that the standby is synced? 
> The SyncEndTimestamp value below corresponds to 2016-11-02T09:26:18+00:00,
> which corresponds exactly to the timestamp of the
> "SegmentLoaderHandler timeout" message.  This suggests that this
> value doesn't really tell me that the standby is synchronized. 
> When I tried with small repositories, it appears that synchronization
> was done when the tarmk.log file started outputting the same repository
> head every 5 seconds ("interval" setting).
> 
> 4.) Assuming that the standby eventually becomes synchronized,
> is there a documented procedure by which I could "split the mirror";
> that is, convert the standby into a new, independent primary
> containing a replica of the original?  If the current primary
> and standby are referring to S3 bucket "P", could I shut down
> both instances, copy the contents of bucket "P" to a new bucket
> "S", update the standby Oak S3 configuration to refer to the new
> bucket "S", and restart what was the standby as a new primary? 
> Are there other steps I would need to take?
> 
> Thanks!  John
> 
> 
> CONFIG VALUES FOR BOTH INSTANCES
> 
> 
> STANDBY CONFIG:
> 
> 
> /var/lib/sling/install/install.standby/org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config:
> org.apache.sling.installer.configuration.persist=B"false"
> port=I"8023"
> secure=B"true"
> mode="standby"
> primary.host="john-proto.dev"
> interval=I"5"
> standby.readtimeout=I"43200000"
> 
> 
> PRIMARY CONFIG:
> 
> 
> /var/lib/sling/install/install.primary/org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.config:
> org.apache.sling.installer.configuration.persist=B"false"
> port=I"8023"
> secure=B"true"
> mode="primary"
> primary.allowed-client-ip-ranges=["0.0.0.0-255.255.255.255"]
> 
> 
> OAK S3 CONFIG:
> 
> 
> /var/lib/sling/install/oak_s3/org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore.config:
> accessKey=""
> secretKey=""
> s3Bucket="my-primary-bucket"
> s3Region="us-west-2"
> s3EndPoint="s3-us-west-2.amazonaws.com"
> connectionTimeout="120000"
> socketTimeout="120000"
> maxConnections="40"
> writeThreads="30"
> maxErrorRetry="10"
> 
> 
> JMX MBEANS
> 
> 
> STANDBY:
> 
> 
> #mbean = org.apache.jackrabbit.oak:id="fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6",name=Status,type="Standby":
> FailedRequests = 0;
> 
> SecondsSinceLastSuccess = 24269;
> 
> SyncStartTimestamp = 1478021232280;
> 
> SyncEndTimestamp = 1478078778813;
> 
> Status = running;
> 
> Running = true;
> 
> Mode = client: fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6;
> 
> 
> PRIMARY:
> 
> #mbean = org.apache.jackrabbit.oak:id=8023,name=Status,type="Standby":
> Status = got message;
> 
> Running = true;
> 
> Mode = primary;
> 
> #mbean = org.apache.jackrabbit.oak:id="Client fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6",name=Status,type="Standby":
> RemotePort = 44322;
> 
> RemoteAddress = 10.16.12.44;
> 
> LastSeenTimestamp = Wed Nov 02 13:48:59 UTC 2016;
> 
> TransferredSegments = 186780;
> 
> TransferredSegmentBytes = 1198693232;
> 
> TransferredBinaries = 5579;
> 
> TransferredBinariesBytes = 170312256398;
> 
> LastRequest = b.678851bb77bec68db82c6bda37aca8e763d8a32e#655084301;
> 
> Name = fa2b9a7c-fc69-4a0c-aa7e-b0cfc61bd1c6;
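
The SyncStartTimestamp, SyncEndTimestamp, and standby.readtimeout values quoted above are easier to compare once they are converted to readable times. The sketch below assumes the two MBean timestamps are milliseconds since the Unix epoch and that the read timeout is in milliseconds, which is consistent with the 2016-11-02T09:26:18+00:00 conversion and the 12-hour quiet period mentioned in the message. The helper name ms_to_utc is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime.

    Integer timedelta arithmetic avoids floating-point rounding.
    """
    return EPOCH + timedelta(milliseconds=ms)

# Values copied from the standby MBean dump and the standby config.
sync_start = ms_to_utc(1478021232280)
sync_end = ms_to_utc(1478078778813)
read_timeout = timedelta(milliseconds=43200000)

if __name__ == "__main__":
    print("SyncStartTimestamp:", sync_start.isoformat(timespec="seconds"))
    print("SyncEndTimestamp:  ", sync_end.isoformat(timespec="seconds"))
    print("sync duration:     ", sync_end - sync_start)
    print("read timeout:      ", read_timeout)
```

The same conversion can be applied to later MBean readings to see whether SyncEndTimestamp keeps coinciding with timeout messages in the log.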


Re: Read cache contents being continuously sent from primary to cold standby?

Posted by John Logan <Jo...@texture.com>.
Is there a different list that is better suited for this question?

Thanks!

John Logan

On Tue, 2016-11-08 at 21:43 +0000, John Logan wrote:
> Hi,
>
> Is there anyone who can provide guidance on a cold standby issue?
>
> It appears that my cold standby has made progress since my original
> email (below), but now I'm seeing a behavior that I didn't expect.
>
...