Posted to users@sling.apache.org by lancedolan <la...@gmail.com> on 2017/01/12 02:04:31 UTC

Not-sticky sessions with Sling?

The only example code I can find for authenticating to Sling uses the JEE
servlet container's "j_security_check", which then stores the authenticated
session in app-server memory. A load balancer without sticky sessions
enabled will cause an unstable experience for users, in which they are
suddenly unauthenticated.

-Does Sling already offer a mechanism for authenticating without storing
that JCR session in the servlet container session?
-Do any of you avoid sticky sessions without writing custom code?

I'm thinking that this problem *must* be solved already. Either there's an
AuthenticationHandler in Sling that I haven't found yet, or there's an
open-source example that somebody could share with me :)

If I must write this myself, is this the best place to start? 
https://sling.apache.org/documentation/the-sling-engine/authentication/authentication-authenticationhandler.html
https://sling.apache.org/apidocs/sling8/org/apache/sling/auth/core/spi/AuthenticationHandler.html
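
For a concrete starting point, here is a minimal sketch of the kind of stateless AuthenticationHandler discussed later in this thread: it reads a signed token from a cookie instead of relying on a servlet-container session. The cookie name and the verifyAndGetUserId helper are hypothetical placeholders, and wiring the returned AuthenticationInfo to an actual repository login (token login, custom login module, or similar) is left out.

// Sketch only: a stateless, cookie/token-based AuthenticationHandler.
// TOKEN_COOKIE and verifyAndGetUserId() are hypothetical placeholders.
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.auth.core.spi.AuthenticationHandler;
import org.apache.sling.auth.core.spi.AuthenticationInfo;
import org.osgi.service.component.annotations.Component;

@Component(
    service = AuthenticationHandler.class,
    property = { AuthenticationHandler.PATH_PROPERTY + "=/" })
public class StatelessTokenAuthenticationHandler implements AuthenticationHandler {

    private static final String TOKEN_COOKIE = "sling.auth.token"; // hypothetical name

    @Override
    public AuthenticationInfo extractCredentials(HttpServletRequest request,
                                                 HttpServletResponse response) {
        Cookie[] cookies = request.getCookies();
        if (cookies == null) {
            return null; // no credentials here; let other handlers try
        }
        for (Cookie cookie : cookies) {
            if (TOKEN_COOKIE.equals(cookie.getName())) {
                String userId = verifyAndGetUserId(cookie.getValue());
                if (userId == null) {
                    return AuthenticationInfo.FAIL_AUTH; // token present but invalid
                }
                // No servlet-container session is created; the token itself is the state.
                return new AuthenticationInfo("TOKEN", userId);
            }
        }
        return null;
    }

    @Override
    public boolean requestCredentials(HttpServletRequest request, HttpServletResponse response) {
        // A real handler would redirect to a login form or send a 401 challenge here.
        return false;
    }

    @Override
    public void dropCredentials(HttpServletRequest request, HttpServletResponse response) {
        Cookie cookie = new Cookie(TOKEN_COOKIE, "");
        cookie.setMaxAge(0); // expire the cookie to "log out"
        cookie.setPath("/");
        response.addCookie(cookie);
    }

    private String verifyAndGetUserId(String token) {
        // Placeholder: verify an HMAC/JWT-style signed token and return the user id,
        // or null if the signature or expiry check fails.
        return null;
    }
}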

... as usual, thanks guys. I realize I'm really dominating the mail list
lately. I've got a lot to solve :)





RE: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Jason Bailey wrote
> Couldn't this be simplified to simply stating that the sticky session
> cookie only lasts for x amount of seconds? 


WHOAAA!! 

Bertrand, probably hold the phone on everything else I suggested in my last
post - this solution is insanely simple, embarrassingly obvious in
hindsight, and the architects on our side can see no problem with it.

We actually had no idea that there is an expiration-in-seconds setting in the
AWS Elastic Load Balancer. We just checked the interface and found the setting.
Obviously in the good old days of F5 we could do whatever we wanted, but we're
married to AWS now and had no idea we could do this.

Thank you Jason, you might have just saved me an unsavory development
task, whilst helping me Keep It Simple, Stupid.




Re: Not-sticky sessions with Sling?

Posted by Felix Meschberger <fm...@adobe.com>.
Hi Lance

Ok, so given the situation as it is (an eventually consistent repo replicating the Oak login token, and no option to use sticky sessions), I suggest you go with something else, which does *not* need the repository for persistence.

This means you might want to investigate your own authentication handler or look at other options here at Sling, for example the old form-based login (not sure what its state is, though), or good ol' HTTP Basic (at some other cost, such as no support for "logout").

Regards
Felix



Re: Not-sticky sessions with Sling?

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Lance,

On Wed, Jan 18, 2017 at 11:21 PM, lancedolan <la...@gmail.com> wrote:
> ...Bertrand, I'd feel selfish taking you up on your offer to build this for me.
> Yet I'd be a fool to not at least partner with you to get it done. Should we
> correspond outside this mail list?...

I understand you're probably looking at a different solution now but
just wanted to clarify this: the Sling dev list would be the place to
discuss such things, no need for off-list communications.

-Bertrand

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Chetan is making things crystal clear for us. 

Our next steps are:

1) Learn what the MAXIMUM "inconsistency window" could be. 
Is it possible to delay past 5 seconds? 10 seconds? 60? What determines
this? Only server load? I'll ask on the JCR forum and also experiment. 

2) Design and test a solution almost exactly as Bertrand described.
Sling responds to POST/PUT/DELETE with a JCR revision. Sling will behave
differently when the request contains a JCR revision more recent than its
current one. I have no idea what I'm getting into or how hard this will be.

Bertrand, I'd feel selfish taking you up on your offer to build this for me.
Yet I'd be a fool to not at least partner with you to get it done. Should we
correspond outside this mail list? 
Perhaps you could point me to the files you would edit to get this done and
I could try to do it myself? I imagine a solution where you can configure,
through OSGi, whether Sling will do one of the following:

A) Ignore JCR revision in Request, and function as it does today (Default
setting)
B) Block until it has caught up to JCR revision in Request
C) Call some other custom handler? This way we can do custom things like
send a redirect to enhance the user experience during a block. In a product
like ours, 5 or 10 second blocks aren't acceptable without user feedback. 

I also don't know how to determine the current Sling instance's Revision, or
how to compute whether one revision is "more recent" than another.

---------

Responding to a couple other minor points:


Felix Meschberger-3 wrote
> I suggest you go with something else, which does *not* need the repository
> for persistence. This means you might want to investigate your own
> authentication handler ...

Thank you Felix :) I've actually done this work recently and it's working
great! We have "stateless" authentication now, but are now dealing with the
unacceptable inconsistency that Chetan warned about.
That's the question on the table: in a write-heavy application,
how do we provide a "read-your-writes" consistent experience on an
eventually consistent solution (a Sling cluster), when traditional
sticky sessions are not a valid solution because the userbase is large
enough to demand server scaling several times throughout the day?


chetan mehrotra wrote
> I can understand issue around when existing Sling server is removed
> from the pool. However adding a new instance should not cause existing
> users to be reassigned

When adding an instance, we purposely invalidate all sticky sessions and
users get re-assigned to a new Sling instance, so that the new server
actually improves performance.
Imagine a farm of 4 app servers that has been SLAMMED and isn't performing
well. Adding 1 or 100 new servers to that farm won't improve performance if
every user is "stuck" to the previous 4 servers.
If we don't do this invalidation and re-assignment on scaling up, it can
potentially take hours for a scale-up to positively impact an overloaded
cluster.


Bertrand Delacretaz wrote
> But Lance could patch [1] to experiment with different values, right?
> ....
> [1]
> http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/DocumentNodeStore.java

Thank you for pointing me to the code Bertrand :) Given the new information from
Chetan, I'm losing interest in changing that value. Perhaps setting
asyncDelay to 0 or some small number will cause it to perform slower but be
more consistent...
However, my tentative assessment is that the interval would just be
"checked" more often, but it will also get skipped more often, due to "local
cache invalidation, computing the external changes for observation" as
Chetan put it.
I would love to be wrong about this and I'll ask on the JCR forum.




Re: Not-sticky sessions with Sling?

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Wed, Jan 18, 2017 at 12:48 PM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> ...there is a "asyncDelay" setting in DocumentNodeStore which
> defaults to 1 sec. Currently its not possible to modify it via OSGi
> config though....

But Lance could patch [1] to experiment with different values, right?
And then replace the oak-core bundle in Sling, starting with the right
version for patching, the one his Sling instance currently uses.

-Bertrand

[1] http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/DocumentNodeStore.java
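
Before patching, one experiment-friendly option may be to construct the store programmatically: DocumentMK.Builder appears to expose a setAsyncDelay(int) setter (worth verifying against the oak-core version actually deployed). A minimal sketch under that assumption is below; note that a running Sling instance normally gets its DocumentNodeStore from Oak's own OSGi service, which is exactly the configuration gap Chetan mentions.

// Sketch only: building a DocumentNodeStore with a shorter background-read delay.
// Assumes DocumentMK.Builder.setAsyncDelay(int) exists in the deployed oak-core version.
import com.mongodb.DB;
import com.mongodb.MongoClient;

import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;

public class ShortAsyncDelayStore {

    public static DocumentNodeStore create() {
        DB db = new MongoClient("localhost", 27017).getDB("sling");
        return new DocumentMK.Builder()
                .setMongoDB(db)     // older driver-style API; adjust to your Oak version
                .setAsyncDelay(250) // default is 1000 ms; 0 disables the periodic read
                .getNodeStore();
    }
}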

RE: Not-sticky sessions with Sling?

Posted by Jason Bailey <Ja...@sas.com>.
Couldn't this be simplified to simply stating that the sticky session cookie only lasts for x amount of seconds? 

I like this idea, but I'm not sure this is really a Sling solution rather than an API management or proxy solution. When you take an instance out of the pool, you would need to state that it's not available for new requests, but still honor it for x amount of time for those with the sticky session cookie that says they should go there.

-Jason



Re: Not-sticky sessions with Sling?

Posted by Chetan Mehrotra <ch...@gmail.com>.
> Each time we remove an
> instance, those users will go to a new Sling instance, and experience the
> inconsistency. Each time we add an instance, we will invalidate all
> stickiness and users will get re-assigned to a new Sling instance, and
> experience the inconsistency.

I can understand the issue around when an existing Sling server is removed
from the pool. However, adding a new instance should not cause existing
users to be reassigned.

Now to your queries
---------------------------

> 1) When a brand new Sling instance discovers an existing JCR (Mongo), does it automatically and immediately go to the latest head revision?

It sees the latest head revision

>  Increasing load increases the number of seconds before a "sync," however it's always near-exactly a second interval.

Yes, there is an "asyncDelay" setting in DocumentNodeStore which
defaults to 1 sec. Currently it's not possible to modify it via OSGi
config, though.

>- What event is causing it to "miss the window" and wait until the next 1 second synch interval?

This periodic read also involves some other work, like local cache
invalidation, computing the external changes for observation, etc., which
causes this time to increase. The more changes are done, the more time is
spent on that kind of work.

Stickyness and Eventual Consistency
-------------------------------------------------

There are multiple levels of eventual consistency [1]. If we go for
sticky sessions then we are aiming for "session consistency". However,
what we require in most cases is read-your-writes consistency.

We can discuss ways to do that efficiently with the current Oak
architecture. Something like this is best discussed on oak-dev, though.
One possible approach could be to use a temporarily issued sticky cookie.
Under this model:

1. Sling cluster maintains a cluster wide service which records the
current head revision of each cluster node and computes the minimum
revision of them.

2. A Sling client (web browser) is free to connect to any server
until it performs a state change operation like POST or PUT.

3. If it performs a state change operation then the server which
performs that operation issues a cookie which is set to be sticky, i.e.
the load balancer is configured to treat that as the cookie used to determine
stickiness. So from now on all requests from this browser would go to the
same server. This cookie, let's say, records the current head revision.

4. In addition the Sling server would constantly get notified of the
minimum revision which is visible cluster-wide. Once the revision recorded
in #3 becomes older than that cluster-wide minimum, it removes the cookie
on the next response sent to that browser.

This state can be used to determine whether a server is safe to be taken out
of the cluster or not.

This is just a rough thought experiment which may or may not work and
would require broader discussion!


Chetan Mehrotra
[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
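
As a rough illustration, steps 3 and 4 above could look something like the servlet filter sketched below. ClusterRevisionTracker is a hypothetical interface standing in for the cluster-wide revision service of step 1 (nothing like it exists in Sling or Oak today), revisions are treated as plain numbers even though real Oak revisions are strings that need proper comparison, and response-commit ordering is ignored.

// Sketch only: steps 3 and 4 as a servlet filter. ClusterRevisionTracker is a
// hypothetical service; real Oak revisions are not plain longs.
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RevisionStickinessFilter implements Filter {

    private static final String STICKY_COOKIE = "sling.sticky.rev"; // hypothetical name

    // Hypothetical abstraction over step 1: the newest revision guaranteed to be
    // visible on every cluster node, plus this node's current head revision.
    interface ClusterRevisionTracker {
        long minimumVisibleRevision();
        long localHeadRevision();
    }

    private final ClusterRevisionTracker tracker;

    public RevisionStickinessFilter(ClusterRevisionTracker tracker) {
        this.tracker = tracker;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String method = request.getMethod();
        boolean stateChanging = "POST".equals(method) || "PUT".equals(method)
                || "DELETE".equals(method);

        chain.doFilter(request, response);

        if (stateChanging) {
            // Step 3: after a write, pin the browser to this node and record the revision.
            Cookie sticky = new Cookie(STICKY_COOKIE,
                    Long.toString(tracker.localHeadRevision()));
            sticky.setPath("/");
            response.addCookie(sticky);
        } else {
            // Step 4: once every node has caught up to the recorded revision, drop the pin.
            Cookie existing = findStickyCookie(request);
            if (existing != null
                    && tracker.minimumVisibleRevision() >= Long.parseLong(existing.getValue())) {
                existing.setMaxAge(0);
                existing.setPath("/");
                response.addCookie(existing);
            }
        }
    }

    private Cookie findStickyCookie(HttpServletRequest request) {
        if (request.getCookies() == null) {
            return null;
        }
        for (Cookie c : request.getCookies()) {
            if (STICKY_COOKIE.equals(c.getName())) {
                return c;
            }
        }
        return null;
    }

    @Override
    public void init(FilterConfig filterConfig) { /* no-op in this sketch */ }

    @Override
    public void destroy() { /* no-op in this sketch */ }
}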

Re: Not-sticky sessions with Sling?

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Lance,

On Wed, Jan 18, 2017 at 2:43 AM, lancedolan <la...@gmail.com> wrote:
> ...It pretty much always takes 1 second exactly for a Sling instance to get the
> latest revision, and thus the latest data. When not 1 second, it takes 2
> seconds exactly....

I don't know enough about Oak internals to give you a precise answer
here, but this 1-second increment vaguely rings a bell, based on
discussions with Chetan when working on our adaptTo demo [1].

Chetan is one of the few Sling committers who's deep into Oak as well;
hopefully he can comment on this, but otherwise the best would be to ask on
the Oak dev list about that specific issue, as I think this delay is
entirely Oak dependent.

Apart from that, handling such things at the client level could be
valid - as you say, if you had a way to send the current revision
number to the client (probably in an opaque way) it could add a header
to its next request saying that it wants to see that revision, and
Sling/Oak could block that request until that revision is available. I
suppose a one or two second delay that happens only rarely is
acceptable if it makes your system easier to scale, and hopefully that
1-second cycle can be configured to be shorter. I'm willing to help
make this functionality available if you don't find a better way, as I
think it can be generally useful.

-Bertrand

[1] https://github.com/bdelacretaz/sling-adaptto-2016

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
lancedolan wrote
> I must know what determines the duration of this revision catch-up time
> ... 

While I don't know where to look in the source code to answer this, I did run a
very revealing experiment.

It pretty much always takes 1 second exactly for a Sling instance to get the
latest revision, and thus the latest data. When not 1 second, it takes 2
seconds exactly. If you increase load on the server, the likelihood of
taking 2 seconds increases, and you also begin to see it take exactly 3
seconds in some rare cases. Increasing load increases the number of seconds
before a "sync," however it's always near-exactly a second interval.

It seems impossible for this to be a natural coincidence - I smell a setting
somewhere (or perhaps a hardcoded value) which is telling Sling to check the
latest JCR revision at 1-second intervals. When that window can't be hit, it
checks on the next second interval, and so on.

Is there a Sling dev who can tell me whether this is configurable? I have a
load of questions about this discovery:

- Am I wrong? (I'll be shocked)
- Perhaps we can speed it up? 
- What event is causing it to "miss the window" and wait until the next 1
second synch interval?
- If we do decrease the interval, will that just increase the likelihood of
taking more intervals anyhow?
- Is there a maximum number of 1-second intervals before the thing just
gets the latest??

progress.




RE: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Thissss is tempting, but I know in my dev instinct that we won't have the
time to solve all the unsolved problems in that effort. Thank you for
suggesting it though :)




RE: Not-sticky sessions with Sling?

Posted by Stefan Seifert <ss...@pro-vision.de>.
not sure if this is of any help for your usecase - but do you need the full JCR features and complexity underneath sling, or only a sling cluster + storage in mongodb?

if you need only basic resource read and write features via the Sling API you might bypass JCR completely and directly use a NoSQL resource provider for MongoDB, see [1] and [2].

but please be aware that:
1. the code might not be production-ready for heavy usage yet (not sure how much it is used)
2. it does not add any support for cluster synchronization etc.; if your multiple nodes write to the same path you have to take care of concurrency yourself
3. the code is not yet migrated to the latest resourceprovider SPI from sling 9-SNAPSHOT, but should still run with it
4. it has no built-in support for ACLs etc., you have to take care of this yourself

this resource provider is only a thin layer above the MongoDB java client, so it should be possible to have full control over which mongodb features are used in which way.

stefan

[1] http://sling.apache.org/documentation/bundles/nosql-resource-providers.html
[2] https://github.com/apache/sling/tree/trunk/contrib/nosql



Re: Not-sticky sessions with Sling?

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Tue, Jan 17, 2017 at 10:49 PM, lancedolan <la...@gmail.com> wrote:
> ...I've got almost every dev in the office all
> excited about this now haha....

This needs to make the New York Times front page: "almost everyone in
a developer's office excited about the same thing, which is not a
JavaScript library" ;-)

-Bertrand

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Bertrand Delacretaz wrote
> That would be a pity, as I suppose you're starting to like Sling now ;-)

Mannnn you have no idea haha! I've got almost every dev in the office all
excited about this now haha. However, it seems our hands are tied.

I wrote local consistency test scripts which POST a property and immediately GET
it back, checking for consistency (a sketch of such a probe follows the results).

Results on a 2-member Sling cluster and localhost mongodb:

-0% consistency with 50ms delay between POST and GET
-35% to 50% consistency with 1 second delay between POST and GET 
-90% consistency with 2 second delay
-98% to 100% consistency after 3 seconds delay.
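
For reference, here is a minimal sketch of such a probe, assuming Java 11's HttpClient, admin Basic auth, and the default Sling POST servlet behaviour (a form-encoded POST writes a property, a GET on <path>.json reads it back); the hosts, path, and credentials are placeholders.

// Sketch only: POST a property to one node, wait, GET it back from another node.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ConsistencyProbe {

    private static final String WRITE_HOST = "http://sling-node-1:8080"; // placeholder
    private static final String READ_HOST  = "http://sling-node-2:8080"; // placeholder
    private static final String PATH       = "/content/consistency-test";
    private static final String AUTH       = "Basic "
            + Base64.getEncoder().encodeToString("admin:admin".getBytes());

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long delayMillis = 50; // vary this to reproduce the results above
        int runs = 100;
        int consistent = 0;

        for (int i = 0; i < runs; i++) {
            String value = "run-" + i + "-" + System.nanoTime();

            // Write the property on node 1 via the default Sling POST servlet.
            HttpRequest post = HttpRequest.newBuilder(URI.create(WRITE_HOST + PATH))
                    .header("Authorization", AUTH)
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("testProp=" + value))
                    .build();
            client.send(post, HttpResponse.BodyHandlers.discarding());

            Thread.sleep(delayMillis);

            // Read it back on node 2 and check whether the new value is visible yet.
            HttpRequest get = HttpRequest.newBuilder(URI.create(READ_HOST + PATH + ".json"))
                    .header("Authorization", AUTH)
                    .GET()
                    .build();
            String body = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
            if (body.contains(value)) {
                consistent++;
            }
        }
        System.out.printf("%d/%d reads saw the latest value after %d ms%n",
                consistent, runs, delayMillis);
    }
}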

So yes, you are all correct. 

True, we could use sticky sessions to avoid inconsistency... but only until
we scale our server farm up or down, which we do daily... So sticky
sessions don't really solve anything for us.

If you already understand how scaling nullifies the benefit of sticky
sessions, you can skip past this paragraph and move on to the next:
Each time we scale, users will lose their "stickiness." We have thousands of
write users ("authors"). Hundreds concurrently. Compare that to typical AEM
projects, which have fewer than 10 authors, and rarely more than 1 concurrently
(I've got several global-scale AEM implementations under my belt). For us,
it's a requirement that we add or remove app servers multiple times per day,
optimizing between AWS costs and performance. Each time we remove an
instance, those users will go to a new Sling instance, and experience the
inconsistency. Each time we add an instance, we will invalidate all
stickiness and users will get re-assigned to a new Sling instance, and
experience the inconsistency. If we don't do this invalidation and
re-assignment on scaling up, it can potentially take hours for a scale-up
to positively impact an overloaded cluster where all users are permanently
stuck to their current app server instance.

As you can see, we need to deal with the inconsistency problem, regardless
of whether we use sticky sessions.

I have some ideas, but none are appealing, and I would benefit greatly from
your knowledge:

1) Race condition
If this delay to "catch up" to the latest revision is mostly predictable,
doesn't grow as the repo grows in size, and doesn't change due to other
variables, we can measure it and then account for it reliably with
user feedback (a loading screen or whatever). This *might* be a race condition
we can live with.
My results above show as much as 3 or 4 seconds to "catch up." I must know
what determines the duration of this revision catch-up time. Is it a
function of repo size? Does the delay grow as the repo size grows? Does the
delay grow as usage increases? Does the delay grow as the number of Sling
instances in the cluster grows? Does the delay grow as network latency grows?
(I'm testing all on the same machine with practically no latency compared to
a distributed production deployment.) Is there any Sling dev, familiar
with the algorithm that Sling uses to select a "newer" revision,
who could answer this for me? ... perhaps it's just polling on a predictable
time period! :)

2) Browser knows what revision it's on.
The browser could know what JCR Revision it's on, learning that revision
after every POST or PUT, perhaps in some response header. When its future
requests are sent to a Sling instance on an older revision, it could wait
until that instance "catches up." This sounds like a horrible example of
client code operating on knowledge of underlying implementation details, and
we're not at all excited about the chaos to implement it. That being said,
can we programmatically check the revision that the current Sling instance
is reading from?

3) "Pause" during scale-up or scale-down.
Each time we add or remove a sling instance, all users experience a "pause"
screen while their new Sling Instance "catches up." This is essentially the
same as the race condition in #1, except we'd constrain users to only
experience this when we scale up or down. However, we are *extremely*
unhappy to impact our users just because we're scaling up or down,
especially when we must do so frequently. 

Anybody have any other ideas?

Other questions:

1) When a brand new Sling instance discovers an existing JCR (Mongo), does
it automatically and immediately go to the latest head revision? Or is there
some progression through the revisions, and it takes time for the Sling
instance to catch up to the latest?

2) Is there any reason, BESIDES JCR CONSISTENCY, why a Sling cluster must be
deployed with sticky-sessions? What other problems would we introduce by not
having sticky sessions?

I seem to have used this email to track my own thoughts more than anything;
my sincere thanks if you've taken the time to read the whole thing.





Re: Not-sticky sessions with Sling?

Posted by Jörg Hoh <jh...@googlemail.com>.
My bad:
CAP = consistency, availability and partition-tolerance.

Jörg



-- 
Cheers,
Jörg Hoh,

http://cqdump.wordpress.com
Twitter: @joerghoh

Re: Not-sticky sessions with Sling?

Posted by Jörg Hoh <jh...@googlemail.com>.
HI Lance,

2017-01-17 19:19 GMT+01:00 lancedolan <la...@gmail.com>:

> ...
>
> If "being eventual" is the reason we can't go stateless, then how is adobe
> getting away with it if we know their architecture is also eventual?? What
> am I missing? I understand that the documentation I linked is a distributed
> segment store architecture and mine is a share documentstore datastore, but
> what is the REASON for them allowing a stateless (not sticky) architecture,
> if the REASON is not eventual consistency ? Both architectures are
> eventual.
>
>
It depends a lot on your use case. For example Facebook is also eventually
consistent (I sometimes think that the timeline is different on every
reload). Also the CAP theorem says that you can choose only 2 of
"consistency, atomicity and partition-tolerance".

In the case of independent segment stores (in Adobe speak: publish
instances, stateless load balancing) you have a lot of individual requests
from multiple users. So you as an individual cannot decide whether another user
gets the very same content as you. And as long as this eventual consistency is
not causing annoyances and friction on the end-user side (e.g. you hit an
intra-site link, which results in a 404), I would not consider it a
problem. And these problems occur so rarely that many (including me and
many other users of AEM) ignore it for daily work. But this is only valid
for a read-only use case!

The situation is different on the clustered DocumentNodeStore (in Adobe
speak: authoring, sticky connections). Due to write skew, write operations
will be visible with a small delay on all cluster nodes, but there it
matters that a user sees the changes he just made. And to overcome this
limitation with the write skew, the recommendation is to use
sticky sessions.



Jörg


-- 
Cheers,
Jörg Hoh,

http://cqdump.wordpress.com
Twitter: @joerghoh

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Ok, first of all - I GENUINELY appreciate the heck out of your time and
patience!!

... and THIS is really interesting:

If THIS is true:


chetan mehrotra wrote
> If you are running a cluster with Sling on Oak/Mongo then sticky 
> sessions would be required due to eventual consistent nature of 
> repository.

and THIS is true:


chetan mehrotra wrote
> Cluster which involves multiple datastores (tar) 
> is also eventually consistent. 

Then why is Adobe recommending that its multi-million-dollar projects go
stateless with the encapsulated token here, if those architectures are
*also* eventual:
https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

If "being eventual" is the reason we can't go stateless, then how is adobe
getting away with it if we know their architecture is also eventual?? What
am I missing? I understand that the documentation I linked is a distributed
segment store architecture and mine is a share documentstore datastore, but
what is the REASON for them allowing a stateless (not sticky) architecture,
if the REASON is not eventual consistency ? Both architectures are eventual.

Again, thanks for your patience and sticking with me on this one... whoa
pun!




Re: Not-sticky sessions with Sling?

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Tue, Jan 17, 2017 at 1:46 AM, lancedolan <la...@gmail.com> wrote:
> It's ironic that the cluster which involves multiple datastores (tar), and
> thus should have a harder time being consistent, is the one that can
> accomplish consistency..

That's not how it is. A cluster which involves multiple datastores (tar)
is also eventually consistent. Changes are either "pushed" to each tar
instance via some replication, or changes done on one of the cluster
nodes surface on the others via reverse replication. In either case the
change done is not immediately visible on other cluster nodes.

> More importantly, is it a function of Repo size, or repo activity?
> If the repo grows in size (number of nodes) and grows in use (number of
> writes/sec) does this impact how frequently Sling Cluster instances grab the
> most recent revision?

It's somewhat related to the number of writes and is not dependent on repo size.

> Less importantly... Myself and colleagues are really curious as to why
> jackrabbit is implemented this way. Is there a performance benefit to being
> eventually, when the shared datastore is actually consistent? What's the
> reasoning for not always hitting the latest data?  Also... Is there any way
> to force all reads to read the most recent revision, perhaps through some
> configuration?

That's a question best suited for discussion on the oak-dev mailing list
(oak-dev@jackrabbit.apache.org).

Chetan Mehrotra

Re: Not-sticky sessions with Sling?

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Mon, Jan 16, 2017 at 9:16 PM, lancedolan <la...@gmail.com> wrote:
> ...this probably shoots down our entire Sling
> proof of concept project...

That would be a pity, as I suppose you're starting to like Sling now ;-)

> ...Is there any way
> to force all reads to read the most recent revision, perhaps through some
> configuration?...

As Chetan says that's a question for the Oak dev list, but from a Sling
point of view having that option would be useful IMO.

If the clustered Sling instances can get consensus on what the most
recent revision is (*), having the option for Oak to block until it
sees that revision sounds useful in some cases. That should probably
happen either on opening a JCR Session or when Session.refresh() is
called.

-Bertrand

(*) which might require an additional consensus mechanism, maybe via
Mongo if that's what you're using?

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
This is really disappointing for us. Through this revisioning, Oak has turned
a datastore that is consistent by default into a datastore that is not :p
It's ironic that the cluster which involves multiple datastores (tar), and
thus should have a harder time being consistent, is the one that can
accomplish consistency... and the cluster that involves a single shared
source of truth (mongo/rdbms), and should have the easiest time being
consistent, is not. Hehe. Ahh this probably shoots down our entire Sling
proof of concept project. 

Our next step is to measure the consequences of moving forward with
Sling+Oak+Mongo and not-sticky sessions. I'm going to try to test this, and
get an empirical answer, by deploying to some AWS instances. I'll develop a
custom AuthenticationHandler so that authentication is stateless and then
we'll try to see how bad the "delay" might be. However, I would love a
theoretical answer as well, if you've got one :) 


chetan mehrotra wrote
> ... sticky sessions would be required due to eventual consistent nature of
> repository.

Okay, but if we disable sticky sessions ANYHOW (because in our environment we
must), how much time delay are we talking, do you think, in realistic
practice? We might be able to solve this by giving user feedback that covers
up for the sync delay. When a user clicks save, they might just go to a
different screen, providing enough time for things to sync up. It might be a
race condition, but that might be acceptable if we can choose that
architecture on good information. I think that, in theory, the answer to
"worst case scenario" for eventual consistency is always "forever," but
really... How long could a Sling instance take to get to the latest
revision? More importantly, is it a function of Repo size, or repo activity?
If the repo grows in size (number of nodes) and grows in use (number of
writes/sec) does this impact how frequently Sling Cluster instances grab the
most recent revision?

Less importantly... my colleagues and I are really curious as to why
Jackrabbit is implemented this way. Is there a performance benefit to being
eventually consistent, when the shared datastore is actually consistent? What's the
reasoning for not always hitting the latest data? Also... Is there any way
to force all reads to read the most recent revision, perhaps through some
configuration? A performance cost for this might be tolerable.




Re: Not-sticky sessions with Sling?

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Sat, Jan 14, 2017 at 2:08 AM, lancedolan <la...@gmail.com> wrote:
> To be honest, however, I don't understand fully
> what you said in your last post and I also know that AEM 6.1 can do what I'd
> like, which is really just Sling+Oak. If they can do it, I don't understand
> why we can't.
>
> ref:
> https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

That link talks about scaling of publish instances, which are in most
cases based on a Segment/Tar setup and hence do not form a "homogeneous"
cluster. Each cluster node has a separate segment store and only
potentially shares the DataStore.

> B) There are separate versions of that property stored in Mongo (perhaps
> this is what you meant by the word revision) and it's possible for a
> sling-instance to be reading an old version of a property from Mongo.

That's a bit closer to what's happening. [1] talks about the data model
being used for persistence in Mongo/RDB. For example, if there is a
property 'prop' on the root node, i.e. /@prop, then it's stored in somewhat
the following form in Mongo:
{
  "_id" : "0:/",
  "prop" : {
    "r13fcda91720-0-1" : "\"foo\"",
    "r13fcda919eb-0-1" : "\"bar\""
  }
}

The value for this property is a function of the revision at which the read
operation is performed. So the 'prop' value is 'foo' at rev r1 and 'bar'
at rev r2. These revisions are based on timestamps. Now each cluster
node also has a "head" revision. So any read call on that cluster node
would only see those values whose revisions are <= the "head" revision.
This head revision is updated periodically via a background read. Due to
this snapshot isolation model you see the write skew [2].

Chetan Mehrotra
[1] https://jackrabbit.apache.org/oak/docs/nodestore/documentmk.html
[2] https://jackrabbit.apache.org/oak/docs/architecture/transactional-model.html

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Alright, this is a deal breaker for our business (if Sling absolutely
requires sticky sessions). I hope you're not offended that I'm not 100%
convinced yet. I understand you do development on the Sling project and are
well qualified on the topic. To be honest, however, I don't fully understand
what you said in your last post, and I also know that AEM 6.1 can do what I'd
like, which is really just Sling+Oak. If they can do it, I don't understand
why we can't.

ref:
https://docs.adobe.com/docs/en/aem/6-1/administer/security/encapsulated-token.html

I'd hate to throw away all the awesome progress we've made with Sling so far
when I know that AEM, which is just Sling + Jackrabbit, can accomplish
app-server-agnostic authentication, and thus avoid sticky sessions.

Although I don't understand this "head revision" that you've described, and
that's inexperience on my part, I am confident that you're telling me that
when there is only one Mongo instance in existence, and all Sling instances
get data from it, then directly after "sling-instance-1" writes
"myProperty=myValue" to the JCR, "sling-instance-2" could get the
value of "myProperty" from somewhere else - some old value. This only seems
possible to me if one of the following is true:

A) the Sling instances are caching values from Mongo (perhaps Sling or Oak
is doing that?) 
B) There are separate versions of that property stored in Mongo (perhaps
this is what you meant by the word revision) and it's possible for a
sling-instance to be reading an old version of a property from Mongo.
C) Mongo isn't consistent.

We know from the Mongo documentation that C isn't true - Mongo is consistent
when reading from the primary of a replica set. So it must be that A or B is
going on? And if so, what is your guess about how AEM 6, which is Sling+Oak,
avoids this pitfall when it very clearly supports the stateless
architecture (i.e. not sticky) that I'm planning?





Re: Not-sticky sessions with Sling?

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Jan 13, 2017 at 12:20 AM, lancedolan <la...@gmail.com> wrote:
> In an architecture with
> only one Mongo instance, the moment one instance writes to the JCR, another
> instance will read the same data and agree consistently. It seems to me that
> the JCR state is strongly consistent.

No. The DocumentNodeStore in each Sling node that is part of the cluster
periodically polls the backend root node state revision. If any change
is detected it updates its head revision to match the last seen root node
revision from Mongo and then generates an external observation event. So
any change done on cluster node N1 would be _visible sometime later_ on
cluster node N2.

So if you create a node on N1 and immediately try to read it on N2,
that read may fail, as the change might not be "visible" on the
other cluster node yet. Any new session opened on N2 would have its
base revision set to the current head revision of that cluster node,
which may be older than the current head revision in Mongo.

However, writes would still be consistent. So if you modify the same
property concurrently from different cluster nodes, one of the
writes would succeed and the other would fail with a conflict.

Some details are provided at [1]

Chetan Mehrotra
[1] https://jackrabbit.apache.org/oak/docs/architecture/transactional-model.html

Re: Not-sticky sessions with Sling?

Posted by lancedolan <la...@gmail.com>.
Chetan,

I'd like to confirm to what degree that is true for our proposed
architecture. It seems that only the OSGi configurations and bundles would
be "eventually consistent." It seems the only "state" that is stored in
Sling instances is OSGi configurations and OSGi bundles. Everything else is
in the JCR, which Mongo can provide as strongly consistent (I believe).
Consider this example and correct me where I'm wrong. I'd hate to shoot
myself in the foot with bad assumptions.

Imagine 3 Sling instances all talking to 1 Mongo instance. In this case, it
seems to me that all repo state is captured in a single Mongo instance,
which is consistent by default; eventual consistency only happens if
you hit secondary members of a Mongo replica set. In an architecture with
only one Mongo instance, the moment one instance writes to the JCR, another
instance will read the same data and agree consistently. It seems to me that
the JCR state is strongly consistent.

However, OSGi configurations seem to propagate to each other through the JCR
only eventually... Additionally, when we deploy a new OSGi bundle to the JCR
(in an install directory or whatever), those seem to only eventually
propagate to all Sling instances. I'm not totally sure that these are
"eventual," but it seems like the only place that state will only be
"eventual" in this architecture.

So, as long as we're cool with OSGi configurations and bundle installations
being eventual, everything else, stored in the JCR, should be strongly
consistent, right?

And then, I believe we can even scale the Mongo instances into a replica set
for better availability and we'll still be strongly consistent so long as
all Sling instances only read from the primary member of the replica set:
[1]. 

Thanks for your time and thoughts dude!

[1] https://www.mongodb.com/faq#consistency




Re: Not-sticky sessions with Sling?

Posted by Chetan Mehrotra <ch...@gmail.com>.
If you are running a cluster with Sling on Oak/Mongo then sticky
sessions would be required due to the eventually consistent nature of the
repository. Changes done on one cluster node would not be immediately
visible on other cluster nodes. Hence, to provide a consistent user
experience, sticky sessions would be required.
Chetan Mehrotra

