Posted to user@couchdb.apache.org by Carlos Alonso <ca...@cabify.com> on 2017/06/13 17:34:25 UTC

Error getting security objects when configuring new replica

Hi guys!

I'm still trying to understand how CouchDB clusters work and to build a
compelling administration tool that covers basic operations such as adding
a node to the cluster, moving a shard from one node to another, and so on.
It is a work in progress but already open sourced here:
https://github.com/cabify/couchdb-admin

While testing the scale-out procedure (add a node, make it replicate some
shards, remove the shards from the previous location), I've seen the
following error:

[error] 2017-06-13T15:58:22.299140Z couchdb@couch-2.couchdb2-replica-admin <0.2214.3> -------- Error getting security objects for <<"testdb3">>: {error,no_majority}


The error mentions not only my testdb3 but also internal databases such as
_global_changes. That is, I was scaling out testdb3 only, yet errors
appeared for both testdb3 and _global_changes.


The error appears when I configure a new node as a replica for an
existing shard (by adding it to the by_node and by_range sections of the
document at _dbs/testdb3).
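
As a rough illustration of the edit described above, the following Python
sketch adds a node to a shard map of the kind stored in the _dbs database.
It assumes the document layout described in the CouchDB shard management
documentation (by_node, by_range and changelog fields). The function name,
node names and shard range are made up for the example, and the code only
manipulates a dict in memory; it does not talk to a cluster.

```python
import copy


def add_replica(shard_map, node, shard_range):
    """Return a copy of shard_map with `node` added as a replica of `shard_range`.

    Assumed layout: "by_node" maps node -> list of ranges, "by_range" maps
    range -> list of nodes, and "changelog" records [action, range, node].
    """
    updated = copy.deepcopy(shard_map)
    changed = False

    ranges = updated.setdefault("by_node", {}).setdefault(node, [])
    if shard_range not in ranges:
        ranges.append(shard_range)
        changed = True

    nodes = updated.setdefault("by_range", {}).setdefault(shard_range, [])
    if node not in nodes:
        nodes.append(node)
        changed = True

    # Only log the change if the map was actually modified.
    if changed:
        updated.setdefault("changelog", []).append(["add", shard_range, node])
    return updated


# Hypothetical shard map with a single range hosted on one node.
example_map = {
    "_id": "testdb3",
    "changelog": [["add", "00000000-ffffffff", "couchdb@couch-1"]],
    "by_node": {"couchdb@couch-1": ["00000000-ffffffff"]},
    "by_range": {"00000000-ffffffff": ["couchdb@couch-1"]},
}

new_map = add_replica(example_map, "couchdb@couch-2", "00000000-ffffffff")
```

The updated document would then have to be written back to the _dbs
database on one node for the cluster to pick it up; that step is left out
here on purpose.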


The error appears every few seconds in the new replica's logs, once for each
of the other replicas (3 for testdb3 and 2 for _global_changes at that
time). It also appears in the other nodes' logs, but just once every few
seconds.


The error stops appearing once I remove the maintenance_mode flag on the
new replica, which I do once pending_changes messages stop appearing there.
(I enable that flag before configuring the node as a replica so that it
doesn't participate in reads. Kudos to Adam Kocoloski for the advice.)
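
For reference, toggling the flag can be sketched in Python as below. This
assumes the per-node configuration endpoint
/_node/{node}/_config/couchdb/maintenance_mode from the CouchDB HTTP API,
whose value is a JSON string. The helper name, host and node name are
hypothetical, and the code only builds the request; sending it would also
need admin credentials.

```python
import json
import urllib.request
from urllib.parse import quote


def maintenance_mode_request(base_url, node, enabled):
    """Build (but do not send) the PUT request that sets maintenance_mode.

    The config value is sent as a JSON string: "true" or "false".
    """
    url = "{}/_node/{}/_config/couchdb/maintenance_mode".format(
        base_url.rstrip("/"), quote(node, safe="@")
    )
    body = json.dumps("true" if enabled else "false").encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and node name; nothing is sent until urlopen() is called.
req = maintenance_mode_request("http://localhost:5984", "couchdb@couch-2", True)
```

Passing False instead of True builds the request that clears the flag again.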

I think the error prevents the catch_up process from working properly, as
my consistency checks fail when this error appears during the procedure
(it doesn't happen every time).

I've seen it happen both when the new replica node was completely empty
and when it had the data preloaded (via rsync or because it had
previously been a replica).


I hope this much text helps you out :)

Thanks!


-- 
[image: Cabify - Your private Driver] <http://www.cabify.com/>

*Carlos Alonso*
Data Engineer
Madrid, Spain

carlos.alonso@cabify.com

Try it for free with this code
#CARLOSA6319 <https://cabify.com/i/carlosa6319>
[image: Facebook] <http://cbify.com/fb_ES>[image: Twitter]
<http://cbify.com/tw_ES>[image: Instagram] <http://cbify.com/in_ES>[image:
Linkedin] <https://www.linkedin.com/in/mrcalonso>

-- 
This message and any attached file are intended exclusively for the 
addressee, and it may be confidential. You are not allowed to copy or 
disclose it without Cabify's prior written authorization. If you are not 
the intended recipient please delete it from your system and notify us by 
e-mail.

Re: Error getting security objects when configuring new replica

Posted by Carlos Alonso <ca...@cabify.com>.
Thanks a lot again for your input, Adam.

Following up on your comments, I've just opened a GitHub issue with the details:
https://github.com/apache/couchdb/issues/602

Regards

On Wed, Jun 14, 2017 at 7:57 PM Adam Kocoloski <ko...@apache.org> wrote:

> Hi Carlos,
>
> Ah, this is an interesting edge case. The “security object” contains the
> “admins” and “members” metadata for a database. For historical reasons it
> is *not* versioned like a normal document. Under normal operating
> circumstances every replica of every shard contains a copy of the security
> object for the database.
>
> When you add a replica for an existing shard that replica does not yet
> have the security object. There is an internal process running in the
> cluster that regularly ensures that the security objects for a database are
> in sync. That process has a safeguard that will cause it to bail out and do
> nothing unless it recovers a simple majority of the security objects for
> all shard replicas of the database in question. Your statement that “having
> less than half of the replica nodes available for the database … raises
> this error” is almost correct; technically, what causes this error is when
> the cluster is unable to contact a majority of the *shard replicas*,
> regardless of which nodes are hosting them.
>
> Hopefully this is an unusual scenario. That said, we could think about
> improving the cluster’s behavior here by allowing the security
> synchronization process to “punch through” maintenance mode and retrieve
> the security objects from those shards for the purposes of establishing a
> majority and subsequently converging all the shards. I think that’s worth
> further discussion in a GitHub issue at least.
>
> Cheers, Adam

Re: Error getting security objects when configuring new replica

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Carlos,

Ah, this is an interesting edge case. The “security object” contains the “admins” and “members” metadata for a database. For historical reasons it is *not* versioned like a normal document. Under normal operating circumstances every replica of every shard contains a copy of the security object for the database.

When you add a replica for an existing shard that replica does not yet have the security object. There is an internal process running in the cluster that regularly ensures that the security objects for a database are in sync. That process has a safeguard that will cause it to bail out and do nothing unless it recovers a simple majority of the security objects for all shard replicas of the database in question. Your statement that “having less than half of the replica nodes available for the database … raises this error” is almost correct; technically, what causes this error is when the cluster is unable to contact a majority of the *shard replicas*, regardless of which nodes are hosting them.
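
The simple-majority safeguard described above can be illustrated with a small Python sketch. This is not CouchDB's Erlang code, only a model of the rule as stated here: the function name, the use of None for a replica that could not be contacted, and the sample user name are assumptions made for the example. The security object uses the admins/members shape returned by GET /{db}/_security.

```python
import json
from collections import Counter

NO_MAJORITY = ("error", "no_majority")


def majority_security_object(responses):
    """Pick the security object held by a simple majority of shard replicas.

    `responses` has one entry per shard replica of the database: the
    security object (a dict) if the replica answered, or None if it could
    not be contacted (for example, its node is in maintenance mode).
    """
    total = len(responses)
    # Serialize with sorted keys so equal objects compare equal as strings.
    answered = [json.dumps(r, sort_keys=True) for r in responses if r is not None]
    if not answered:
        return NO_MAJORITY
    winner, votes = Counter(answered).most_common(1)[0]
    # A simple majority means strictly more than half of ALL shard replicas.
    if votes * 2 > total:
        return json.loads(winner)
    return NO_MAJORITY


# Hypothetical security object for the example.
sec = {"admins": {"names": [], "roles": []},
       "members": {"names": ["carlos"], "roles": []}}

ok = majority_security_object([sec, sec, sec, None])    # 3 of 4 replicas agree
bad = majority_security_object([sec, None, None, sec])  # only 2 of 4 answered
```

In this model the count is taken over shard replicas rather than nodes, so two replicas hosted on the same unavailable node both count as missing.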

Hopefully this is an unusual scenario. That said, we could think about improving the cluster’s behavior here by allowing the security synchronization process to “punch through” maintenance mode and retrieve the security objects from those shards for the purposes of establishing a majority and subsequently converging all the shards. I think that’s worth further discussion in a GitHub issue at least.

Cheers, Adam

> On Jun 14, 2017, at 12:57 PM, Carlos Alonso <ca...@cabify.com> wrote:
> 
> Ok, so I've made some progress on this and I'd like to share it here.
> 
> So the error says "*Error getting security objects for
> <<"affected_database_here">> : {error,no_majority}*" and that is actually
> not related to configuring a new replica node as I was saying before but to
> nodes in maintenance mode when read/write operations happen.
> 
> In summary, having less than half of the replica nodes available for the
> database you're working on raises this error. The database is available
> though (maximum availability by design I guess :))
> 
> My question then is, what does this error exactly mean? What are the so
> called security objects? Is it something one has to carefully consider
> avoiding?
> 
> Thank you.


Re: Error getting security objects when configuring new replica

Posted by Carlos Alonso <ca...@cabify.com>.
Ok, so I've made some progress on this and I'd like to share it here.

So the error says "*Error getting security objects for
<<"affected_database_here">> : {error,no_majority}*" and that is actually
not related to configuring a new replica node as I was saying before but to
nodes in maintenance mode when read/write operations happen.

In summary, having less than half of the replica nodes available for the
database you're working on raises this error. The database is available
though (maximum availability by design I guess :))

My question then is, what does this error exactly mean? What are the so
called security objects? Is it something one has to carefully consider
avoiding?

Thank you.
