Posted to user@couchdb.apache.org by Carlos Alonso <ca...@cabify.com> on 2017/10/01 09:42:28 UTC

Trying to understand why a node gets 'frozen'

Hello everyone!!

I'm trying to understand an issue we're experiencing on CouchDB 2.1.0
running on Ubuntu 14.04. The cluster itself is currently replicating from
another source cluster, and we have seen that one node gets frozen from time
to time and we have to restart it to get it to respond again.

Before getting unresponsive, the node throws a lot of {error,
sel_conn_closed}. See an example trace below.

[error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0> -------- gen_server <0.13489.0> terminated with reason: {checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\",\n \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n {error,sel_conn_closed}}}">>}
  last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on target commit: {'EXIT',{http_request_failed,\"POST\",\n \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n {error,sel_conn_closed}}}">>}}
     state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"https://source_ip/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"http://127.0.0.1:5984/mydb/",nil,[{"Accept","application/json"},{"Authorization","Basic ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}

This particular node is 'responsible' for a replication that produces quite
a lot of {mp_parser_died,noproc} errors, which AFAIK is a known bug
(https://github.com/apache/couchdb/issues/745), but I don't know whether
that is related.

When that happens, just restarting the node brings it back up and running
properly.

Any help would be really appreciated.

Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Joan Touzet <wo...@apache.org>.
Yes, please add your observations to #745 directly.

-Joan

----- Original Message -----
From: "Carlos Alonso" <ca...@cabify.com>
To: user@couchdb.apache.org, "Joan Touzet" <wo...@apache.org>
Sent: Wednesday, 11 October, 2017 10:11:52 AM
Subject: Re: Trying to understand why a node gets 'frozen'

Hi Joan.

Many thanks for your help. It was almost exactly like what you described,
except that I'm trying to monitor remote nodes whose names are their internal
FQDNs, which are only resolvable inside the internal network (Google Cloud).

So I added entries to my local machine's /etc/hosts to map them to
localhost and boom!! Got it connected!!
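
For reference, this is roughly what the working setup looks like. The hostname
and cookie below are made-up placeholders, not the real values, and it only
works because epmd (4369) and the distribution ports are actually reachable at
that address, as Joan pointed out:

# /etc/hosts on the machine running Observer
127.0.0.1    couchdb-1.c.my-project.internal

$ erl -smp -name observer@127.0.0.1 -hidden -setcookie <cluster_cookie> -run observer

Then connect to the remote node (e.g. couchdb@couchdb-1.c.my-project.internal)
from Observer's Nodes menu.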

After having a quick look at the Processes tab I've identified a ton of
erlang:apply/2 processes sitting in couch_httpd_multipart:maybe_send_data/1
with 0 reductions and 0 messages in the queue, and also a lot of
mochiweb_acceptor:init/4 processes sitting in
couch_doc:-doc_from_multi_part_stream/3-fun-1-/1. Some of them have 1
message in the queue, some have 0, and 0 reductions as well...
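
For the record, the same thing can be counted from a remote shell instead of
the Observer GUI. A rough sketch (node name and cookie are placeholders again)
that groups live processes by current_function, biggest groups first:

$ erl -name debug@127.0.0.1 -hidden -setcookie <cluster_cookie> -remsh couchdb@couchdb-1.c.my-project.internal

(couchdb@couchdb-1.c.my-project.internal)1> MFAs = [MFA || P <- processes(),
        {current_function, MFA} <- [process_info(P, current_function)]],
    lists:reverse(lists:keysort(2,
        [{MFA, length([X || X <- MFAs, X =:= MFA])} || MFA <- lists:usort(MFAs)])).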

This makes me suspect a relationship with issue
https://github.com/apache/couchdb/issues/745. This particular node is,
unluckily, responsible for replicating our two databases with
attachments, and we've seen a lot of {mp_parser_died,noproc} errors as
described in that issue.

Also, this node has a few erlang:apply/2 processes in
couch_httpd_multipart:mp_parse_attrs/2 that could be related to all of
this.

I think there's something preventing the processes from exiting and that's
why they pile up until it freezes.
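
To see why one of them never goes away, process_info/2 on a single suspect pid
shows its queue, status and what it is linked to (pid(0,12345,0) is just a
placeholder, substitute whichever pid Observer shows as stuck):

(couchdb@couchdb-1.c.my-project.internal)2> Pid = pid(0,12345,0),  %% a stuck pid picked from Observer
    process_info(Pid, [current_function, current_stacktrace,
                       message_queue_len, status, links, monitors]).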

Any idea/workaround? Should I maybe report all this into an issue?

Thanks!
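
The overall count that the _system endpoint reports (mentioned further down in
the quoted thread) can also be watched straight from the same shell with plain
erlang:system_info/1:

(couchdb@couchdb-1.c.my-project.internal)3> erlang:system_info(process_count).
(couchdb@couchdb-1.c.my-project.internal)4> erlang:system_info(process_limit).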

On Tue, Oct 10, 2017 at 11:41 PM Joan Touzet <wo...@apache.org> wrote:

> Hi Carlos,
>
> After compiling CouchDB master, I started a 3-node dev cluster:
>
> $ dev/run -n 3 --with-admin-party-please
>
> Then, in another terminal window, I was able to run the observer
> like this:
>
> $ erl -smp -name observer@127.0.0.1 -hidden -setcookie monster -run observer
>
> Since 2.x nodes should be launched with -name you need to give your
> local instance a -name as well, with either a FQDN or ip address
> as the host address. You need to match the cluster cookie as
> well. You'll have to manually add the remote node if you're
> running Observer on a different machine, and you'll need access to
> all the relevant ports (4369 + whatever ephemeral ports you've
> allocated to Couch's BEAM; if not statically specified you may need
> ALL ports open!)
>
> Linked is a screenshot of Observer seeing all of the nodes running on
> that VM, connected to node1, and showing the couch process tree.
>
> https://i.imgur.com/Vb3Jisa.png
>
> -Joan
>
> ----- Original Message -----
> From: "Carlos Alonso" <ca...@cabify.com>
> To: user@couchdb.apache.org, "Joan Touzet" <wo...@apache.org>
> Sent: Tuesday, 10 October, 2017 3:37:20 AM
> Subject: Re: Trying to understand why a node gets 'frozen'
>
>
> Hi Geoffrey.
>
>
> Thank you very much for your post, great step by step!!
>
>
> Unfortunately that's not quite what I need. I already have similar monitoring
> with Datadog, but what I really need is to inspect the Erlang process
> information.
>
>
> So far I've seen that the process_count metric, extracted from the _system
> endpoint, grows continuously on just one of the nodes. The other two are
> stable at around 1-1.2k processes, but the failing node grows slowly but
> continuously until it reaches about 5.1k, at which point it gets unresponsive.
>
>
> By connecting the observer I'd like to see what kind of processes those are,
> to try to get closer to the cause. ATM I've been trying to run both etop and
> the observer, but without luck on either of them.
>
>
> Can anyone help me with a few steps on how to inspect Erlang processes on
> a remote server?
>
>
> Thanks!!
>
>
> On Mon, Oct 9, 2017 at 3:39 PM Geoffrey Cox < redgeoff@gmail.com > wrote:
>
>
> Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus:
>
> https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0
>
> I’m not sure if it will provide all the metrics you need, but I hope this
> helps
>
> Geoff
> On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso < carlos.alonso@cabify.com >
> wrote:
>
> > I'd like to connect a diagnostic tool such as etop, observer, ... to see
> > which processes are open there, but I cannot seem to get it working.
> >
> > Could anyone please share how to run any of those tools on a remote
> server?
> >
> > Regards
> >
> > On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso < carlos.alonso@cabify.com
> >
> > wrote:
> >
> > > So I found another relevant symptom. After adding _system endpoint
> > > monitoring I have discovered that the particular node behaves differently
> > > from the other ones in terms of Erlang process count.
> > >
> > > The process_count metric of the normal nodes is stable at around 1k to
> > > 1.3k, while the other node's process_count grows slowly but continuously
> > > until a little above 5k processes, which is when it gets 'frozen'. After
> > > restarting, the value comes back to the normal 1k to 1.3k (and immediately
> > > starts slowly growing again, of course :)).
> > >
> > > Any idea? Thanks!
> > >
> > > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <
> carlos.alonso@cabify.com >
> > > wrote:
> > >
> > >> This is one of the complete error sequences I can see:
> > >>
> > >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1' with exit value:
> > >> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
> > >>
> > >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204> aab326c0bb req_err (2515771787) badmatch : ok
> > >> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207> -------- Replicator, request PUT to "http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false" failed due to error {error,
> > >>     {'EXIT',
> > >>         {{{nocatch,{mp_parser_died,noproc}},
> > >>         ...
> > >>
> > >> Regards
> > >>
> > >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <
> carlos.alonso@cabify.com
> > >
> > >> wrote:
> > >>
> > >>> The 'weird' thing about the mp_parser_died error is that, according to
> > >>> the description of issue 745, the replication never finishes because the
> > >>> item that fails once seems to fail forever; but in my case they fail and
> > >>> then seem to work (possibly because the replication is retried), as I can
> > >>> find the documents that generated the errors (in the logs) in the target
> > >>> db...
> > >>>
> > >>> Regards
> > >>>
> > >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <
> > carlos.alonso@cabify.com >
> > >>> wrote:
> > >>>
> > >>>> So, to give some more context: this node is responsible for replicating
> > >>>> a database that has quite a lot of attachments, and it raises the
> > >>>> 'famous' mp_parser_died,noproc error, which I think is this one:
> > >>>> https://github.com/apache/couchdb/issues/745
> > >>>>
> > >>>> What I've identified so far from the logs is that, along with the error
> > >>>> described above, this error also appears:
> > >>>>
> > >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1 <0.30012.3408> 520e44b7ae req_err (2515771787) badmatch : ok
> > >>>> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >>>>
> > >>>> Sometimes it appears just after the mp_parser_died error, sometimes the
> > >>>> parser error happens without 'triggering' one of these badmatch ones.
> > >>>>
> > >>>> Then, after a while of this sequence, the initially described
> > >>>> sel_conn_closed error starts being raised for all requests and the node
> > >>>> gets frozen. It is not responsive, but it is still not removed from the
> > >>>> cluster, so it holds its replications and, obviously, doesn't replicate
> > >>>> anything until it is restarted.
> > >>>>
> > >>>> I can also see interleaved unauthorized errors, which don't make much
> > >>>> sense as I'm the only one accessing this cluster:
> > >>>>
> > >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1 <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not authorized to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> > >>>>
> > >>>> To me, it feels like the mp_parser_died error slowly breaks something
> > >>>> that in the end renders the node unresponsive, as those errors happen a
> > >>>> lot in that particular replication.
> > >>>>
> > >>>> Regards and thanks a lot for your help!
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet < wohali@apache.org >
> wrote:
> > >>>>
> > >>>>> Is there more to the error? All this shows us is that the
> replicator
> > >>>>> itself attempted a POST and had the connection closed on it.
> > (Remember
> > >>>>> that the replicator is basically just a custom client that sits
> > >>>>> alongside CouchDB on the same machine.) There should be more to the
> > >>>>> error log that shows why CouchDB hung up the phone.
> > >>>>>
> > >>>>> ----- Original Message -----
> > >>>>> From: "Carlos Alonso" < carlos.alonso@cabify.com >
> > >>>>> To: "user" < user@couchdb.apache.org >
> > >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
> > >>>>> Subject: Re: Trying to understand why a node gets 'frozen'
> > >>>>>
> > >>>>> Hello, this is happening every day, always on the same node. Any
> > ideas?
> > >>>>>
> > >>>>> Thanks!

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
Hi Joan.

Many thanks for your help. It was almost exactly like what you described
except for the fact that I'm trying to monitor remote nodes whose name is
the internal fqdn, that is only resolvable in the internal network (Google
Cloud).

So I added entries on my local machine's /etc/hosts to map them to
localhost and boom!! Got it connected!!

After having a quick look at the Proceses tab I've identified a ton of
erlang:apply/2 in function couch_httpd_multipart:maybe_send_data/1 with 0
reductions and 0 messages in the queue and also a lot of
mochiweb_acceptor:init/4 in function
couch_doc:-doc_from_multi_part_stream/3-fun-1-/1 Some of them with 1
message on the queue, some of them with 0 and 0 reductions as well...

This makes me think about a relationship with issue
https://github.com/apache/couchdb/issues/745. This particular node is,
unluckily, the responsible for replicating our two databases with
attachments, and we've ween a lot of {mp_parser_died,noproc} errors as
described in that issue.

Also this node has a few erlang:apply/2 processes in function
couch_http_multipart:mp_parse_attrs/2 that could have a relationship with
this all.

I think there's something preventing the processes from exiting and that's
why they pile up until it freezes.

Any idea/workaround? Should I maybe report all this into an issue?

Thanks!

On Tue, Oct 10, 2017 at 11:41 PM Joan Touzet <wo...@apache.org> wrote:

> Hi Carlos,
>
> After compiling CouchDB master, I started a 3-node dev cluster:
>
> $ dev/run -n 3 --with-admin-party-please
>
> Then, in another terminal window, I was able to run the observer
> like this:
>
> $ erl -smp -name observer@127.0.0.1 -hidden -setcookie monster -run
> observer
>
> Since 2.x nodes should be launched with -name you need to give your
> local instance a -name as well, with either a FQDN or ip address
> as the host address. You need to match the cluster cookie as
> well. You'll have to manually add the remote node if you're
> running Observer on a different machine, and you'll need access to
> all the relevant ports (4369 + whatever ephemeral ports you've
> allocated to Couch's BEAM; if not statically specified you may need
> ALL ports open!)
>
> Linked is a screenshot of Observer seeing all of the nodes running on
> that VM, connected to node1, and showing the couch process tree.
>
> https://i.imgur.com/Vb3Jisa.png
>
> -Joan
>
> ----- Original Message -----
> From: "Carlos Alonso" <ca...@cabify.com>
> To: user@couchdb.apache.org, "Joan Touzet" <wo...@apache.org>
> Sent: Tuesday, 10 October, 2017 3:37:20 AM
> Subject: Re: Trying to understand why a node gets 'frozen'
>
>
> Hi Geoffrey.
>
>
> Thank you very much for your post, great step by step!!
>
>
> Unfortunately that's not quite what I need. I have that similar monitoring
> with Datadog, but what I really need is to inspect the Erlang processes
> information.
>
>
> So far I've seen that the process_count metric, extracted from the _system
> endpoint grows continuously on just one of the nodes. The other two have it
> stable at around 1-1.2k processes, but the failing node grows slowly but
> continuously until it reaches about 5.1k that it gets unresponsive.
>
>
> By connecting the observer I'd like to see which kind of processes are
> those to try to get closer to the cause. ATM I've been trying to run both
> etop and the observer but without luck on any of them.
>
>
> Can anyone help me with a few steps on how to inspect Erlang processes on
> a remote server?
>
>
> Thanks!!
>
>
> On Mon, Oct 9, 2017 at 3:39 PM Geoffrey Cox < redgeoff@gmail.com > wrote:
>
>
> Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus:
>
> https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0
>
> I’m not sure if it will provide all the metrics you need, but I hope this
> helps
>
> Geoff
> On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso < carlos.alonso@cabify.com >
> wrote:
>
> > I'd like to connect a diagnosing tool such as etop, observer, ... to see
> > which processes are open there but I cannot seem to have it working.
> >
> > Could anyone please share how to run any of those tools on a remote
> server?
> >
> > Regards
> >
> > On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso < carlos.alonso@cabify.com
> >
> > wrote:
> >
> > > So I could find another relevant symptom. After adding _system endpoint
> > > monitoring I have discovered that the particular node has a different
> > > behaviour than the other ones in terms of Erlang process count.
> > >
> > > The process_count metric of the normal nodes is stable around 1k to
> 1.3k
> > > while the other node's process_count is slowly but continuously growing
> > > until a little above than 5k processes that is when it gets 'frozen'.
> > After
> > > restarting the value comes back to the normal 1k to 1.3k (to
> immediately
> > > start slowly growing again, of course :)).
> > >
> > > Any idea? Thanks!
> > >
> > > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <
> carlos.alonso@cabify.com >
> > > wrote:
> > >
> > >> This is one of the complete errors sequences I can see:
> > >>
> > >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
> > >> -------- Error in process <0.24558.209> on node
> 'couchdb@couchdb-node-1
> > '
> > >> with exit value:
> > >>
> > >>
> >
> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
> > >>
> > >>
> >
> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
> > >>
> > >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1
> <0.5208.204>
> > >> aab326c0bb req_err (2515771787 <(251)%20577-1787>
> <(251)%20577-1787>) badmatch : ok
> > >> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> > >> L295">>,<<"chttpd:handle_request_int/1
> > L231">>,<<"mochiweb_http:headers/6
> > >> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1
> > <0.20718.207>
> > >> -------- Replicator, request PUT to "
> > >>
> >
> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false
> > "
> > >> failed due to error {error,
> > >> {'EXIT',
> > >> {{{nocatch,{mp_parser_died,noproc}},
> > >> ...
> > >>
> > >> Regards
> > >>
> > >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <
> carlos.alonso@cabify.com
> > >
> > >> wrote:
> > >>
> > >>> The 'weird' thing about the mp_parser_died error is that, according
> to
> > >>> the description of the issue 745, the replication never finishes as
> the
> > >>> item that fails once, seems to fail forever, but in my case they
> fail,
> > but
> > >>> then they seem to work (possibly as the replication is retried), as I
> > can
> > >>> find the documents that generated the error (in the logs) in the
> target
> > >>> db...
> > >>>
> > >>> Regards
> > >>>
> > >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <
> > carlos.alonso@cabify.com >
> > >>> wrote:
> > >>>
> > >>>> So to give some more context this node is responsible for
> replicating
> > a
> > >>>> database that has quite many attachments and it raises the 'famous'
> > >>>> mp_parser_died,noproc error, that I think is this one:
> > >>>> https://github.com/apache/couchdb/issues/745
> > >>>>
> > >>>> What I've identified so far from the logs is that along with the
> error
> > >>>> described above, also this error appears:
> > >>>>
> > >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
> > >>>> <0.30012.3408> 520e44b7ae req_err (2515771787 <(251)%20577-1787>
> <(251)%20577-1787>)
> > >>>> badmatch : ok
> > >>>> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> > >>>> L295">>,<<"chttpd:handle_request_int/1
> > L231">>,<<"mochiweb_http:headers/6
> > >>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >>>>
> > >>>> Sometimes it appears just after the mp_parser_died error, sometimes
> > the
> > >>>> parser error happens without 'triggering' one of this badmatch ones.
> > >>>>
> > >>>> Then, after a while of this sequence, the initially described
> > >>>> sel_conn_closed error starts raising for all requests and the node
> > gets
> > >>>> frozen. It is not responsive but it is still not removed from the
> > cluster,
> > >>>> holding its replications and, obviously, not replicating anything
> > until it
> > >>>> is restarted.
> > >>>>
> > >>>> I can also see interleaved unauthorized errors, which don't make
> much
> > >>>> sense as I'm the only one accessing this cluster
> > >>>>
> > >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
> > >>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are
> > not
> > >>>> authorized to access this db.">>} [{couch_db,open,2
> > >>>>
> > >>>>
> >
> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> > >>>>
> > >>>>
> > >>>> To me, it feels like the mp_parser_died error slowly breaks
> something
> > >>>> that in the end brings the node unresponsive, as those errors happen
> > a lot
> > >>>> in that particular replication.
> > >>>>
> > >>>> Regards and thanks a lot for your help!
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet < wohali@apache.org >
> wrote:
> > >>>>
> > >>>>> Is there more to the error? All this shows us is that the
> replicator
> > >>>>> itself attempted a POST and had the connection closed on it.
> > (Remember
> > >>>>> that the replicator is basically just a custom client that sits
> > >>>>> alongside CouchDB on the same machine.) There should be more to the
> > >>>>> error log that shows why CouchDB hung up the phone.
> > >>>>>
> > >>>>> ----- Original Message -----
> > >>>>> From: "Carlos Alonso" < carlos.alonso@cabify.com >
> > >>>>> To: "user" < user@couchdb.apache.org >
> > >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
> > >>>>> Subject: Re: Trying to understand why a node gets 'frozen'
> > >>>>>
> > >>>>> Hello, this is happening every day, always on the same node. Any
> > ideas?
> > >>>>>
> > >>>>> Thanks!
> > >>>>>
> > >>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <
> > >>>>> carlos.alonso@cabify.com >
> > >>>>> wrote:
> > >>>>>
> > >>>>> > Hello everyone!!
> > >>>>> >
> > >>>>> > I'm trying to understand an issue we're experiencing on CouchDB
> > 2.1.0
> > >>>>> > running on Ubuntu 14.04. The cluster itself is currently
> > replicating
> > >>>>> from
> > >>>>> > another source cluster and we have seen that one node gets frozen
> > >>>>> from time
> > >>>>> > to time having to restart it to get it to respond again.
> > >>>>> >
> > >>>>> > Before getting unresponsive, the node throws a lot of {error,
> > >>>>> > sel_conn_closed}. See an example trace below.
> > >>>>> >
> > >>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1
> <0.13489.0>
> > >>>>> > -------- gen_server <0.13489.0> terminated with reason:
> > >>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
> > >>>>> > {'EXIT',{http_request_failed,\"POST\",\n
> > >>>>> \"
> > >>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\ ",\n
> > >>>>> > {error,sel_conn_closed}}}">>}
> > >>>>> > last msg:
> > >>>>> {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on
> > >>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n
> > >>>>> > \" http://127.0.0.1:5984/mydb/_ensure_full_commit\ ",\n
> > >>>>> > {error,sel_conn_closed}}}">>}}
> > >>>>> > state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"
> > >>>>> > https://source_ip/mydb/
> > >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> > >>>>> >
> > >>>>>
> >
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"
> > >>>>> > http://127.0.0.1:5984/mydb/
> > >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> > >>>>> >
> > >>>>>
> >
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
> > >>>>> >
> > >>>>> > The particular node is 'responsible' for a replication that has
> > >>>>> quite many
> > >>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug (
> > >>>>> > https://github.com/apache/couchdb/issues/745 ), but I don't
> know if
> > >>>>> that
> > >>>>> > may have any relationship.
> > >>>>> >
> > >>>>> > When that happens, just restarting the node brings it up and
> > running
> > >>>>> > properly.
> > >>>>> >
> > >>>>> > Any help would be really appreciated.
> > >>>>> >
> > >>>>> > Regards
> > >>>>> > --
> > >>>>> > [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >>>>> >
> > >>>>> > *Carlos Alonso*
> > >>>>> > Data Engineer
> > >>>>> > Madrid, Spain
> > >>>>> >
> > >>>>> > carlos.alonso@cabify.com
> > >>>>> >
> > >>>>> > Prueba gratis con este código
> > >>>>> > #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > >>>>> > [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > >>>>> > < http://cbify.com/tw_ES >[image: Instagram] <
> http://cbify.com/in_ES
> > >>>>> >[image:
> > >>>>> > Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >>>>> >
> > >>>>> --
> > >>>>> [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >>>>>
> > >>>>> *Carlos Alonso*
> > >>>>> Data Engineer
> > >>>>> Madrid, Spain
> > >>>>>
> > >>>>> carlos.alonso@cabify.com
> > >>>>>
> > >>>>> Prueba gratis con este código
> > >>>>> #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > >>>>> [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > >>>>> < http://cbify.com/tw_ES >[image: Instagram] <
> http://cbify.com/in_ES
> > >>>>> >[image:
> > >>>>> Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >>>>>
> > >>>>> --
> > >>>>> Este mensaje y cualquier archivo adjunto va dirigido
> exclusivamente a
> > >>>>> su
> > >>>>> destinatario, pudiendo contener información confidencial sometida a
> > >>>>> secreto
> > >>>>> profesional. No está permitida su reproducción o distribución sin
> la
> > >>>>> autorización expresa de Cabify. Si usted no es el destinatario
> final
> > >>>>> por
> > >>>>> favor elimínelo e infórmenos por esta vía.
> > >>>>>
> > >>>>> This message and any attached file are intended exclusively for the
> > >>>>> addressee, and it may be confidential. You are not allowed to copy
> or
> > >>>>> disclose it without Cabify's prior written authorization. If you
> are
> > >>>>> not
> > >>>>> the intended recipient please delete it from your system and notify
> > us
> > >>>>> by
> > >>>>> e-mail.
> > >>>>>
> > >>>> --
> > >>>> [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >>>>
> > >>>> *Carlos Alonso*
> > >>>> Data Engineer
> > >>>> Madrid, Spain
> > >>>>
> > >>>> carlos.alonso@cabify.com
> > >>>>
> > >>>> Prueba gratis con este código
> > >>>> #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > >>>> [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > >>>> < http://cbify.com/tw_ES >[image: Instagram] <
> http://cbify.com/in_ES
> > >[image:
> > >>>> Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >>>>
> > >>> --
> > >>> [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >>>
> > >>> *Carlos Alonso*
> > >>> Data Engineer
> > >>> Madrid, Spain
> > >>>
> > >>> carlos.alonso@cabify.com
> > >>>
> > >>> Prueba gratis con este código
> > >>> #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > >>> [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > >>> < http://cbify.com/tw_ES >[image: Instagram] <
> http://cbify.com/in_ES
> > >[image:
> > >>> Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >>>
> > >> --
> > >> [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >>
> > >> *Carlos Alonso*
> > >> Data Engineer
> > >> Madrid, Spain
> > >>
> > >> carlos.alonso@cabify.com
> > >>
> > >> Prueba gratis con este código
> > >> #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > >> [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > >> < http://cbify.com/tw_ES >[image: Instagram] < http://cbify.com/in_ES
> > >[image:
> > >> Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >>
> > > --
> > > [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> > >
> > > *Carlos Alonso*
> > > Data Engineer
> > > Madrid, Spain
> > >
> > > carlos.alonso@cabify.com
> > >
> > > Prueba gratis con este código
> > > #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > > [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > > < http://cbify.com/tw_ES >[image: Instagram] < http://cbify.com/in_ES
> > >[image:
> > > Linkedin] < https://www.linkedin.com/in/mrcalonso >
> > >
> > --
> > [image: Cabify - Your private Driver] < http://www.cabify.com/ >
> >
> > *Carlos Alonso*
> > Data Engineer
> > Madrid, Spain
> >
> > carlos.alonso@cabify.com
> >
> > Prueba gratis con este código
> > #CARLOSA6319 < https://cabify.com/i/carlosa6319 >
> > [image: Facebook] < http://cbify.com/fb_ES >[image: Twitter]
> > < http://cbify.com/tw_ES >[image: Instagram] < http://cbify.com/in_ES
> >[image:
> > Linkedin] < https://www.linkedin.com/in/mrcalonso >
> >
> > --
> > Este mensaje y cualquier archivo adjunto va dirigido exclusivamente a su
> > destinatario, pudiendo contener información confidencial sometida a
> secreto
> > profesional. No está permitida su reproducción o distribución sin la
> > autorización expresa de Cabify. Si usted no es el destinatario final por
> > favor elimínelo e infórmenos por esta vía.
> >
> > This message and any attached file are intended exclusively for the
> > addressee, and it may be confidential. You are not allowed to copy or
> > disclose it without Cabify's prior written authorization. If you are not
> > the intended recipient please delete it from your system and notify us by
> > e-mail.
> >
>
> --
>
>
> Cabify - Your private Driver
>
> Carlos Alonso
> Data Engineer
> Madrid, Spain
>
> carlos.alonso@cabify.com
>
>
>
> Prueba gratis con este código
> #CARLOSA6319
>         FacebookTwitterInstagramLinkedin
> Este mensaje y cualquier archivo adjunto va dirigido exclusivamente a su
> destinatario, pudiendo contener información confidencial sometida a secreto
> profesional. No está permitida su reproducción o distribución sin la
> autorización expresa de Cabify. Si usted no es el destinatario final por
> favor elimínelo e infórmenos por esta vía.
>
> This message and any attached file are intended exclusively for the
> addressee, and it may be confidential. You are not allowed to copy or
> disclose it without Cabify's prior written authorization. If you are not
> the intended recipient please delete it from your system and notify us by
> e-mail.
>
-- 
[image: Cabify - Your private Driver] <http://www.cabify.com/>

*Carlos Alonso*
Data Engineer
Madrid, Spain

carlos.alonso@cabify.com

[image: Facebook] <http://cbify.com/fb_ES>[image: Twitter]
<http://cbify.com/tw_ES>[image: Instagram] <http://cbify.com/in_ES>[image:
Linkedin] <https://www.linkedin.com/in/mrcalonso>

-- 
Este mensaje y cualquier archivo adjunto va dirigido exclusivamente a su 
destinatario, pudiendo contener información confidencial sometida a secreto 
profesional. No está permitida su reproducción o distribución sin la 
autorización expresa de Cabify. Si usted no es el destinatario final por 
favor elimínelo e infórmenos por esta vía. 

This message and any attached file are intended exclusively for the 
addressee, and it may be confidential. You are not allowed to copy or 
disclose it without Cabify's prior written authorization. If you are not 
the intended recipient please delete it from your system and notify us by 
e-mail.

Re: Trying to understand why a node gets 'frozen'

Posted by Joan Touzet <wo...@apache.org>.
Hi Carlos,

After compiling CouchDB master, I started a 3-node dev cluster:

$ dev/run -n 3 --with-admin-party-please

Then, in another terminal window, I was able to run the observer
like this:

$ erl -smp -name observer@127.0.0.1 -hidden -setcookie monster -run observer

Since 2.x nodes should be launched with -name you need to give your
local instance a -name as well, with either a FQDN or ip address
as the host address. You need to match the cluster cookie as
well. You'll have to manually add the remote node if you're
running Observer on a different machine, and you'll need access to
all the relevant ports (4369 + whatever ephemeral ports you've
allocated to Couch's BEAM; if not statically specified you may need
ALL ports open!)

Linked is a screenshot of Observer seeing all of the nodes running on
that VM, connected to node1, and showing the couch process tree.

https://i.imgur.com/Vb3Jisa.png

-Joan

----- Original Message -----
From: "Carlos Alonso" <ca...@cabify.com>
To: user@couchdb.apache.org, "Joan Touzet" <wo...@apache.org>
Sent: Tuesday, 10 October, 2017 3:37:20 AM
Subject: Re: Trying to understand why a node gets 'frozen'


Hi Geoffrey. 


Thank you very much for your post, great step by step!! 


Unfortunately that's not quite what I need. I have that similar monitoring with Datadog, but what I really need is to inspect the Erlang processes information. 


So far I've seen that the process_count metric, extracted from the _system endpoint grows continuously on just one of the nodes. The other two have it stable at around 1-1.2k processes, but the failing node grows slowly but continuously until it reaches about 5.1k that it gets unresponsive. 


By connecting the observer I'd like to see which kind of processes are those to try to get closer to the cause. ATM I've been trying to run both etop and the observer but without luck on any of them. 


Can anyone help me with a few steps on how to inspect Erlang processes on a remote server? 


Thanks!! 


On Mon, Oct 9, 2017 at 3:39 PM Geoffrey Cox < redgeoff@gmail.com > wrote: 


Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus: 
https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0 

I’m not sure if it will provide all the metrics you need, but I hope this 
helps 

Geoff 
On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso < carlos.alonso@cabify.com > 
wrote: 

> I'd like to connect a diagnosing tool such as etop, observer, ... to see 
> which processes are open there but I cannot seem to have it working. 
> 
> Could anyone please share how to run any of those tools on a remote server? 
> 
> Regards 
> 
> On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso < carlos.alonso@cabify.com > 
> wrote: 
> 
> > So I could find another relevant symptom. After adding _system endpoint 
> > monitoring I have discovered that the particular node has a different 
> > behaviour than the other ones in terms of Erlang process count. 
> > 
> > The process_count metric of the normal nodes is stable around 1k to 1.3k 
> > while the other node's process_count is slowly but continuously growing 
> > until a little above than 5k processes that is when it gets 'frozen'. 
> After 
> > restarting the value comes back to the normal 1k to 1.3k (to immediately 
> > start slowly growing again, of course :)). 
> > 
> > Any idea? Thanks! 
> > 
> > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso < carlos.alonso@cabify.com > 
> > wrote: 
> > 
> >> This is one of the complete errors sequences I can see: 
> >> 
> >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator 
> >> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1 
> ' 
> >> with exit value: 
> >> 
> >> 
> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595} 
> >> 
> >> 
> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]} 
> >> 
> >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204> 
> >> aab326c0bb req_err (2515771787 <(251)%20577-1787>) badmatch : ok 
> >> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 
> >> L295">>,<<"chttpd:handle_request_int/1 
> L231">>,<<"mochiweb_http:headers/6 
> >> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>] 
> >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 
> <0.20718.207> 
> >> -------- Replicator, request PUT to " 
> >> 
> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false 
> " 
> >> failed due to error {error, 
> >> {'EXIT', 
> >> {{{nocatch,{mp_parser_died,noproc}}, 
> >> ... 
> >> 
> >> Regards 
> >> 
> >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso < carlos.alonso@cabify.com 
> > 
> >> wrote: 
> >> 
> >>> The 'weird' thing about the mp_parser_died error is that, according to 
> >>> the description of the issue 745, the replication never finishes as the 
> >>> item that fails once, seems to fail forever, but in my case they fail, 
> but 
> >>> then they seem to work (possibly as the replication is retried), as I 
> can 
> >>> find the documents that generated the error (in the logs) in the target 
> >>> db... 
> >>> 
> >>> Regards 
> >>> 
> >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso < 
> carlos.alonso@cabify.com > 
> >>> wrote: 
> >>> 
> >>>> So to give some more context this node is responsible for replicating 
> a 
> >>>> database that has quite many attachments and it raises the 'famous' 
> >>>> mp_parser_died,noproc error, that I think is this one: 
> >>>> https://github.com/apache/couchdb/issues/745 
> >>>> 
> >>>> What I've identified so far from the logs is that along with the error 
> >>>> described above, also this error appears: 
> >>>> 
> >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1 
> >>>> <0.30012.3408> 520e44b7ae req_err (2515771787 <(251)%20577-1787>) 
> >>>> badmatch : ok 
> >>>> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 
> >>>> L295">>,<<"chttpd:handle_request_int/1 
> L231">>,<<"mochiweb_http:headers/6 
> >>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>] 
> >>>> 
> >>>> Sometimes it appears just after the mp_parser_died error, sometimes 
> the 
> >>>> parser error happens without 'triggering' one of this badmatch ones. 
> >>>> 
> >>>> Then, after a while of this sequence, the initially described 
> >>>> sel_conn_closed error starts raising for all requests and the node 
> gets 
> >>>> frozen. It is not responsive but it is still not removed from the 
> cluster, 
> >>>> holding its replications and, obviously, not replicating anything 
> until it 
> >>>> is restarted. 
> >>>> 
> >>>> I can also see interleaved unauthorized errors, which don't make much 
> >>>> sense as I'm the only one accessing this cluster 
> >>>> 
> >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1 
> >>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are 
> not 
> >>>> authorized to access this db.">>} [{couch_db,open,2 
> >>>> 
> >>>> 
> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}] 
> >>>> 
> >>>> 
> >>>> To me, it feels like the mp_parser_died error slowly breaks something 
> >>>> that in the end brings the node unresponsive, as those errors happen 
> a lot 
> >>>> in that particular replication. 
> >>>> 
> >>>> Regards and thanks a lot for your help! 
> >>>> 
> >>>> 
> >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet < wohali@apache.org > wrote: 
> >>>> 
> >>>>> Is there more to the error? All this shows us is that the replicator 
> >>>>> itself attempted a POST and had the connection closed on it. 
> (Remember 
> >>>>> that the replicator is basically just a custom client that sits 
> >>>>> alongside CouchDB on the same machine.) There should be more to the 
> >>>>> error log that shows why CouchDB hung up the phone. 
> >>>>> 
> >>>>> ----- Original Message ----- 
> >>>>> From: "Carlos Alonso" < carlos.alonso@cabify.com > 
> >>>>> To: "user" < user@couchdb.apache.org > 
> >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM 
> >>>>> Subject: Re: Trying to understand why a node gets 'frozen' 
> >>>>> 
> >>>>> Hello, this is happening every day, always on the same node. Any 
> ideas? 
> >>>>> 
> >>>>> Thanks! 
> >>>>> 
> >>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso < 
> >>>>> carlos.alonso@cabify.com > 
> >>>>> wrote: 
> >>>>> 
> >>>>> > Hello everyone!! 
> >>>>> > 
> >>>>> > I'm trying to understand an issue we're experiencing on CouchDB 
> 2.1.0 
> >>>>> > running on Ubuntu 14.04. The cluster itself is currently 
> replicating 
> >>>>> from 
> >>>>> > another source cluster and we have seen that one node gets frozen 
> >>>>> from time 
> >>>>> > to time having to restart it to get it to respond again. 
> >>>>> > 
> >>>>> > Before getting unresponsive, the node throws a lot of {error, 
> >>>>> > sel_conn_closed}. See an example trace below. 
> >>>>> > 
> >>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0> 
> >>>>> > -------- gen_server <0.13489.0> terminated with reason: 
> >>>>> > {checkpoint_commit_failure,<<"Failure on target commit: 
> >>>>> > {'EXIT',{http_request_failed,\"POST\",\n 
> >>>>> \" 
> >>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\ ",\n 
> >>>>> > {error,sel_conn_closed}}}">>} 
> >>>>> > last msg: 
> >>>>> {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on 
> >>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n 
> >>>>> > \" http://127.0.0.1:5984/mydb/_ensure_full_commit\ ",\n 
> >>>>> > {error,sel_conn_closed}}}">>}} 
> >>>>> > state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb," 
> >>>>> > https://source_ip/mydb/ 
> >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic 
> >>>>> > 
> >>>>> 
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb," 
> >>>>> > http://127.0.0.1:5984/mydb/ 
> >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic 
> >>>>> > 
> >>>>> 
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}} 
> >>>>> > 
> >>>>> > The particular node is 'responsible' for a replication that has 
> >>>>> quite many 
> >>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug ( 
> >>>>> > https://github.com/apache/couchdb/issues/745 ), but I don't know if 
> >>>>> that 
> >>>>> > may have any relationship. 
> >>>>> > 
> >>>>> > When that happens, just restarting the node brings it up and 
> running 
> >>>>> > properly. 
> >>>>> > 
> >>>>> > Any help would be really appreciated. 
> >>>>> > 
> >>>>> > Regards 

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
Hi Geoffrey.

Thank you very much for your post, great step by step!!

Unfortunately that's not quite what I need. I already have similar monitoring
in place with Datadog, but what I really need is to inspect the Erlang
process information itself.

So far I've seen that the process_count metric, extracted from the _system
endpoint, grows continuously on just one of the nodes. The other two stay
stable at around 1-1.2k processes, but the failing node grows slowly and
continuously until it reaches about 5.1k, at which point it becomes
unresponsive.

By connecting the observer I'd like to see what kind of processes those are,
to get closer to the cause. So far I've been trying to run both etop and the
observer, but without luck on either of them.

Can anyone help me with a few steps on how to inspect Erlang processes on a
remote server?

Thanks!!
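
For reference, a minimal sketch of one way to attach etop or the observer to a
running node. The node name below is the one that appears in the logs, and the
cookie is just a placeholder; take the real -name/-sname and -setcookie values
from the node's vm.args:

    # start a hidden Erlang node next to CouchDB, using the same naming mode
    # (-name vs -sname) and the same cookie as the CouchDB node
    erl -name debug@127.0.0.1 -setcookie monster -hidden

    %% from the resulting shell:
    net_kernel:connect_node('couchdb@couchdb-node-1').
    %% quick check of the growing process count
    rpc:call('couchdb@couchdb-node-1', erlang, system_info, [process_count]).
    %% text-mode process top for the remote node
    etop:start([{node, 'couchdb@couchdb-node-1'}, {lines, 20}]).
    %% or, if a local GUI is available, start observer and pick the CouchDB
    %% node from its Nodes menu
    observer:start().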

On Mon, Oct 9, 2017 at 3:39 PM Geoffrey Cox <re...@gmail.com> wrote:

> Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus:
>
> https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0
>
> I’m not sure if it will provide all the metrics you need, but I hope this
> helps
>
> Geoff
> On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso <ca...@cabify.com>
> wrote:
>
> > I'd like to connect a diagnosing tool such as etop, observer, ... to see
> > which processes are open there but I cannot seem to have it working.
> >
> > Could anyone please share how to run any of those tools on a remote
> server?
> >
> > Regards
> >
> > On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso <ca...@cabify.com>
> > wrote:
> >
> > > So I could find another relevant symptom. After adding _system endpoint
> > > monitoring I have discovered that the particular node has a different
> > > behaviour than the other ones in terms of Erlang process count.
> > >
> > > The process_count metric of the normal nodes is stable around 1k to
> 1.3k
> > > while the other node's process_count is slowly but continuously growing
> > > until a little above than 5k processes that is when it gets 'frozen'.
> > After
> > > restarting the value comes back to the normal 1k to 1.3k (to
> immediately
> > > start slowly growing again, of course :)).
> > >
> > > Any idea? Thanks!
> > >
> > > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <
> carlos.alonso@cabify.com>
> > > wrote:
> > >
> > >> This is one of the complete errors sequences I can see:
> > >>
> > >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
> > >> -------- Error in process <0.24558.209> on node
> 'couchdb@couchdb-node-1
> > '
> > >> with exit value:
> > >>
> > >>
> >
> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
> > >>
> > >>
> >
> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
> > >>
> > >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1
> <0.5208.204>
> > >> aab326c0bb req_err(2515771787 <(251)%20577-1787> <(251)%20577-1787>)
> badmatch : ok
> > >>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> > >> L295">>,<<"chttpd:handle_request_int/1
> > L231">>,<<"mochiweb_http:headers/6
> > >> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1
> > <0.20718.207>
> > >> -------- Replicator, request PUT to "
> > >>
> >
> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false
> > "
> > >> failed due to error {error,
> > >>     {'EXIT',
> > >>         {{{nocatch,{mp_parser_died,noproc}},
> > >> ...
> > >>
> > >> Regards
> > >>
> > >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <
> carlos.alonso@cabify.com
> > >
> > >> wrote:
> > >>
> > >>> The 'weird' thing about the mp_parser_died error is that, according
> to
> > >>> the description of the issue 745, the replication never finishes as
> the
> > >>> item that fails once, seems to fail forever, but in my case they
> fail,
> > but
> > >>> then they seem to work (possibly as the replication is retried), as I
> > can
> > >>> find the documents that generated the error (in the logs) in the
> target
> > >>> db...
> > >>>
> > >>> Regards
> > >>>
> > >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <
> > carlos.alonso@cabify.com>
> > >>> wrote:
> > >>>
> > >>>> So to give some more context this node is responsible for
> replicating
> > a
> > >>>> database that has quite many attachments and it raises the 'famous'
> > >>>> mp_parser_died,noproc error, that I think is this one:
> > >>>> https://github.com/apache/couchdb/issues/745
> > >>>>
> > >>>> What I've identified so far from the logs is that along with the
> error
> > >>>> described above, also this error appears:
> > >>>>
> > >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
> > >>>> <0.30012.3408> 520e44b7ae req_err(2515771787 <(251)%20577-1787>
> <(251)%20577-1787>)
> > >>>> badmatch : ok
> > >>>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> > >>>> L295">>,<<"chttpd:handle_request_int/1
> > L231">>,<<"mochiweb_http:headers/6
> > >>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> > >>>>
> > >>>> Sometimes it appears just after the mp_parser_died error, sometimes
> > the
> > >>>> parser error happens without 'triggering' one of this badmatch ones.
> > >>>>
> > >>>> Then, after a while of this sequence, the initially described
> > >>>> sel_conn_closed error starts raising for all requests and the node
> > gets
> > >>>> frozen. It is not responsive but it is still not removed from the
> > cluster,
> > >>>> holding its replications and, obviously, not replicating anything
> > until it
> > >>>> is restarted.
> > >>>>
> > >>>> I can also see interleaved unauthorized errors, which don't make
> much
> > >>>> sense as I'm the only one accessing this cluster
> > >>>>
> > >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
> > >>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are
> > not
> > >>>> authorized to access this db.">>} [{couch_db,open,2
> > >>>>
> > >>>>
> >
> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> > >>>>
> > >>>>
> > >>>> To me, it feels like the mp_parser_died error slowly breaks
> something
> > >>>> that in the end brings the node unresponsive, as those errors happen
> > a lot
> > >>>> in that particular replication.
> > >>>>
> > >>>> Regards and thanks a lot for your help!
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <wo...@apache.org>
> wrote:
> > >>>>
> > >>>>> Is there more to the error? All this shows us is that the
> replicator
> > >>>>> itself attempted a POST and had the connection closed on it.
> > (Remember
> > >>>>> that the replicator is basically just a custom client that sits
> > >>>>> alongside CouchDB on the same machine.) There should be more to the
> > >>>>> error log that shows why CouchDB hung up the phone.
> > >>>>>
> > >>>>> ----- Original Message -----
> > >>>>> From: "Carlos Alonso" <ca...@cabify.com>
> > >>>>> To: "user" <us...@couchdb.apache.org>
> > >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
> > >>>>> Subject: Re: Trying to understand why a node gets 'frozen'
> > >>>>>
> > >>>>> Hello, this is happening every day, always on the same node. Any
> > ideas?
> > >>>>>
> > >>>>> Thanks!
> > >>>>>
> > >>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <
> > >>>>> carlos.alonso@cabify.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>> > Hello everyone!!
> > >>>>> >
> > >>>>> > I'm trying to understand an issue we're experiencing on CouchDB
> > 2.1.0
> > >>>>> > running on Ubuntu 14.04. The cluster itself is currently
> > replicating
> > >>>>> from
> > >>>>> > another source cluster and we have seen that one node gets frozen
> > >>>>> from time
> > >>>>> > to time having to restart it to get it to respond again.
> > >>>>> >
> > >>>>> > Before getting unresponsive, the node throws a lot of {error,
> > >>>>> > sel_conn_closed}. See an example trace below.
> > >>>>> >
> > >>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1
> <0.13489.0>
> > >>>>> > -------- gen_server <0.13489.0> terminated with reason:
> > >>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
> > >>>>> > {'EXIT',{http_request_failed,\"POST\",\n
> > >>>>>  \"
> > >>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
> > >>>>> >        {error,sel_conn_closed}}}">>}
> > >>>>> >   last msg:
> > >>>>> {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on
> > >>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n
> > >>>>> >          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
> > >>>>> >                  {error,sel_conn_closed}}}">>}}
> > >>>>> >      state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"
> > >>>>> > https://source_ip/mydb/
> > >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> > >>>>> >
> > >>>>>
> >
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"
> > >>>>> > http://127.0.0.1:5984/mydb/
> > >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> > >>>>> >
> > >>>>>
> >
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
> > >>>>> >
> > >>>>> > The particular node is 'responsible' for a replication that has
> > >>>>> quite many
> > >>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug (
> > >>>>> > https://github.com/apache/couchdb/issues/745), but I don't know
> if
> > >>>>> that
> > >>>>> > may have any relationship.
> > >>>>> >
> > >>>>> > When that happens, just restarting the node brings it up and
> > running
> > >>>>> > properly.
> > >>>>> >
> > >>>>> > Any help would be really appreciated.
> > >>>>> >
> > >>>>> > Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Geoffrey Cox <re...@gmail.com>.
Hi Carlos, I wrote a post on monitoring CouchDB using Prometheus:
https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0

I’m not sure if it will provide all the metrics you need, but I hope this
helps.

Geoff
On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso <ca...@cabify.com>
wrote:

> I'd like to connect a diagnosing tool such as etop, observer, ... to see
> which processes are open there but I cannot seem to have it working.
>
> Could anyone please share how to run any of those tools on a remote server?
>
> Regards
>
> On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso <ca...@cabify.com>
> wrote:
>
> > So I could find another relevant symptom. After adding _system endpoint
> > monitoring I have discovered that the particular node has a different
> > behaviour than the other ones in terms of Erlang process count.
> >
> > The process_count metric of the normal nodes is stable around 1k to 1.3k
> > while the other node's process_count is slowly but continuously growing
> > until a little above than 5k processes that is when it gets 'frozen'.
> After
> > restarting the value comes back to the normal 1k to 1.3k (to immediately
> > start slowly growing again, of course :)).
> >
> > Any idea? Thanks!
> >
> > On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <ca...@cabify.com>
> > wrote:
> >
> >> This is one of the complete errors sequences I can see:
> >>
> >> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
> >> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1
> '
> >> with exit value:
> >>
> >>
> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
> >>
> >>
> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
> >>
> >> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
> >> aab326c0bb req_err(2515771787) badmatch : ok
> >>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> >> L295">>,<<"chttpd:handle_request_int/1
> L231">>,<<"mochiweb_http:headers/6
> >> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> >> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1
> <0.20718.207>
> >> -------- Replicator, request PUT to "
> >>
> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false
> "
> >> failed due to error {error,
> >>     {'EXIT',
> >>         {{{nocatch,{mp_parser_died,noproc}},
> >> ...
> >>
> >> Regards
> >>
> >> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <carlos.alonso@cabify.com
> >
> >> wrote:
> >>
> >>> The 'weird' thing about the mp_parser_died error is that, according to
> >>> the description of the issue 745, the replication never finishes as the
> >>> item that fails once, seems to fail forever, but in my case they fail,
> but
> >>> then they seem to work (possibly as the replication is retried), as I
> can
> >>> find the documents that generated the error (in the logs) in the target
> >>> db...
> >>>
> >>> Regards
> >>>
> >>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <
> carlos.alonso@cabify.com>
> >>> wrote:
> >>>
> >>>> So to give some more context this node is responsible for replicating
> a
> >>>> database that has quite many attachments and it raises the 'famous'
> >>>> mp_parser_died,noproc error, that I think is this one:
> >>>> https://github.com/apache/couchdb/issues/745
> >>>>
> >>>> What I've identified so far from the logs is that along with the error
> >>>> described above, also this error appears:
> >>>>
> >>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
> >>>> <0.30012.3408> 520e44b7ae req_err(2515771787)
> >>>> badmatch : ok
> >>>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> >>>> L295">>,<<"chttpd:handle_request_int/1
> L231">>,<<"mochiweb_http:headers/6
> >>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> >>>>
> >>>> Sometimes it appears just after the mp_parser_died error, sometimes
> the
> >>>> parser error happens without 'triggering' one of this badmatch ones.
> >>>>
> >>>> Then, after a while of this sequence, the initially described
> >>>> sel_conn_closed error starts raising for all requests and the node
> gets
> >>>> frozen. It is not responsive but it is still not removed from the
> cluster,
> >>>> holding its replications and, obviously, not replicating anything
> until it
> >>>> is restarted.
> >>>>
> >>>> I can also see interleaved unauthorized errors, which don't make much
> >>>> sense as I'm the only one accessing this cluster
> >>>>
> >>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
> >>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are
> not
> >>>> authorized to access this db.">>} [{couch_db,open,2
> >>>>
> >>>>
> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> >>>>
> >>>>
> >>>> To me, it feels like the mp_parser_died error slowly breaks something
> >>>> that in the end brings the node unresponsive, as those errors happen
> a lot
> >>>> in that particular replication.
> >>>>
> >>>> Regards and thanks a lot for your help!
> >>>>
> >>>>
> >>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <wo...@apache.org> wrote:
> >>>>
> >>>>> Is there more to the error? All this shows us is that the replicator
> >>>>> itself attempted a POST and had the connection closed on it.
> (Remember
> >>>>> that the replicator is basically just a custom client that sits
> >>>>> alongside CouchDB on the same machine.) There should be more to the
> >>>>> error log that shows why CouchDB hung up the phone.
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: "Carlos Alonso" <ca...@cabify.com>
> >>>>> To: "user" <us...@couchdb.apache.org>
> >>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
> >>>>> Subject: Re: Trying to understand why a node gets 'frozen'
> >>>>>
> >>>>> Hello, this is happening every day, always on the same node. Any
> ideas?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <
> >>>>> carlos.alonso@cabify.com>
> >>>>> wrote:
> >>>>>
> >>>>> > Hello everyone!!
> >>>>> >
> >>>>> > I'm trying to understand an issue we're experiencing on CouchDB
> 2.1.0
> >>>>> > running on Ubuntu 14.04. The cluster itself is currently
> replicating
> >>>>> from
> >>>>> > another source cluster and we have seen that one node gets frozen
> >>>>> from time
> >>>>> > to time having to restart it to get it to respond again.
> >>>>> >
> >>>>> > Before getting unresponsive, the node throws a lot of {error,
> >>>>> > sel_conn_closed}. See an example trace below.
> >>>>> >
> >>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0>
> >>>>> > -------- gen_server <0.13489.0> terminated with reason:
> >>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
> >>>>> > {'EXIT',{http_request_failed,\"POST\",\n
> >>>>>  \"
> >>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
> >>>>> >        {error,sel_conn_closed}}}">>}
> >>>>> >   last msg:
> >>>>> {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on
> >>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n
> >>>>> >          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
> >>>>> >                  {error,sel_conn_closed}}}">>}}
> >>>>> >      state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"
> >>>>> > https://source_ip/mydb/
> >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> >>>>> >
> >>>>>
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"
> >>>>> > http://127.0.0.1:5984/mydb/
> >>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
> >>>>> >
> >>>>>
> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
> >>>>> >
> >>>>> > The particular node is 'responsible' for a replication that has
> >>>>> quite many
> >>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug (
> >>>>> > https://github.com/apache/couchdb/issues/745), but I don't know if
> >>>>> that
> >>>>> > may have any relationship.
> >>>>> >
> >>>>> > When that happens, just restarting the node brings it up and
> running
> >>>>> > properly.
> >>>>> >
> >>>>> > Any help would be really appreciated.
> >>>>> >
> >>>>> > Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
I'd like to connect a diagnostic tool such as etop, observer, etc. to see
which processes are open there, but I cannot seem to get it working.

Could anyone please share how to run any of those tools on a remote server?

Regards
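
For reference, a minimal sketch of the remote-shell route; the node name and
cookie are placeholders and should be taken from the node's vm.args (matching
its -name/-sname naming mode):

    # attach a remote shell to the running CouchDB node
    erl -name debug2@127.0.0.1 -setcookie monster -remsh couchdb@couchdb-node-1

    %% inside the remote shell, everything runs on the CouchDB node:
    erlang:system_info(process_count).
    %% tally live processes by their initial call to see what is accumulating
    lists:foldl(fun(P, Acc) ->
        case erlang:process_info(P, initial_call) of
            {initial_call, MFA} -> maps:put(MFA, maps:get(MFA, Acc, 0) + 1, Acc);
            undefined -> Acc
        end
    end, #{}, erlang:processes()).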

On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso <ca...@cabify.com>
wrote:

> So I could find another relevant symptom. After adding _system endpoint
> monitoring I have discovered that the particular node has a different
> behaviour than the other ones in terms of Erlang process count.
>
> The process_count metric of the normal nodes is stable around 1k to 1.3k
> while the other node's process_count is slowly but continuously growing
> until a little above than 5k processes that is when it gets 'frozen'. After
> restarting the value comes back to the normal 1k to 1.3k (to immediately
> start slowly growing again, of course :)).
>
> Any idea? Thanks!
>
> On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <ca...@cabify.com>
> wrote:
>
>> This is one of the complete errors sequences I can see:
>>
>> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
>> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1'
>> with exit value:
>>
>> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
>>
>> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
>>
>> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
>> aab326c0bb req_err(2515771787) badmatch : ok
>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
>> L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207>
>> -------- Replicator, request PUT to "
>> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false"
>> failed due to error {error,
>>     {'EXIT',
>>         {{{nocatch,{mp_parser_died,noproc}},
>> ...
>>
>> Regards
>>
>> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <ca...@cabify.com>
>> wrote:
>>
>>> The 'weird' thing about the mp_parser_died error is that, according to
>>> the description of the issue 745, the replication never finishes as the
>>> item that fails once, seems to fail forever, but in my case they fail, but
>>> then they seem to work (possibly as the replication is retried), as I can
>>> find the documents that generated the error (in the logs) in the target
>>> db...
>>>
>>> Regards
>>>
>>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <ca...@cabify.com>
>>> wrote:
>>>
>>>> So to give some more context this node is responsible for replicating a
>>>> database that has quite many attachments and it raises the 'famous'
>>>> mp_parser_died,noproc error, that I think is this one:
>>>> https://github.com/apache/couchdb/issues/745
>>>>
>>>> What I've identified so far from the logs is that along with the error
>>>> described above, also this error appears:
>>>>
>>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
>>>> <0.30012.3408> 520e44b7ae req_err(2515771787)
>>>> badmatch : ok
>>>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
>>>> L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
>>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>>>
>>>> Sometimes it appears just after the mp_parser_died error, sometimes the
>>>> parser error happens without 'triggering' one of this badmatch ones.
>>>>
>>>> Then, after a while of this sequence, the initially described
>>>> sel_conn_closed error starts raising for all requests and the node gets
>>>> frozen. It is not responsive but it is still not removed from the cluster,
>>>> holding its replications and, obviously, not replicating anything until it
>>>> is restarted.
>>>>
>>>> I can also see interleaved unauthorized errors, which don't make much
>>>> sense as I'm the only one accessing this cluster
>>>>
>>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
>>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not
>>>> authorized to access this db.">>} [{couch_db,open,2
>>>>
>>>> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>>
>>>>
>>>> To me, it feels like the mp_parser_died error slowly breaks something
>>>> that in the end brings the node unresponsive, as those errors happen a lot
>>>> in that particular replication.
>>>>
>>>> Regards and thanks a lot for your help!
>>>>
>>>>
>>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <wo...@apache.org> wrote:
>>>>
>>>>> Is there more to the error? All this shows us is that the replicator
>>>>> itself attempted a POST and had the connection closed on it. (Remember
>>>>> that the replicator is basically just a custom client that sits
>>>>> alongside CouchDB on the same machine.) There should be more to the
>>>>> error log that shows why CouchDB hung up the phone.
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Carlos Alonso" <ca...@cabify.com>
>>>>> To: "user" <us...@couchdb.apache.org>
>>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
>>>>> Subject: Re: Trying to understand why a node gets 'frozen'
>>>>>
>>>>> Hello, this is happening every day, always on the same node. Any ideas?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <
>>>>> carlos.alonso@cabify.com>
>>>>> wrote:
>>>>>
>>>>> > Hello everyone!!
>>>>> >
>>>>> > I'm trying to understand an issue we're experiencing on CouchDB 2.1.0
>>>>> > running on Ubuntu 14.04. The cluster itself is currently replicating
>>>>> from
>>>>> > another source cluster and we have seen that one node gets frozen
>>>>> from time
>>>>> > to time having to restart it to get it to respond again.
>>>>> >
>>>>> > Before getting unresponsive, the node throws a lot of {error,
>>>>> > sel_conn_closed}. See an example trace below.
>>>>> >
>>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0>
>>>>> > -------- gen_server <0.13489.0> terminated with reason:
>>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
>>>>> > {'EXIT',{http_request_failed,\"POST\",\n
>>>>>  \"
>>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
>>>>> >        {error,sel_conn_closed}}}">>}
>>>>> >   last msg:
>>>>> {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure on
>>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n
>>>>> >          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
>>>>> >                  {error,sel_conn_closed}}}">>}}
>>>>> >      state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"
>>>>> > https://source_ip/mydb/
>>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
>>>>> >
>>>>> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"
>>>>> > http://127.0.0.1:5984/mydb/
>>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
>>>>> >
>>>>> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
>>>>> >
>>>>> > The particular node is 'responsible' for a replication that has
>>>>> quite many
>>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug (
>>>>> > https://github.com/apache/couchdb/issues/745), but I don't know if
>>>>> that
>>>>> > may have any relationship.
>>>>> >
>>>>> > When that happens, just restarting the node brings it up and running
>>>>> > properly.
>>>>> >
>>>>> > Any help would be really appreciated.
>>>>> >
>>>>> > Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
So I have found another relevant symptom. After adding _system endpoint
monitoring I have discovered that this particular node behaves differently
from the other ones in terms of Erlang process count.

The process_count metric of the normal nodes is stable at around 1k to 1.3k,
while the affected node's process_count grows slowly but continuously until a
little above 5k processes, which is when it gets 'frozen'. After restarting,
the value comes back to the normal 1k to 1.3k (and immediately starts growing
slowly again, of course :)).

Any idea? Thanks!
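
For reference, a minimal sketch of how that _system polling can be scripted so
the growth is easy to graph or alert on. The endpoint path and the missing
credentials are assumptions: depending on the 2.x release the stats are served
from the node-local port (e.g. http://127.0.0.1:5986/_system) or from
/_node/<nodename>/_system on port 5984, and an admin login may be required.

    # poll process_count once a minute; adjust URL/credentials to the deployment
    while true; do
      printf '%s ' "$(date -u +%FT%TZ)"
      curl -s http://127.0.0.1:5986/_system | grep -o '"process_count":[0-9]*'
      sleep 60
    done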

On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <ca...@cabify.com>
wrote:

> This is one of the complete errors sequences I can see:
>
> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1'
> with exit value:
>
> {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
>
> ]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
>
> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
> aab326c0bb req_err(2515771787) badmatch : ok
>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
> L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207>
> -------- Replicator, request PUT to "
> http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false"
> failed due to error {error,
>     {'EXIT',
>         {{{nocatch,{mp_parser_died,noproc}},
> ...
>
> Regards
>
> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <ca...@cabify.com>
> wrote:
>
>> The 'weird' thing about the mp_parser_died error is that, according to
>> the description of the issue 745, the replication never finishes as the
>> item that fails once, seems to fail forever, but in my case they fail, but
>> then they seem to work (possibly as the replication is retried), as I can
>> find the documents that generated the error (in the logs) in the target
>> db...
>>
>> Regards
>>
>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <ca...@cabify.com>
>> wrote:
>>
>>> So to give some more context this node is responsible for replicating a
>>> database that has quite many attachments and it raises the 'famous'
>>> mp_parser_died,noproc error, that I think is this one:
>>> https://github.com/apache/couchdb/issues/745
>>>
>>> What I've identified so far from the logs is that along with the error
>>> described above, also this error appears:
>>>
>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
>>> <0.30012.3408> 520e44b7ae req_err(2515771787)
>>> badmatch : ok
>>>     [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
>>> L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
>>> L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
>>>
>>> Sometimes it appears just after the mp_parser_died error, sometimes the
>>> parser error happens without 'triggering' one of this badmatch ones.
>>>
>>> Then, after a while of this sequence, the initially described
>>> sel_conn_closed error starts raising for all requests and the node gets
>>> frozen. It is not responsive but it is still not removed from the cluster,
>>> holding its replications and, obviously, not replicating anything until it
>>> is restarted.
>>>
>>> I can also see interleaved unauthorized errors, which don't make much
>>> sense as I'm the only one accessing this cluster
>>>
>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
>>> <0.32501.3323> c683120c97 rexi_server throw:{unauthorized,<<"You are not
>>> authorized to access this db.">>} [{couch_db,open,2
>>>
>>> ,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>
>>>
>>> To me, it feels like the mp_parser_died error slowly breaks something
>>> that in the end brings the node unresponsive, as those errors happen a lot
>>> in that particular replication.
>>>
>>> Regards and thanks a lot for your help!
>>>
>>>
>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <wo...@apache.org> wrote:
>>>
>>>> Is there more to the error? All this shows us is that the replicator
>>>> itself attempted a POST and had the connection closed on it. (Remember
>>>> that the replicator is basically just a custom client that sits
>>>> alongside CouchDB on the same machine.) There should be more to the
>>>> error log that shows why CouchDB hung up the phone.
>>>>
>>>> ----- Original Message -----
>>>> From: "Carlos Alonso" <ca...@cabify.com>
>>>> To: "user" <us...@couchdb.apache.org>
>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
>>>> Subject: Re: Trying to understand why a node gets 'frozen'
>>>>
>>>> Hello, this is happening every day, always on the same node. Any ideas?
>>>>
>>>> Thanks!
>>>>
>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <carlos.alonso@cabify.com
>>>> >
>>>> wrote:
>>>>
>>>> > Hello everyone!!
>>>> >
>>>> > I'm trying to understand an issue we're experiencing on CouchDB 2.1.0
>>>> > running on Ubuntu 14.04. The cluster itself is currently replicating
>>>> from
>>>> > another source cluster and we have seen that one node gets frozen
>>>> from time
>>>> > to time having to restart it to get it to respond again.
>>>> >
>>>> > Before getting unresponsive, the node throws a lot of {error,
>>>> > sel_conn_closed}. See an example trace below.
>>>> >
>>>> > [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0>
>>>> > -------- gen_server <0.13489.0> terminated with reason:
>>>> > {checkpoint_commit_failure,<<"Failure on target commit:
>>>> > {'EXIT',{http_request_failed,\"POST\",\n
>>>>  \"
>>>> > http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
>>>> >        {error,sel_conn_closed}}}">>}
>>>> >   last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure
>>>> on
>>>> > target commit: {'EXIT',{http_request_failed,\"POST\",\n
>>>> >          \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",\n
>>>> >                  {error,sel_conn_closed}}}">>}}
>>>> >      state: {state,<0.10626.0>,<0.13490.0>,20,{httpdb,"
>>>> > https://source_ip/mydb/
>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
>>>> >
>>>> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},{ssl_options,[{depth,3},{verify,verify_none}]}],10,250,<0.11931.0>,20,nil,undefined},{httpdb,"
>>>> > http://127.0.0.1:5984/mydb/
>>>> ",nil,[{"Accept","application/json"},{"Authorization","Basic
>>>> >
>>>> ..."},{"User-Agent","CouchDB-Replicator/2.1.0"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,<0.11995.0>,20,nil,undefined},[],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},[{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,{batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",\"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],605}}
>>>> >
>>>> > The particular node is 'responsible' for a replication that has quite
>>>> many
>>>> > {mp_parser_died,noproc} errors, which AFAIK is a known bug (
>>>> > https://github.com/apache/couchdb/issues/745), but I don't know if
>>>> that
>>>> > may have any relationship.
>>>> >
>>>> > When that happens, just restarting the node brings it up and running
>>>> > properly.
>>>> >
>>>> > Any help would be really appreciated.
>>>> >
>>>> > Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
This is one of the complete error sequences I can see:

[error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
-------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1'
with exit value:
{{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}
]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}

[error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
aab326c0bb req_err(2515771787) badmatch : ok
    [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]
[error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207>
-------- Replicator, request PUT to "
http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false"
failed due to error {error,
    {'EXIT',
        {{{nocatch,{mp_parser_died,noproc}},
...
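
For what it's worth, a rough way to check whether the multipart transfer of one
of these documents can complete at all is to pull it from the source with its
attachments, roughly the way the replicator does. A minimal Python sketch
(placeholder source URL and credentials; the doc id is the one from the failed
PUT above):

import requests

SOURCE = "https://source_ip/my_db"           # placeholder source database URL
DOC_ID = "de45a832a1fac563c89da73dc7dc4d3e"  # doc id from the failed PUT above
AUTH = ("user", "password")                  # placeholder credentials

# Ask the source for the document plus attachments as a multipart/related body,
# which is roughly what the replicator consumes, and stream it to disk.
resp = requests.get(
    f"{SOURCE}/{DOC_ID}",
    params={"attachments": "true", "revs": "true"},
    headers={"Accept": "multipart/related"},
    auth=AUTH,
    stream=True,
    verify=False,  # the source in the traces is contacted with verify_none
)
print(resp.status_code, resp.headers.get("Content-Type"))

with open(f"{DOC_ID}.multipart", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        fh.write(chunk)
print("downloaded", DOC_ID)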

Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
The 'weird' thing about the mp_parser_died error is that, according to the
description of issue 745, the replication never finishes because an item
that fails once seems to fail forever. In my case, however, the items fail
but then seem to go through (possibly because the replication is retried):
I can find the documents that generated the errors (according to the logs)
in the target db...
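
For reference, this is roughly how that can be spot-checked: compare a document
that showed up in one of the error lines against the target. A minimal Python
sketch with placeholder URLs, credentials and doc id:

import requests

SOURCE = "https://source_ip/my_db"           # placeholder source database URL
TARGET = "http://127.0.0.1:5984/my_db"       # local target database, as in the logs
DOC_ID = "de45a832a1fac563c89da73dc7dc4d3e"  # placeholder doc id from an error line
AUTH = ("user", "password")                  # placeholder credentials


def head_rev(base, doc_id):
    """Return the current rev of doc_id (taken from the ETag), or None if missing."""
    r = requests.head(f"{base}/{doc_id}", auth=AUTH, verify=False)
    if r.status_code == 404:
        return None
    r.raise_for_status()
    return r.headers["ETag"].strip('"')


print("source rev:", head_rev(SOURCE, DOC_ID))
tgt_rev = head_rev(TARGET, DOC_ID)
print("target rev:", tgt_rev)
print("replicated" if tgt_rev is not None else "still missing on target")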

Regards

Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
So, to give some more context: this node is responsible for replicating a
database that has quite a lot of attachments, and that replication raises
the 'famous' mp_parser_died,noproc error, which I think is this one:
https://github.com/apache/couchdb/issues/745
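
As an aside, which node is carrying which replication can be listed from
_active_tasks (clustered tasks normally report the node they run on). A rough
Python sketch, assuming admin credentials and a placeholder endpoint:

import collections
import requests

COUCH = "http://127.0.0.1:5984"  # placeholder cluster endpoint
AUTH = ("admin", "password")     # placeholder admin credentials

# Group the currently running replication tasks by the node each one reports.
tasks = requests.get(f"{COUCH}/_active_tasks", auth=AUTH).json()
by_node = collections.defaultdict(list)
for task in tasks:
    if task.get("type") == "replication":
        by_node[task.get("node", "unknown")].append(task)

for node, reps in by_node.items():
    print(node)
    for task in reps:
        rep_id = task.get("doc_id") or task.get("replication_id")
        print(f"  {rep_id}: {task.get('source')} -> {task.get('target')}")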

What I've identified so far from the logs is that, along with the error
described above, this error also appears:

[error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1 <0.30012.3408>
520e44b7ae req_err(2515771787) badmatch : ok
    [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1
L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6
L91">>,<<"proc_lib:init_p_do_apply/3 L240">>]

Sometimes it appears just after the mp_parser_died error; sometimes the
parser error happens without 'triggering' one of these badmatch errors.

Then, after a while of this sequence, the initially described
sel_conn_closed error starts being raised for every request and the node
gets frozen. It is not responsive, but it is still not removed from the
cluster, so it holds on to its replications and, obviously, replicates
nothing until it is restarted.
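
In case it helps with detection, the frozen state can at least be caught early
by polling each node's /_up health endpoint; /_membership from a node that still
answers will keep listing the stuck one, matching what is described above. A
rough Python sketch with placeholder node addresses and credentials:

import requests

NODES = [
    "http://couchdb-node-1:5984",  # placeholder node addresses
    "http://couchdb-node-2:5984",
]
AUTH = ("admin", "password")       # placeholder admin credentials

# Poll each node's health endpoint; an unresponsive node shows up as a timeout
# here long before anyone notices the stalled replications.
for node in NODES:
    try:
        r = requests.get(f"{node}/_up", auth=AUTH, timeout=5)
        status = r.json().get("status", "unknown") if r.ok else f"http {r.status_code}"
    except requests.RequestException as exc:
        status = f"unreachable ({exc.__class__.__name__})"
    print(f"{node}: {status}")

# The membership view from a node that still answers keeps listing the stuck
# node, which matches the observation that it is never removed from the cluster.
print(requests.get(f"{NODES[0]}/_membership", auth=AUTH, timeout=5).json())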

I can also see interleaved unauthorized errors, which don't make much
sense, as I'm the only one accessing this cluster:

[error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1 <0.32501.3323>
c683120c97 rexi_server throw:{unauthorized,<<"You are not authorized to
access this db.">>} [{couch_db,open,2
,[{file,"src/couch_db.erl"},{line,99}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]


To me, it feels like the mp_parser_died error slowly breaks something that
in the end leaves the node unresponsive, as those errors happen a lot in
that particular replication.
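
If something is indeed leaking slowly, it may show up in the per-node _system
statistics (process counts, memory, message queue sizes) sampled over time
before a freeze. A rough Python sketch, assuming the /_node/<name>/_system
endpoint is available on this build, with placeholder host and credentials:

import time
import requests

COUCH = "http://127.0.0.1:5984"  # placeholder: clustered port of the affected node
NODE = "couchdb@couchdb-node-1"  # node name as it appears in the log lines
AUTH = ("admin", "password")     # placeholder admin credentials

# Take a few samples of the node-level statistics; a steady climb in process
# count or memory between mp_parser_died bursts would back up the "slowly
# breaks something" theory.
for _ in range(10):
    stats = requests.get(f"{COUCH}/_node/{NODE}/_system", auth=AUTH, timeout=5).json()
    memory = stats.get("memory", {})
    print(
        time.strftime("%H:%M:%S"),
        "process_count:", stats.get("process_count"),
        "memory.processes:", memory.get("processes"),
        "memory.binary:", memory.get("binary"),
    )
    time.sleep(60)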

Regards and thanks a lot for your help!


Re: Trying to understand why a node gets 'frozen'

Posted by Joan Touzet <wo...@apache.org>.
Is there more to the error? All this shows us is that the replicator
itself attempted a POST and had the connection closed on it. (Remember
that the replicator is basically just a custom client that sits
alongside CouchDB on the same machine.) There should be more to the
error log that shows why CouchDB hung up the phone.
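
For reference, that extra detail can usually be captured by temporarily raising
the log level on the affected node through the per-node config API. A rough
Python sketch with placeholder credentials and the node name from the log lines:

import requests

COUCH = "http://127.0.0.1:5984"  # placeholder: the affected node
NODE = "couchdb@couchdb-node-1"  # node name as it appears in the log lines
AUTH = ("admin", "password")     # placeholder admin credentials

cfg = f"{COUCH}/_node/{NODE}/_config/log/level"

old_level = requests.get(cfg, auth=AUTH).json()
print("current level:", old_level)

# The config API takes a JSON-encoded string as the new value and returns the old one.
requests.put(cfg, json="debug", auth=AUTH).raise_for_status()

# ... wait for the next sel_conn_closed episode, collect the logs, then restore:
requests.put(cfg, json=old_level, auth=AUTH).raise_for_status()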

----- Original Message -----
From: "Carlos Alonso" <ca...@cabify.com>
To: "user" <us...@couchdb.apache.org>
Sent: Tuesday, 3 October, 2017 4:18:18 AM
Subject: Re: Trying to understand why a node gets 'frozen'

Hello, this is happening every day, always on the same node. Any ideas?

Thanks!


Re: Trying to understand why a node gets 'frozen'

Posted by Carlos Alonso <ca...@cabify.com>.
Hello, this is happening every day, always on the same node. Any ideas?

Thanks!
