Posted to dev@couchdb.apache.org by "Adam Kocoloski (JIRA)" <ji...@apache.org> on 2012/12/02 05:53:58 UTC

[jira] [Commented] (COUCHDB-1346) CouchDB hangs during start of view indexing

    [ https://issues.apache.org/jira/browse/COUCHDB-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508146#comment-13508146 ] 

Adam Kocoloski commented on COUCHDB-1346:
-----------------------------------------

I took a closer look at this ticket today and came up with the patch below to ensure that view index files are closed before we attempt to nuke the directory when executing a reset_indexes call.  [~dch] helped me load the patch on a test Windows instance on EC2, and we confirmed that the patch allows the "basics" entry in the test suite to run to completion, which it had not done before.  So, yay for that.

Unfortunately, we're noticing frequent hangs in other portions of the test suite, including but not limited to the "design_options" test.  My observations from an evening running tests:

* The design_options test always passes with the log level set to "debug".
* When the level is set to "info" the test often hangs, and the Erlang VM seems to hang as well.
* The hang often occurs on the second invocation of the test.
* The hang occurs with _and_ without my patch.

That last point is rather crucial.  [~dch] indicated that the test had been passing, but I can't seem to make that happen even with a stock build of the HEAD of 1.3.x.  My current recommendation is to review and apply this patch, since in my testing it only improves matters and never makes them worse.  That being said, I'd love to understand the root cause of these hangs when debug logging is disabled.

{code:diff}
diff --git a/src/couch_index/src/couch_index_server.erl b/src/couch_index/src/couch_index_server.erl
index 48fa8e4..bc1fce7 100644
--- a/src/couch_index/src/couch_index_server.erl
+++ b/src/couch_index/src/couch_index_server.erl
@@ -160,7 +160,9 @@ reset_indexes(DbName, Root) ->
     % shutdown all the updaters and clear the files, the db got changed
     Fun = fun({_, {DDocId, Sig}}) ->
         [{_, Pid}] = ets:lookup(?BY_SIG, {DbName, Sig}),
-        couch_util:shutdown_sync(Pid),
+        MRef = erlang:monitor(process, Pid),
+        gen_server:cast(Pid, delete),
+        receive {'DOWN', MRef, _, _, _} -> ok end,
         rem_from_ets(DbName, Sig, DDocId, Pid)
     end,
     lists:foreach(Fun, ets:lookup(?BY_DB, DbName)),
{code}
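
For reference, the monitor-then-wait pattern that the patch relies on can be sketched in isolation. This is only an illustration and not part of the patch: the module name wait_for_exit, the function stop_and_wait/3 and the timeout argument are assumptions made for the example. The patch itself waits for the 'DOWN' message without a timeout.

```erlang
-module(wait_for_exit).
-export([stop_and_wait/3, demo/0]).

%% Hypothetical helper: ask Pid to stop by casting StopMsg to it, then
%% block until the process has really exited or Timeout (ms) elapses.
stop_and_wait(Pid, StopMsg, Timeout) ->
    %% Monitor first, so the exit cannot be missed. If Pid is already
    %% dead, a 'DOWN' message with reason noproc arrives immediately.
    MRef = erlang:monitor(process, Pid),
    gen_server:cast(Pid, StopMsg),
    receive
        {'DOWN', MRef, process, Pid, Reason} ->
            {ok, Reason}
    after Timeout ->
        %% Give up, and flush any 'DOWN' message that raced with us.
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.

%% A stand-in process that exits when it receives the cast. A real
%% index process would close its files before terminating.
demo() ->
    Pid = spawn(fun() -> receive {'$gen_cast', delete} -> ok end end),
    {ok, normal} = stop_and_wait(Pid, delete, 5000),
    ok.
```

Because the caller only continues once the 'DOWN' message has arrived, everything the target process owned has been released by the time the directory is removed, assuming the process is what held the files open. Whether a timeout is appropriate inside reset_indexes is a separate design question that this sketch does not try to answer.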
                
> CouchDB hangs during start of view indexing
> -------------------------------------------
>
>                 Key: COUCHDB-1346
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1346
>             Project: CouchDB
>          Issue Type: Bug
>          Components: View Server Support
>    Affects Versions: 1.3
>         Environment: Windows 7 Enterprise only, not able to replicate on Mac OS X.
> Erlang R14B03 + crypto patches.
> Mozilla Javascript 1.8.5
>            Reporter: Dave Cottlehuber
>            Assignee: Adam Kocoloski
>            Priority: Blocker
>              Labels: Windows
>             Fix For: 1.3
>
>
> [info] [<0.20499.0>] Opening index for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc sig: "d5c87ad33242b181f86be2139cbccd96"
> [info] [<0.20504.0>] Starting index update for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc
> [info] [<0.20334.0>] 172.16.40.1 - - POST /test_suite_db/_temp_view 500
> [info] [<0.20513.0>] 172.16.40.1 - - GET /_utils/couch_tests.html?script/couch_tests.js 200
> [info] [<0.20514.0>] 172.16.40.1 - - GET /_utils/index.html 200
> [info] [<0.20060.0>] 172.16.40.1 - - DELETE /test_suite_db_a/ 200
> [info] [<0.20407.0>] 172.16.40.1 - - GET /test_suite_reports/ 404
> [info] [<0.20058.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20071.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20069.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20484.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20364.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20062.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20388.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20345.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20072.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20059.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20061.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
> [info] [<0.20472.0>] 172.16.40.1 - - DELETE /test_suite_db/ 200
> [error] [<0.20050.0>] ** Generic server couch_index_server terminating 
> ** Last message in was {'$gen_cast',{reset_indexes,<<"test_suite_db">>}}
> ** When Server state == {st,"../var/lib/couchdb"}
> ** Reason for termination == 
> ** {{case_clause,{error,eacces}},
>     [{couch_file,'-nuke_dir/2-fun-0-',3},
>      {lists,foreach,2},
>      {couch_file,nuke_dir,2},
>      {couch_index_server,handle_cast,2},
>      {gen_server,handle_msg,5},
>      {proc_lib,init_p_do_apply,3}]}
> =ERROR REPORT==== 23-Nov-2011::21:17:14 ===
> ** Generic server couch_index_server terminating 
> ** Last message in was {'$gen_cast',{reset_indexes,<<"test_suite_db">>}}
> ** When Server state == {st,"../var/lib/couchdb"}
> ** Reason for termination == 
> ** {{case_clause,{error,eacces}},
>     [{couch_file,'-nuke_dir/2-fun-0-',3},
>      {lists,foreach,2},
>      {couch_file,nuke_dir,2},
>      {couch_index_server,handle_cast,2},
>      {gen_server,handle_msg,5},
>      {proc_lib,init_p_do_apply,3}]}
> [error] [<0.20050.0>] {error_report,<0.19957.0>,
>                           {<0.20050.0>,crash_report,
>                            [[{initial_call,
>                                  {couch_index_server,init,['Argument__1']}},
>                              {pid,<0.20050.0>},
>                              {registered_name,couch_index_server},
>                              {error_info,
>                                  {exit,
>                                      {{case_clause,{error,eacces}},
>                                       [{couch_file,'-nuke_dir/2-fun-0-',3},
>                                        {lists,foreach,2},
>                                        {couch_file,nuke_dir,2},
>                                        {couch_index_server,handle_cast,2},
>                                        {gen_server,handle_msg,5},
>                                        {proc_lib,init_p_do_apply,3}]},
>                                      [{gen_server,terminate,6},
>                                       {proc_lib,init_p_do_apply,3}]}},
>                              {ancestors,
>                                  [couch_secondary_services,couch_server_sup,
>                                   <0.19958.0>]},
>                              {messages,
>                                  [{'$gen_cast',
>                                       {reset_indexes,<<"test_suite_db_a">>}}]},
>                              {links,[<0.20051.0>,<0.20026.0>]},
>                              {dictionary,[]},
>                              {trap_exit,true},
>                              {status,running},
>                              {heap_size,1597},
>                              {stack_size,24},
>                              {reductions,12211}],
>                             [{neighbour,
>                                  [{pid,<0.20051.0>},
>                                   {registered_name,[]},
>                                   {initial_call,
>                                       {couch_event_sup,init,['Argument__1']}},
>                                   {current_function,{gen_server,loop,6}},
>                                   {ancestors,
>                                       [couch_index_server,
>                                        couch_secondary_services,
>                                        couch_server_sup,<0.19958.0>]},
>                                   {messages,[]},
>                                   {links,[<0.20050.0>,<0.20018.0>]},
>                                   {dictionary,[]},
>                                   {trap_exit,false},
>                                   {status,waiting},
>                                   {heap_size,233},
>                                   {stack_size,9},
>                                   {reductions,32}]}]]}}
> =CRASH REPORT==== 23-Nov-2011::21:17:14 ===
>   crasher:
>     initial call: couch_index_server:init/1
>     pid: <0.20050.0>
>     registered_name: couch_index_server
>     exception exit: {{case_clause,{error,eacces}},
>                      [{couch_file,'-nuke_dir/2-fun-0-',3},
>                       {lists,foreach,2},
>                       {couch_file,nuke_dir,2},
>                       {couch_index_server,handle_cast,2},
>                       {gen_server,handle_msg,5},
>                       {proc_lib,init_p_do_apply,3}]}
>       in function  gen_server:terminate/6
>     ancestors: [couch_secondary_services,couch_server_sup,<0.19958.0>]
>     messages: [{'$gen_cast',{reset_indexes,<<"test_suite_db_a">>}}]
>     links: [<0.20051.0>,<0.20026.0>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 1597
>     stack_size: 24
>     reductions: 12211
>   neighbours:
>     neighbour: [{pid,<0.20051.0>},
>                   {registered_name,[]},
>                   {initial_call,{couch_event_sup,init,['Argument__1']}},
>                   {current_function,{gen_server,loop,6}},
>                   {ancestors,[couch_index_server,couch_secondary_services,
>                               couch_server_sup,<0.19958.0>]},
>                   {messages,[]},
>                   {links,[<0.20050.0>,<0.20018.0>]},
>                   {dictionary,[]},
>                   {trap_exit,false},
>                   {status,waiting},
>                   {heap_size,233},
>                   {stack_size,9},
>                   {reductions,32}]
> [error] [<0.20026.0>] {error_report,<0.19957.0>,
>                           {<0.20026.0>,supervisor_report,
>                            [{supervisor,{local,couch_secondary_services}},
>                             {errorContext,child_terminated},
>                             {reason,
>                                 {{case_clause,{error,eacces}},
>                                  [{couch_file,'-nuke_dir/2-fun-0-',3},
>                                   {lists,foreach,2},
>                                   {couch_file,nuke_dir,2},
>                                   {couch_index_server,handle_cast,2},
>                                   {gen_server,handle_msg,5},
>                                   {proc_lib,init_p_do_apply,3}]}},
>                             {offender,
>                                 [{pid,<0.20050.0>},
>                                  {name,index_server},
>                                  {mfargs,{couch_index_server,start_link,[]}},
>                                  {restart_type,permanent},
>                                  {shutdown,brutal_kill},
>                                  {child_type,worker}]}]}}
> OS process tree at this time is:
> Process information for SENDAI:
> Name                             Pid Pri Thd  Hnd      VM      WS    Priv
> Idle                               0   0   2    0       0      24       0
>   System                           4   8  79  477    3380     304     108
> explorer                        1984   8  21  664  213732   46340   21540
>   cmd                           2104   8   1   25   48132    3304    2144
>     pslist                      2776  13   1  133   63584    4976    2000
>   cmd                           2504   8   1   26   44980    3512    3012
>     werl                        2680   8  16  390  196232   40064   28628
>       win32sysinfo              1152   8   1   21   12624    2124     640
>       couchspawnkillable        1444   8   1   30   12992    2284     688
>         couchjs                 1468   8   1   39   55900    6572    4056
>       couchspawnkillable        2740   8   1   30   12992    2280     684
>         couchjs                 2756   8   1   39   55900    7108    4444
> Erlang resumes running CouchDB when couchjs procs are terminated with extreme
> prejudice. The hang still occurs after reverting fdmanana's COUCHDB-1334
> commit. This could be a race condition during invalidation of the views, and
> subsequent deletion of the related ddoc view directory prior to reindexing.
> On Windows a filesystem object cannot be deleted if there are open handles
> remaining.
