Posted to dev@couchdb.apache.org by "Randall Leeds (JIRA)" <ji...@apache.org> on 2010/05/14 01:01:45 UTC

[jira] Created: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Timeouts in couch_log are masked, crashes callers
-------------------------------------------------

                 Key: COUCHDB-761
                 URL: https://issues.apache.org/jira/browse/COUCHDB-761
             Project: CouchDB
          Issue Type: Bug
          Components: Database Core
    Affects Versions: 0.11, 0.10.2, 0.10.1
            Reporter: Randall Leeds
             Fix For: 0.10.3, 0.11.1, 1.0


Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.

After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.

The logging macros (?LOG_*) call couch_log:*_on, which in turn calls get_level_integer/0. When this call times out, the timeout is swallowed and the late reply arrives in the calling process afterward, triggering the crash.

Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
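
The mechanism described above can be reproduced in miniature. The following is only a sketch: it uses a hand-rolled call protocol rather than gen_server itself, and the module name, the slow_server/0 process, and the level value 2 are all made up for illustration. It shows how a reply tagged with a reference can land in the caller's mailbox after the caller has already given up waiting.

```erlang
-module(late_reply_demo).
-export([run/0]).

%% Hypothetical slow server: answers a level query only after a delay.
slow_server() ->
    receive
        {get_level, From, Ref} ->
            timer:sleep(200),
            From ! {Ref, 2},
            slow_server()
    end.

%% Hand-rolled call with a timeout. The reply is tagged with a fresh
%% reference, similar in spirit to the {Ref, Reply} shape in the report.
call(Server, Timeout) ->
    Ref = make_ref(),
    Server ! {get_level, self(), Ref},
    receive
        {Ref, Level} -> {ok, Level}
    after Timeout ->
        timeout
    end.

run() ->
    Server = spawn(fun slow_server/0),
    %% The caller gives up after 50 ms and carries on.
    timeout = call(Server, 50),
    %% The reply still arrives later as an ordinary message.
    Late = receive
               {R, L} when is_reference(R), is_integer(L) -> {late_reply, L}
           after 1000 ->
               none
           end,
    exit(Server, kill),
    Late.
```

In a gen_server, such a stray message would be delivered to handle_info/2, and a callback module with no matching clause would fail with function_clause, which matches the crash reports described in this ticket.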

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski closed COUCHDB-761.
----------------------------------

    Resolution: Fixed

Good idea, Randall.  I added the include_sasl to default.ini in trunk, but I don't think we necessarily need to backport it.


[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874885#action_12874885 ] 

Randall Leeds commented on COUCHDB-761:
---------------------------------------

Had this in production for about a week now on a bunch of servers. Seems to fix the timeout problem mentioned above. I'd appreciate if some brave soul would apply this patch and give their server a beating to be sure everything looks okay. Some of the servers where this is running here are under crazy load and nothing seems broken, but I like second opinions, especially on things that could have broad performance implications under stress.


[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878394#action_12878394 ] 

Adam Kocoloski commented on COUCHDB-761:
----------------------------------------

I needed one more patch for get_level_integer() to get make check running, since some of the tests call couch code that tries to log when couch_log is not running.  I've inlined it below.  I've committed on trunk and backported to 0.10.x.  Waiting on 0.11.x because Jan has a monster fix for that branch in the works.


diff --git a/src/couchdb/couch_log.erl b/src/couchdb/couch_log.erl
index 5c8a5e5..2d62cbb 100644
--- a/src/couchdb/couch_log.erl
+++ b/src/couchdb/couch_log.erl
@@ -81,7 +81,11 @@ get_level() ->
     level_atom(get_level_integer()).
 
 get_level_integer() ->
-    ets:lookup_element(?MODULE, level, 2).
+    try
+        ets:lookup_element(?MODULE, level, 2)
+    catch error:badarg ->
+        ?LEVEL_ERROR
+    end.
 
 set_level_integer(Int) ->
     gen_event:call(error_logger, couch_log, {set_level_integer, Int}).
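
The effect of the try/catch in the diff above can be checked in isolation. This is a minimal sketch, not the couch_log code: the table name and the numeric value of ?LEVEL_ERROR are assumptions chosen for the example. It relies on the fact that ets:lookup_element/3 raises badarg when the named table does not exist.

```erlang
-module(level_fallback_demo).
-export([get_level_integer/0, run/0]).

-define(LEVEL_ERROR, 3).                  %% assumed value, for illustration only
-define(TABLE, level_fallback_demo_tab).  %% hypothetical table name

%% Same shape as the patched function: fall back to the error level
%% when the table (and hence the logger) is not available.
get_level_integer() ->
    try
        ets:lookup_element(?TABLE, level, 2)
    catch error:badarg ->
        ?LEVEL_ERROR
    end.

run() ->
    %% The table does not exist yet, so the fallback is used.
    Before = get_level_integer(),
    ?TABLE = ets:new(?TABLE, [named_table, protected]),
    true = ets:insert(?TABLE, {level, 1}),
    %% Now the stored level is returned.
    After = get_level_integer(),
    true = ets:delete(?TABLE),
    {Before, After}.
```

With the assumed values, run/0 returns {3, 1}: the fallback level before the table exists and the stored level afterward.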



[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868629#action_12868629 ] 

Adam Kocoloski commented on COUCHDB-761:
----------------------------------------

Oh, we definitely have to update the handle_event function in couch_log.  For example, instead of

handle_event({error_report, _, {Pid, couch_error, {Format, Args}}}, {Fd, _LogLevel}=State) -> ...

it would be

handle_event({couch, error, Msg}, {Fd, _LogLevel}=State) -> ...

couch_log already has a separate clause for SASL error messages, so no worries there.

error_logger won't ignore messages in this format; it'll forward them to the various handlers (including couch_log) just fine.
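
To illustrate the point about forwarding, here is a sketch of a gen_event handler that accepts both a simplified {couch, error, Msg} event and a report-shaped event. It is an assumption-laden toy: it runs against a private gen_event manager standing in for error_logger, the module name is made up, and the handler state is just a list of what was seen rather than a file descriptor and log level.

```erlang
-module(two_formats_demo).
-behaviour(gen_event).
-export([run/0]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

init([]) -> {ok, []}.

%% Simplified event, as it might be sent by the log macros.
handle_event({couch, error, Msg}, Acc) ->
    {ok, [{couch, lists:flatten(Msg)} | Acc]};
%% Event shaped like an error_report, as used for SASL-style reports.
handle_event({error_report, _GL, {_Pid, Type, Report}}, Acc) ->
    {ok, [{report, Type, Report} | Acc]};
%% Anything else is ignored by this handler.
handle_event(_Other, Acc) ->
    {ok, Acc}.

handle_call(seen, Acc) -> {ok, lists:reverse(Acc), Acc}.
handle_info(_Msg, Acc) -> {ok, Acc}.
terminate(_Reason, _Acc) -> ok.
code_change(_OldVsn, Acc, _Extra) -> {ok, Acc}.

run() ->
    %% A private event manager stands in for error_logger here.
    {ok, Mgr} = gen_event:start_link(),
    ok = gen_event:add_handler(Mgr, ?MODULE, []),
    ok = gen_event:sync_notify(Mgr, {couch, error, "disk full"}),
    ok = gen_event:sync_notify(Mgr, {error_report, self(),
                                     {self(), crash_report, [oops]}}),
    ok = gen_event:sync_notify(Mgr, {unrelated, event}),
    Seen = gen_event:call(Mgr, ?MODULE, seen),
    ok = gen_event:stop(Mgr),
    Seen.
```

The event manager passes every term to each installed handler; it is the handler's clauses that decide what to do with it, which is why a catch-all clause is included.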


[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878391#action_12878391 ] 

Adam Kocoloski commented on COUCHDB-761:
----------------------------------------

I say we commit.  We've been running a similar sync_notify logger at Cloudant for a long time now without issues.


[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-761:
----------------------------------

    Attachment: improved-sync-logging-v2.patch

Here's the second version of the patch.

Changes:
1) I took Adam's advice and simplified the message format for messages sent by the ?LOG_* macros.
2) I re-ordered the handle_event clauses. The first few handle the new log macro message format, the next few handle messages coming from sasl.
3) I took this opportunity to add a config directive log:include_sasl which defaults to true. When include_sasl is false the couch_log process will ignore sasl messages. I thought about making this dependent on log level somehow but decided I prefer to give people options and have sensible defaults.

What this accomplishes:
1) synchronous logging - log messages are never missed since we use sync_notify now instead of the notify call error_logger uses
2) log macros for log levels that are disabled should have less performance impact. I don't know and haven't tested how an ets lookup compares to a gen_server call but at least responses can be fast even when the log server is busy formatting a long message
3) eliminating the gen:call for couch_log:get_level_integer/0 means no more unexpected {#Ref<...>, LogLevel} messages arriving after a timeout due to a slow or busy log server process
4) sasl logging can be disabled in production environments where the entire state of crashing gen_servers is probably waaay TMI when we probably ?LOG_ERROR a more useful message anyway

Open for re-review, comments, questions.



[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878464#action_12878464 ] 

Randall Leeds commented on COUCHDB-761:
---------------------------------------

Since there's a new config option added by this patch, should it be added to the default.ini?


[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski updated COUCHDB-761:
-----------------------------------

    Priority: Blocker  (was: Major)

Thanks for filing this ticket, Randall.  I'm bumping it to Blocker.

I'd discourage fully async logging.  I've tried it in the past; it's far too easy to overwhelm the error_logger process with debug messages.  Eventually the error_logger mailbox exhausts the available memory and the VM dies a horrible death.

Infinite timeouts are a viable option in my opinion.  Another option is to spawn a function to log the message:

Pros:
- doesn't block the original process

Cons:
- spends extra CPU cycles copying data to new process heap
- potential to exhaust process limit

Personally, I don't think it's worth the risk.  Here's what I'd propose:

1) Reimplement debug_on(), info_on() to use ets table lookups.  This is pretty easy because the log level is already stored in couch_config.

2) If the log level is enabled, use an infinite timeout to log the message.

This way we can suppress the LOG_DEBUG messages without slowing down request processing by more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger mailbox small.
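
Step 1) of the proposal could look roughly like the sketch below. It is not the couch_log implementation: the module and table names are hypothetical, the numeric levels are assumed, and maybe_log/3 merely returns the formatted string instead of handing it to a logger. The point is that a disabled level costs one ets read and no message passing.

```erlang
-module(level_gate_demo).
-export([start/1, stop/0, debug_on/0, info_on/0, maybe_log/3]).

-define(LEVEL_DEBUG, 1).              %% assumed numeric levels
-define(LEVEL_INFO, 2).
-define(TABLE, level_gate_demo_tab).  %% hypothetical table name

%% Store the configured level in a named ets table.
start(Level) ->
    ?TABLE = ets:new(?TABLE, [named_table, public]),
    true = ets:insert(?TABLE, {level, Level}),
    ok.

stop() ->
    true = ets:delete(?TABLE),
    ok.

%% Reading the level is a plain ets lookup in the calling process.
get_level_integer() ->
    ets:lookup_element(?TABLE, level, 2).

debug_on() -> get_level_integer() =< ?LEVEL_DEBUG.
info_on()  -> get_level_integer() =< ?LEVEL_INFO.

%% Only format the message when its level is enabled.
maybe_log(MsgLevel, Format, Args) ->
    case get_level_integer() =< MsgLevel of
        true  -> {logged, lists:flatten(io_lib:format(Format, Args))};
        false -> skipped
    end.
```

For example, after level_gate_demo:start(2), debug_on() is false and info_on() is true, so a debug message is skipped without ever being formatted.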


[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868165#action_12868165 ] 

Adam Kocoloski commented on COUCHDB-761:
----------------------------------------

I haven't tested it, but the patch looks correct.

Since we're moving away from the *_report functions we could also simplify the messages a bit.  For example, instead of

gen_event:sync_notify(error_logger, {info_report, group_leader(), {self(), couch_info, {Format, Args}}});

we could do

gen_event:sync_notify(error_logger, {couch, info, self(), Format, Args});

or, since we're doing sync logging, we might even partially format the message in the sending process and make it a little simpler to copy:

gen_event:sync_notify(error_logger, {couch, info, self(), io_lib:format(Format, Args)});

Randall, what do you think?




[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868453#action_12868453 ] 

Randall Leeds commented on COUCHDB-761:
---------------------------------------

I'm not sure that we can change the format of those messages unless we go directly to couch_log, and then we have to make sure couch_log understands two message formats or we're gonna miss the sasl log messages.

error_logger reads those messages to figure out what kind of report to send out to the registered handlers. I'd have to double check the code but I feel like it might ignore us if we send it something else.


[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-761:
----------------------------------

    Attachment: improved-sync-logging.patch

Here's my first run at a patch.


[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867349#action_12867349 ] 

Adam Kocoloski commented on COUCHDB-761:
----------------------------------------

To clarify, the way to do an infinite timeout in step 2) is to use gen_event:sync_notify instead of error_logger:info_report.
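
A small sketch of what sync_notify buys: gen_event:sync_notify/2 returns only after the installed handlers have processed the event, so the sender is held back while the logger is busy and there is no timeout to expire. The example below is illustrative only. It uses a private gen_event manager in place of error_logger, a made-up module name, and a handler that reports back to the sender so the ordering can be observed.

```erlang
-module(sync_notify_demo).
-behaviour(gen_event).
-export([run/0]).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

init([]) -> {ok, nostate}.

%% Pretend that writing the log line is slow, then tell the sender
%% that the event has been handled.
handle_event({couch, _Level, Pid, _Msg}, State) ->
    timer:sleep(50),
    Pid ! {written, self()},
    {ok, State};
handle_event(_Other, State) ->
    {ok, State}.

handle_call(_Req, State) -> {ok, ok, State}.
handle_info(_Msg, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

run() ->
    %% A private event manager stands in for error_logger here.
    {ok, Mgr} = gen_event:start_link(),
    ok = gen_event:add_handler(Mgr, ?MODULE, []),
    Msg = io_lib:format("hello ~p", [world]),
    ok = gen_event:sync_notify(Mgr, {couch, info, self(), Msg}),
    %% By the time sync_notify has returned, the handler has already run.
    Result = receive
                 {written, _} -> handled_before_return
             after 0 ->
                 not_yet
             end,
    ok = gen_event:stop(Mgr),
    Result.
```

With gen_event:notify/2 in place of sync_notify/2 the call would return immediately and the check with a zero timeout would normally not find the message yet, which is the asynchronous behavior this ticket argues against for busy systems.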
