You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@couchdb.apache.org by "Randall Leeds (JIRA)" <ji...@apache.org> on 2010/05/14 01:01:45 UTC
[jira] Created: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Timeouts in couch_log are masked, crashes callers
-------------------------------------------------
Key: COUCHDB-761
URL: https://issues.apache.org/jira/browse/COUCHDB-761
Project: CouchDB
Issue Type: Bug
Components: Database Core
Affects Versions: 0.11, 0.10.2, 0.10.1
Reporter: Randall Leeds
Fix For: 0.10.3, 0.11.1, 1.0
Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Closed: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Kocoloski closed COUCHDB-761.
----------------------------------
Resolution: Fixed
Good idea, Randall. I added the include_sasl to default.ini in trunk, but I don't think we necessarily need to backport it.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874885#action_12874885 ]
Randall Leeds commented on COUCHDB-761:
---------------------------------------
Had this in production for about a week now on a bunch of servers. Seems to fix the timeout problem mentioned above. I'd appreciate if some brave soul would apply this patch and give their server a beating to be sure everything looks okay. Some of the servers where this is running here are under crazy load and nothing seems broken, but I like second opinions, especially on things that could have broad performance implications under stress.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878394#action_12878394 ]
Adam Kocoloski commented on COUCHDB-761:
----------------------------------------
I needed one more patch for get_level_integer() to get make check running, since some of the tests call couch code that tries to log when couch_log is not running. I've inlined it below. I've committed on trunk and backported to 0.10.x. Waiting on 0.11.x because Jan has a monster fix for that branch in the works.
diff --git a/src/couchdb/couch_log.erl b/src/couchdb/couch_log.erl
index 5c8a5e5..2d62cbb 100644
--- a/src/couchdb/couch_log.erl
+++ b/src/couchdb/couch_log.erl
@@ -81,7 +81,11 @@ get_level() ->
level_atom(get_level_integer()).
get_level_integer() ->
- ets:lookup_element(?MODULE, level, 2).
+ try
+ ets:lookup_element(?MODULE, level, 2)
+ catch error:badarg ->
+ ?LEVEL_ERROR
+ end.
set_level_integer(Int) ->
gen_event:call(error_logger, couch_log, {set_level_integer, Int}).
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868629#action_12868629 ]
Adam Kocoloski commented on COUCHDB-761:
----------------------------------------
Oh, we definitely have to update the handle_event function in couch_log. For example, instead of
handle_event({error_report, _, {Pid, couch_error, {Format, Args}}}, {Fd, _LogLevel}=State) -> ...
it would be
handle_event({couch, error, Msg, {Fd, _LogLevel}=State) -> ...
couch_log already has a separate clause for SASL error messages, so no worries there.
error_logger won't ignore messages in this format; it'll forward them to the various handlers (including couch_log) just fine.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878391#action_12878391 ]
Adam Kocoloski commented on COUCHDB-761:
----------------------------------------
I say we commit. We've been running a similar sync_notify logger at Cloudant for a long time now without issues.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Randall Leeds updated COUCHDB-761:
----------------------------------
Attachment: improved-sync-logging-v2.patch
Here's the second version of the patch.
Changes:
1) I took Adam's advice and simplified the message format for messages sent by the ?LOG_* macros.
2) I re-ordered the handle_event clauses. The first few handle the new log macro message format, the next few handle messages coming from sasl
3) I took this opportunity to add a config direction log:include_sasl which defaults to true. When include_sasl is false the couch_log process will ignore sasl messages. I thought about making this dependent on log level somehow but decided I prefer to give people options and have sensible defaults.
What this accomplishes:
1) synchronous logging - log messages are never missed since we use sync_notify now instead of the notify call error_logger uses
2) log macros for log levels that are disabled should have less performance impact. I don't know and haven't tested how an ets lookup compares to a gen_server call but at least responses can be fast even when the log server is busy formatting a long message
3) eliminating the gen:call for couch_log:log_level_integer/0 means no more unexpected {#Ref<...>, LogLevel} messages arriving after a timeout due to a slow or busy log server process
4) sasl logging can be disabled in production environments where the entire state of crashing gen_servers is probably waaay TMI when we probably ?LOG_ERROR a more useful message anyway
Open for re-review, comments, questions.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878464#action_12878464 ]
Randall Leeds commented on COUCHDB-761:
---------------------------------------
Since there's a new config option added by this patch, should it be added to the default.ini?
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Kocoloski updated COUCHDB-761:
-----------------------------------
Priority: Blocker (was: Major)
Thanks for filing this ticket, Randall. I'm bumping it to Blocker.
I'd discourage fully async logging. I've tried it in the past; it's far too easy to overwhelm the error_logger process with debug messages. Eventually the error_logger mailbox exhausts the available memory and the VM dies a horrible death.
Infinite timeouts are a viable option in my opinion. Another option is to spawn a function to log the message:
Pros:
- doesn't block the original process
Cons:
- spends extra CPU cycles copying data to new process heap
- potential to exhaust process limit
Personally, I don't think it's worth the risk. Here's what I'd propose:
1) Reimplement debug_on(), info_on() to use ets table lookups. This is pretty easy because the log level is already stored in couch_config.
2) If the log level is enabled, use an infinite timeout to log the message.
This way we can suppress the LOG_DEBUG messages without slowing down request processing by more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger mailbox small.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868165#action_12868165 ]
Adam Kocoloski commented on COUCHDB-761:
----------------------------------------
I haven't tested it, but the patch looks correct.
Since we're moving away from the *_report functions we could also simplify the messages a bit. For example, instead of
gen_event:sync_notify(error_logger, {info_report, group_leader(), {self(), couch_info, {Format, Args}}});
we could do
gen_event:sync_notify(error_logger, {couch, info, self(), Format, Args});
or, since we're doing sync logging, we might even partially format the message in the sending process and make it a little simpler to copy:
gen_event:sync_notify(error_logger, {couch, info, self(), io_lib:format(Format, Args)});
Randall, what do you think?
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868453#action_12868453 ]
Randall Leeds commented on COUCHDB-761:
---------------------------------------
I'm not sure that we can change the format of those messages unless we go directly to couch_log and then we have to make sure couch_log understands two message formats or we're gonna miss the sasl log messages.
error_logger reads those messages to figure out what kind of report to send out to the registered handlers. I'd have to double check the code by I feel like it might ignore us if we send it something else.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Randall Leeds updated COUCHDB-761:
----------------------------------
Attachment: improved-sync-logging.patch
Here's my first run at a patch.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
> Attachments: improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-761) Timeouts in couch_log are masked,
crashes callers
Posted by "Adam Kocoloski (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867349#action_12867349 ]
Adam Kocoloski commented on COUCHDB-761:
----------------------------------------
To clarify, the way to do an infinite timeout in step 2) is to use gen_event:sync_notify instead of error_logger:info_report.
> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
> Key: COUCHDB-761
> URL: https://issues.apache.org/jira/browse/COUCHDB-761
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.10.1, 0.10.2, 0.11
> Reporter: Randall Leeds
> Priority: Blocker
> Fix For: 0.10.3, 0.11.1, 1.0
>
>
> Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.