Posted to issues-all@impala.apache.org by "Fifteen (Jira)" <ji...@apache.org> on 2021/09/02 07:51:00 UTC
[jira] [Commented] (IMPALA-10342) Flooding of UDF warnings crash the coordinator
[ https://issues.apache.org/jira/browse/IMPALA-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408649#comment-17408649 ]
Fifteen commented on IMPALA-10342:
----------------------------------
Hi, thank you for looking into this case.
I do like the idea of rate limiting those warnings, but what makes the implementation difficult is that each executor is largely stateless: it has no idea how many warnings the other executors have reported. I think there are at least two ways to achieve this.
1) Simple way: ignore parallelization. We can set a rate limit for per-line warnings. The limit is a query-level option and is the same for all fragment instances. In `ExecNodes`, before calling `addWarning()`, the warnings are de-duplicated with a hash set. That way the rate limit will not mute other, low-probability warnings.
2) Complex way: use global state. We can add a new type of topic in statestored, with each executor node as a subscriber. The content of that topic is how many different warnings have been reported by the fragment instances of a given query. The decision whether to report is made based on a snapshot of that global state. This approach is a bit complex, but it makes the rate limit more accurate.
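A minimal sketch of option 1, under stated assumptions: `WarningLimiter`, `ShouldReport()` and the `per_kind_limit` parameter are hypothetical names for illustration, not existing Impala code, and the key is assumed to be a warning kind (for example the function name) rather than the full per-row message. It uses a hash map with a counter per kind instead of a plain hash set, so that a flood of one kind cannot use up the budget of a rare one.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of Impala): one instance per fragment instance.
class WarningLimiter {
 public:
  // per_kind_limit stands in for the proposed query-level option.
  explicit WarningLimiter(std::size_t per_kind_limit)
      : per_kind_limit_(per_kind_limit) {}

  // Returns true if the caller should forward this warning, e.g. to addWarning().
  // Each warning kind has its own budget, so frequent kinds cannot mute rare ones.
  bool ShouldReport(const std::string& kind) {
    std::size_t& count = counts_[kind];  // value-initialized to 0 on first use
    if (count >= per_kind_limit_) {
      ++suppressed_;
      return false;
    }
    ++count;
    return true;
  }

  // Number of warnings dropped so far; could be reported once as a summary.
  std::size_t suppressed() const { return suppressed_; }

 private:
  const std::size_t per_kind_limit_;
  std::size_t suppressed_ = 0;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

Since every fragment instance applies the same limit independently, the total number of warnings per kind is bounded by the limit times the number of fragment instances, not by the limit itself.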
Because ExchangeNode/KrpcDatastreamService is also a bottleneck for the flooding warnings, I think a receiver-side rate limit is not a viable option.
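For option 2, the per-executor decision could look like the sketch below. Everything here is an assumption for illustration: `WarningSnapshot`, `ShouldReportGlobal()` and `global_limit` are hypothetical names, and the statestore topic itself is not modeled, only the check against the last snapshot an executor received.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical snapshot of a per-query statestore topic:
// warning kind -> number of reports already made cluster-wide.
using WarningSnapshot = std::unordered_map<std::string, std::int64_t>;

// Decide from the last received snapshot plus the warnings this executor has
// reported since that snapshot arrived. Kinds missing from the snapshot count as 0.
bool ShouldReportGlobal(const WarningSnapshot& snapshot, const std::string& kind,
                        std::int64_t local_since_snapshot,
                        std::int64_t global_limit) {
  auto it = snapshot.find(kind);
  const std::int64_t global = (it == snapshot.end()) ? 0 : it->second;
  return global + local_since_snapshot < global_limit;
}
```

The snapshot is stale between topic updates, so several executors can still overshoot the limit together; the bound is tighter than with purely local limits, but not exact.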
> Flooding of UDF warnings crash the coordinator
> ----------------------------------------------
>
> Key: IMPALA-10342
> URL: https://issues.apache.org/jira/browse/IMPALA-10342
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Fifteen
> Assignee: Fifteen
> Priority: Minor
> Attachments: image-2020-11-19-17-30-22-918.png, image-2020-11-23-09-57-49-840.png, image-2021-04-28-20-20-45-798.png, impalad-ram-profile.pdf
>
>
> Hi, when encountering an error, both `get_json_object()` and `DecimalOperators::IntToDecimalVal` will raise a warning.
> Due to their stateless nature, the warning flood can easily overwhelm the cluster's processing capacity.
> To be specific, we have observed these bottlenecks:
> *Exchange Receiver*: the default value for `rpc_max_message_size` is 50MB. The flooding warning messages carried by ReportExecStatusPB may exceed that limit, causing profile-less status reports. Even if the report message size stays under the limit, the bandwidth consumption is non-trivial.
> *Storage:* as in IMPALA-5256, flooding warnings produce huge log files since `stdout/stderr` won't be redirected when glog is rolling logs. Under these circumstances, we have had enough of clearing log files and restarting executors.
> *Coordinator*: runtime profiles are serialized to thrift and stored in the Coordinator's memory. The warning flood makes `Untracked Memory` rise rapidly. I took a heap profile (with pprof) and found that most of the memory was used by RuntimeProfile and strings.
> !image-2020-11-23-09-57-49-840.png!
>
> *Preliminary solution:*
> We suffered a lot from this problem, and we have come up with a preliminary solution.
> # We took the straightforward route of muting AddWarning().
> # We introduced a query option to re-enable the warnings when needed.
> *Testing:*
> With warning messages muted, we find the burden on the C nodes is greatly reduced and heap profiles are no longer dominated by RuntimeProfile.
>
> *Update*
> Encountered a similar crash with a `get_json_object()` query: each time the query is submitted, the Coordinator crashes.
> !image-2021-04-28-20-20-45-798.png!
> Log:
> {code:java}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x0000000002c64dca, pid=3633220, tid=0x00007eff73308700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )
> # Problematic frame:
> # C [impalad+0x2864dca] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /run/cloudera-scm-agent/process/10376-impala-IMPALAD/hs_err_pid3633220.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> d. The connection had 2 associated session(s).
> I0427 13:43:03.907536 3853145 status.cc:126] Couldn't serialize thrift object:
> std::bad_alloc
> @ 0xbf4ef9
> @ 0x1352d5f
> @ 0x1352eaf
> @ 0x11986de
> @ 0x122516c
> @ 0x1225515
> @ 0x137ee36
> @ 0x13801a0
> @ 0x139682f
> @ 0x139915a
> @ 0x1399784
> @ 0x7f34791e0e24
> @ 0x7f3475dd835c
> {code}
> StackTrace:
> {code:java}
> Stack: [0x00007eff72b08000,0x00007eff73309000], sp=0x00007eff733006b0, free space=8161k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C [impalad+0x2864dca] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
> C [impalad+0x286519f] tcmalloc::ThreadCache::Scavenge()+0x3f
> C [impalad+0x29a211a] operator delete(void*)+0x32a
> C [impalad+0xae94d9] impala::TRuntimeProfileNode::~TRuntimeProfileNode()+0x289
> C [impalad+0xae4987] impala::TRuntimeProfileTree::~TRuntimeProfileTree()+0x47
> C [impalad+0xf5280a] impala::RuntimeProfile::Compress(std::vector<unsigned char, std::allocator<unsigned char> >*) const+0x3aa
> C [impalad+0xf52eb0] impala::RuntimeProfile::SerializeToArchiveString(std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*) const+0x40
> C [impalad+0xd986df] impala::ImpalaServer::GetRuntimeProfileOutput(impala::TUniqueId const&, std::string const&, impala::TRuntimeProfileFormat::type, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::TRuntimeProfileTree*, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x5bf
> C [impalad+0xe2516d] impala::ImpalaHttpHandler::QueryProfileHelper(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*, impala::TRuntimeProfileFormat::type)+0x4ed
> C [impalad+0xe25516] impala::ImpalaHttpHandler::QueryProfileEncodedHandler(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x16
> C [impalad+0xf7ee37] impala::Webserver::RenderUrlWithTemplate(sq_connection const*, kudu::WebCallbackRegistry::WebRequest const&, impala::Webserver::UrlHandler const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::ContentType*)+0x177
> C [impalad+0xf801a1] impala::Webserver::BeginRequestCallback(sq_connection*, sq_request_info*)+0x951
> C [impalad+0xf96830] kudu::StringGauge::~StringGauge()+0x100
> {code}