Posted to issues-all@impala.apache.org by "Fifteen (Jira)" <ji...@apache.org> on 2021/09/02 07:51:00 UTC
[jira] [Commented] (IMPALA-10342) Flooding of UDF warnings crash the coordinator

    [ https://issues.apache.org/jira/browse/IMPALA-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408649#comment-17408649 ] 

Fifteen commented on IMPALA-10342:
----------------------------------

Hi, thank you for looking into this case.

 

I do like the idea of rate limiting those warnings, but what makes the implementation difficult is that each executor is largely stateless: it has no idea how many warnings the others have reported. I think there are at least two different ways to achieve this.

1) Simple way: ignore parallelization. We can set a rate limit for per-line warnings. The limit is a query-level option and is the same for all fragment instances. In `ExecNodes`, before calling `addWarning()`, the warnings are de-duplicated with a hash set. That way the rate limit will not mute other, low-probability warnings.

2) Complex way: use global state. We can add a new topic type in statestored, with each executor node as a subscriber. The content of that topic is how many different warnings have been reported by the fragment instances of a given query. Whether to report a warning is then decided based on a snapshot of that global state. This approach is more complex, but it makes the rate limiting more accurate.
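
To make the first option concrete, here is a minimal sketch of what I have in mind. It is not existing Impala code: `WarningLimiter`, `ShouldReport()` and `max_distinct` are hypothetical names, and the budget is assumed to come from the proposed query option. Identical messages are de-duplicated first, so a flood of one warning consumes only one slot of the budget and rarer warnings still get through.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical helper, not part of Impala: decides whether a warning should be
// forwarded to the real warning sink of one fragment instance.
class WarningLimiter {
 public:
  // max_distinct stands in for the proposed query-level limit (assumed name).
  explicit WarningLimiter(std::size_t max_distinct) : max_distinct_(max_distinct) {
    assert(max_distinct > 0);  // In this sketch a zero budget is treated as a bug.
  }

  // Returns true only the first time a message is seen, and only while the
  // budget of distinct messages has not been used up.
  bool ShouldReport(const std::string& message) {
    if (seen_.count(message) > 0 || seen_.size() >= max_distinct_) {
      ++suppressed_;
      return false;
    }
    seen_.insert(message);
    return true;
  }

  // Number of warnings dropped so far, e.g. for a single summary line.
  std::size_t suppressed() const { return suppressed_; }

 private:
  const std::size_t max_distinct_;
  std::unordered_set<std::string> seen_;
  std::size_t suppressed_ = 0;
};
```

The caller would check `ShouldReport(msg)` and only then forward the message to the warning path. Since every fragment instance keeps its own limiter, the total for a query is bounded only by the limit times the number of instances, which is the price of ignoring parallelization.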

 

Because ExchangeNode/KrpcDatastreamService is also a bottleneck for the flooding warnings, I think a receiver-side rate limit is not a viable option.

> Flooding of UDF warnings crash the coordinator
> ----------------------------------------------
>
>                 Key: IMPALA-10342
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10342
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Fifteen
>            Assignee: Fifteen
>            Priority: Minor
>         Attachments: image-2020-11-19-17-30-22-918.png, image-2020-11-23-09-57-49-840.png, image-2021-04-28-20-20-45-798.png, impalad-ram-profile.pdf
>
>
> Hi, when encountering an error, both `get_json_object()` and `DecimalOperators::IntToDecimalVal` will raise a warning.
> Due to their stateless nature, the warning flood will easily overwhelm the cluster's processing capacity.
> To be specific, we have observed these bottlenecks:
> *Exchange Receiver*: the default value for `rpc_max_message_size` is 50MB. The flooding warning messages carried by ReportExecStatusPB may exceed that limit, causing profile-less status reports. Even if the report message size stays under the limit, the bandwidth consumption is non-trivial.
> *Storage*: like IMPALA-5256, flooding warnings produce huge log files since `stdout/stderr` won't be redirected when glog is rolling logs. Under this circumstance, we had enough of clearing log files and restarting executors.
> *Coordinator*: runtime profiles are serialized to thrift and stored in the Coordinator's memory. The warning flood makes `Untracked Memory` rise rapidly. I made a heap profile (with pprof) and found that most of the memory was used by RuntimeProfile and Strings.
>   !image-2020-11-23-09-57-49-840.png!
>  
> *Preliminary Solution:*
> We suffered a lot from this problem, and we have come up with a preliminary solution.
>  # We have a straightforward solution: muting AddWarning().
>  # We introduced a query option to re-enable the warnings when needed.
>  *Testing:*
> With warning messages muted, we find the burden on C nodes is greatly alleviated and heap profiles are no longer bound to RuntimeProfile.
>  
> *Update*
> Encountered a similar crash with a `get_json_object()` query: each time the query is submitted, the Coordinator crashes.
> !image-2021-04-28-20-20-45-798.png!
> Log:
> {code:java}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0000000002c64dca, pid=3633220, tid=0x00007eff73308700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )
> # Problematic frame:
> # C  [impalad+0x2864dca]  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /run/cloudera-scm-agent/process/10376-impala-IMPALAD/hs_err_pid3633220.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> d. The connection had 2 associated session(s).
> I0427 13:43:03.907536 3853145 status.cc:126] Couldn't serialize thrift object:
> std::bad_alloc
>     @           0xbf4ef9
>     @          0x1352d5f
>     @          0x1352eaf
>     @          0x11986de
>     @          0x122516c
>     @          0x1225515
>     @          0x137ee36
>     @          0x13801a0
>     @          0x139682f
>     @          0x139915a
>     @          0x1399784
>     @     0x7f34791e0e24
>     @     0x7f3475dd835c
> {code}
>  StackTrace:
> {code:java}
> Stack: [0x00007eff72b08000,0x00007eff73309000],  sp=0x00007eff733006b0,  free space=8161k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [impalad+0x2864dca]  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x13a
> C  [impalad+0x286519f]  tcmalloc::ThreadCache::Scavenge()+0x3f
> C  [impalad+0x29a211a]  operator delete(void*)+0x32a
> C  [impalad+0xae94d9]  impala::TRuntimeProfileNode::~TRuntimeProfileNode()+0x289
> C  [impalad+0xae4987]  impala::TRuntimeProfileTree::~TRuntimeProfileTree()+0x47
> C  [impalad+0xf5280a]  impala::RuntimeProfile::Compress(std::vector<unsigned char, std::allocator<unsigned char> >*) const+0x3aa
> C  [impalad+0xf52eb0]  impala::RuntimeProfile::SerializeToArchiveString(std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*) const+0x40
> C  [impalad+0xd986df]  impala::ImpalaServer::GetRuntimeProfileOutput(impala::TUniqueId const&, std::string const&, impala::TRuntimeProfileFormat::type, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::TRuntimeProfileTree*, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x5bf
> C  [impalad+0xe2516d]  impala::ImpalaHttpHandler::QueryProfileHelper(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*, impala::TRuntimeProfileFormat::type)+0x4ed
> C  [impalad+0xe25516]  impala::ImpalaHttpHandler::QueryProfileEncodedHandler(kudu::WebCallbackRegistry::WebRequest const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>, rapidjson::CrtAllocator>*)+0x16
> C  [impalad+0xf7ee37]  impala::Webserver::RenderUrlWithTemplate(sq_connection const*, kudu::WebCallbackRegistry::WebRequest const&, impala::Webserver::UrlHandler const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, impala::ContentType*)+0x177
> C  [impalad+0xf801a1]  impala::Webserver::BeginRequestCallback(sq_connection*, sq_request_info*)+0x951
> C  [impalad+0xf96830]  kudu::StringGauge::~StringGauge()+0x100
> {code}


