Posted to issues@impala.apache.org by "Thomas Tauber-Marshall (JIRA)" <ji...@apache.org> on 2018/02/05 18:39:00 UTC

[jira] [Resolved] (IMPALA-6338) test_profile_fragment_instances failing

     [ https://issues.apache.org/jira/browse/IMPALA-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Tauber-Marshall resolved IMPALA-6338.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.0

commit a018038df5b13f24f7980b75d755e0123ae2687d
Author: Thomas Tauber-Marshall <tm...@cloudera.com>
Date:   Tue Jan 9 13:59:52 2018 -0800

    IMPALA-6338: Fix flaky test_profile_fragment_instances
    
    test_profile_fragment_instances runs a query that internally cancels
    fragment instances that are still executing once all results have
    been returned, and then checks that every fragment instance appears
    in the query profile.
    
    Every fragment instance (fis) is guaranteed to send a profile to the
    coordinator in Finalize(), but previously fragment profiles were not
    applied by the coordinator if the backend was 'done', defined as
    either all instances having completed or one having entered an error
    state (including cancelled).
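
    A minimal sketch of that old 'done' rule, using hypothetical names
    rather than Impala's actual coordinator code:

      // Sketch only: InstanceState and BackendDoneOld are illustrative
      // names, not real Impala types. A backend counted as 'done' once
      // every instance completed OR any instance hit an error/cancelled
      // state, after which further profile reports were dropped.
      #include <vector>

      enum class InstanceState { EXECUTING, COMPLETED, CANCELLED, ERROR };

      bool BackendDoneOld(const std::vector<InstanceState>& instances) {
        bool all_completed = true;
        for (InstanceState s : instances) {
          // CANCELLED counted as an error state here, so one cancelled
          // instance was enough to mark the whole backend 'done'.
          if (s == InstanceState::ERROR || s == InstanceState::CANCELLED) {
            return true;
          }
          if (s != InstanceState::COMPLETED) all_completed = false;
        }
        return all_completed;  // otherwise 'done' only when all completed
      }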
    
    So, the test could fail by the following sequence:
    - Some fragment for a particular backend sends an update to the
      coordinator. 'returned_all_results_' is true, so the coordinator
      responds indicating that the backend should cancel its remaining
      fragments.
    - Another fragment from that backend executes Finalize() and reports
      that it was cancelled. This causes the coordinator to consider the
      entire backend to be 'done'.
    - A third fragment from the same backend, which had not previously
      sent a report from the reporting thread, executes Finalize(). This
      report will not be applied by the coordinator because the backend
      is considered 'done', so this fragment will not appear in the
      final profile.
    
    The solution is to change the definition of 'done' so that it
    excludes a backend that has been cancelled but still has fragments
    that haven't completed. This guarantees that for queries that
    complete successfully and are cancelled internally, every fis sends
    a report and has it applied by the coordinator before all results
    have been returned: when eos is true, Coordinator::GetNext() calls
    WaitForBackendCompletion(), which in this situation now waits for
    all fis to Finalize().
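
    A corresponding sketch of the fixed rule (same hypothetical names as
    above, illustrating the intent rather than the actual patch): an
    instance that was told to cancel but has not yet finalized keeps the
    backend 'not done', so its Finalize() report is still applied.

      #include <vector>

      enum class InstanceState { EXECUTING, COMPLETED, CANCELLED, ERROR };

      bool BackendDoneFixed(const std::vector<InstanceState>& instances) {
        for (InstanceState s : instances) {
          if (s == InstanceState::ERROR) return true;  // errors still finish it
          // A still-executing instance (even one that has been asked to
          // cancel) keeps the backend open until it reports in Finalize().
          if (s == InstanceState::EXECUTING) return false;
        }
        return true;  // every instance completed or finalized after cancel
      }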
    
    Returning results for queries that are cancelled by the user is
    unaffected, as the manual cancel path also goes through
    WaitForBackendCompletion().
    
    Testing:
    - Ran test_profile_fragment_instances in a loop with no failures.
      I can reliably reproduce the original problem with a few
      carefully placed sleeps.
    
    Change-Id: I77773a1e3c62952003f37f88fe2b662bb11889ed
    Reviewed-on: http://gerrit.cloudera.org:8080/8997
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
    Tested-by: Impala Public Jenkins

> test_profile_fragment_instances failing
> ---------------------------------------
>
>                 Key: IMPALA-6338
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6338
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.11.0, Impala 2.12.0
>            Reporter: David Knupp
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>              Labels: flaky
>             Fix For: Impala 3.0
>
>         Attachments: profile-failure.txt, profile-success.txt
>
>
> Stack trace:
> {noformat}
> query_test/test_observability.py:123: in test_profile_fragment_instances
>     assert results.runtime_profile.count("HDFS_SCAN_NODE") == 12
> E   assert 11 == 12
> E    +  where 11 = <built-in method count of str object at 0x68bd2f0>('HDFS_SCAN_NODE')
> E    +    where <built-in method count of str object at 0x68bd2f0> = 'Query (id=ae4cee91aafc5c6c:11b545c600000000):\n  DEBUG MODE WARNING: Query profile created while running a DEBUG buil...ontextSwitches: 0 (0)\n           - TotalRawHdfsReadTime(*): 5s784ms\n           - TotalReadThroughput: 17.33 MB/sec\n'.count
> E    +      where 'Query (id=ae4cee91aafc5c6c:11b545c600000000):\n  DEBUG MODE WARNING: Query profile created while running a DEBUG buil...ontextSwitches: 0 (0)\n           - TotalRawHdfsReadTime(*): 5s784ms\n           - TotalReadThroughput: 17.33 MB/sec\n' = <tests.beeswax.impala_beeswax.ImpalaBeeswaxResult object at 0x6322e10>.runtime_profile
> {noformat}
> Query:
> {noformat}
> with l as (select * from tpch.lineitem UNION ALL select * from tpch.lineitem)
>         select STRAIGHT_JOIN count(*) from (select * from tpch.lineitem a LIMIT 1) a
>         join (select * from l LIMIT 2000000) b on a.l_orderkey = -b.l_orderkey;
> {noformat}
> Summary:
> {noformat}
> Operator             #Hosts  Avg Time  Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                
> --------------------------------------------------------------------------------------------------------------------
> 05:AGGREGATE              1   0.000ns   0.000ns      1           1   28.00 KB       10.00 MB  FINALIZE              
> 04:HASH JOIN              1  15.000ms  15.000ms      0           1  141.06 MB       17.00 MB  INNER JOIN, BROADCAST 
> |--08:EXCHANGE            1   4s153ms   4s153ms  2.00M       2.00M          0              0  UNPARTITIONED         
> |  07:EXCHANGE            1   3s783ms   3s783ms  2.00M       2.00M          0              0  UNPARTITIONED         
> |  01:UNION               3  17.000ms  28.001ms  3.03M       2.00M          0              0                        
> |  |--03:SCAN HDFS        3   0.000ns   0.000ns      0       6.00M          0      176.00 MB  tpch.lineitem         
> |  02:SCAN HDFS           3   6s133ms   6s948ms  3.03M       6.00M   24.02 MB      176.00 MB  tpch.lineitem         
> 06:EXCHANGE               1   5s655ms   5s655ms      1           1          0              0  UNPARTITIONED         
> 00:SCAN HDFS              3   4s077ms   6s207ms      2           1   16.05 MB      176.00 MB  tpch.lineitem a       
> {noformat}
> Plan:
> {noformat}
> ----------------
> Max Per-Host Resource Reservation: Memory=17.00MB
> Per-Host Resource Estimates: Memory=379.00MB
> F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=27.00MB mem-reservation=17.00MB
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B
> |
> 05:AGGREGATE [FINALIZE]
> |  output: count(*)
> |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
> |  tuple-ids=7 row-size=8B cardinality=1
> |
> 04:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: a.l_orderkey = -1 * l_orderkey
> |  fk/pk conjuncts: assumed fk/pk
> |  runtime filters: RF000[bloom] <- -1 * l_orderkey
> |  mem-estimate=17.00MB mem-reservation=17.00MB spill-buffer=1.00MB
> |  tuple-ids=0,4 row-size=16B cardinality=1
> |
> |--08:EXCHANGE [UNPARTITIONED]
> |  |  mem-estimate=0B mem-reservation=0B
> |  |  tuple-ids=4 row-size=8B cardinality=2000000
> |  |
> |  F05:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |  Per-Host Resources: mem-estimate=0B mem-reservation=0B
> |  07:EXCHANGE [UNPARTITIONED]
> |  |  limit: 2000000
> |  |  mem-estimate=0B mem-reservation=0B
> |  |  tuple-ids=4 row-size=8B cardinality=2000000
> |  |
> |  F04:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> |  Per-Host Resources: mem-estimate=176.00MB mem-reservation=0B
> |  01:UNION
> |  |  pass-through-operands: all
> |  |  limit: 2000000
> |  |  mem-estimate=0B mem-reservation=0B
> |  |  tuple-ids=4 row-size=8B cardinality=2000000
> |  |
> |  |--03:SCAN HDFS [tpch.lineitem, RANDOM]
> |  |     partitions=1/1 files=1 size=718.94MB
> |  |     stored statistics:
> |  |       table: rows=6001215 size=718.94MB
> |  |       columns: all
> |  |     extrapolated-rows=disabled
> |  |     mem-estimate=176.00MB mem-reservation=0B
> |  |     tuple-ids=3 row-size=8B cardinality=6001215
> |  |
> |  02:SCAN HDFS [tpch.lineitem, RANDOM]
> |     partitions=1/1 files=1 size=718.94MB
> |     stored statistics:
> |       table: rows=6001215 size=718.94MB
> |       columns: all
> |     extrapolated-rows=disabled
> |     mem-estimate=176.00MB mem-reservation=0B
> |     tuple-ids=2 row-size=8B cardinality=6001215
> |
> 06:EXCHANGE [UNPARTITIONED]
> |  limit: 1
> |  mem-estimate=0B mem-reservation=0B
> |  tuple-ids=0 row-size=8B cardinality=1
> |
> F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
> Per-Host Resources: mem-estimate=176.00MB mem-reservation=0B
> 00:SCAN HDFS [tpch.lineitem a, RANDOM]
>    partitions=1/1 files=1 size=718.94MB
>    runtime filters: RF000[bloom] -> a.l_orderkey
>    stored statistics:
>      table: rows=6001215 size=718.94MB
>      columns: all
>    extrapolated-rows=disabled
>    limit: 1
>    mem-estimate=176.00MB mem-reservation=0B
>    tuple-ids=0 row-size=8B cardinality=1
> ----------------
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)