Posted to issues@hive.apache.org by "John Sherman (Jira)" <ji...@apache.org> on 2022/10/03 18:20:00 UTC

[jira] [Comment Edited] (HIVE-26584) compressed_skip_header_footer_aggr.q is flaky

    [ https://issues.apache.org/jira/browse/HIVE-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612407#comment-17612407 ] 

John Sherman edited comment on HIVE-26584 at 10/3/22 6:19 PM:
--------------------------------------------------------------

After digging deeper: you are correct, it is not a concurrency issue. Running the tests concurrently just happened to be the easiest way to reproduce it, and I mistakenly thought that was the root cause (before we had the containerized ptest framework, test conflicts were somewhat common, iirc).

Here is what I think is happening:
1. During PR testing TestMiniLlapLocalCliDriver tests get split into 32 different splits
[https://github.com/apache/hive/blob/master/itests/bin/generate-cli-splits.sh]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L39]
(It codegens 32 new TestMiniLlapLocalCliDriver classes, each with one of split0 through split31 in the package name)

2. Test assignment for each split is handled via runtime introspection of the class name:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestMiniLlapLocalCliDriver.java#L43]
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/SplitSupport.java#L46]

In my PR's case:
empty_skip_header_footer_aggr.q gets assigned to split-7:
{code:java}
<testcase name="testCliDriver[empty_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver" time="2.534"/>
{code}
compressed_skip_header_footer_aggr.q gets assigned to split-4:
{code:java}
<testcase name="testCliDriver[compressed_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver" time="7.242">
{code}
3. The test splits are then distributed across 20 executors (I am not sure where this is configured, maybe in the Jenkins scripts).
split-7 and split-4 both get assigned to the same "execution split", number 14:
{code:java}
split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver.xml
144:  <testcase name="testCliDriver[empty_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split7.TestMiniLlapLocalCliDriver" time="2.534"/>

split-14/itests/qtest/target/surefire-reports/TEST-org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver.xml
165:  <testcase name="testCliDriver[compressed_skip_header_footer_aggr]" classname="org.apache.hadoop.hive.cli.split4.TestMiniLlapLocalCliDriver" time="7.242">
{code}
4. empty_skip_header_footer_aggr gets executed before compressed_skip_header_footer_aggr (this can be seen above in that 144 is before 165 in the test xml)

5. Both empty_skip_header_footer_aggr and compressed_skip_header_footer_aggr create external tables with the data copied to the same location(s).
For example, these locations are used in both tests:
${system:test.tmp.dir}/testcase1
${system:test.tmp.dir}/testcase2
Since each test ends up using the same paths and the tmp directory is not cleaned between tests, this is where the conflict occurs.

6. empty_skip_header_footer_aggr includes rmr commands to clean up the testcase1 and testcase2 directories.
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/empty_skip_header_footer_aggr.q#L6]

compressed_skip_header_footer_aggr does not:
[https://github.com/apache/hive/blob/4170e566143e6daa291654e97116199aa738377c/ql/src/test/queries/clientpositive/compressed_skip_header_footer_aggr.q#L1]

This also likely explains why it is not reproducible via:
{code:java}
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=compressed_skip_header_footer_aggr.q,empty_skip_header_footer_aggr.q
{code}
I think the order of the tests when executed this way is always compressed_skip_header_footer_aggr.q first and then empty_skip_header_footer_aggr.q, so the compressed test never sees leftovers from the empty test.
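
To make the order dependence concrete, here is a rough sketch in Python (not Hive code). The two functions are hypothetical stand-ins for the .q files, the file names empty.txt and data.gz are made up for illustration, and a temporary directory plays the role of ${system:test.tmp.dir}. It only assumes what is described above: both tests write to testcase1, and only the empty test removes the directory first.

```python
import shutil
import tempfile
from pathlib import Path


def run_empty_test(tmp_dir):
    """Stand-in for empty_skip_header_footer_aggr.q: cleans testcase1, then writes data."""
    target = Path(tmp_dir) / "testcase1"
    shutil.rmtree(target, ignore_errors=True)  # plays the role of the rmr commands
    target.mkdir(parents=True)
    (target / "empty.txt").write_text("")  # hypothetical data file
    return sorted(p.name for p in target.iterdir())


def run_compressed_test(tmp_dir):
    """Stand-in for compressed_skip_header_footer_aggr.q: no cleanup before writing."""
    target = Path(tmp_dir) / "testcase1"
    target.mkdir(parents=True, exist_ok=True)
    (target / "data.gz").write_text("placeholder")  # hypothetical data file
    return sorted(p.name for p in target.iterdir())


def files_seen_by_compressed(order):
    """Run both stand-ins in the given order against one shared tmp directory."""
    tests = {"empty": run_empty_test, "compressed": run_compressed_test}
    with tempfile.TemporaryDirectory() as tmp:
        seen = {name: tests[name](tmp) for name in order}
    return seen["compressed"]


# Order seen with the local mvn invocation: the compressed test sees only its own file.
print(files_seen_by_compressed(["compressed", "empty"]))
# Order seen on the CI executor: the compressed test also sees the leftover file.
print(files_seen_by_compressed(["empty", "compressed"]))
```

In the second ordering the external table location holds an extra file, which is the kind of stale data that would change the query results.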

My fix works because I give each test's external data files a unique location.

I'll likely modify empty_skip_header_footer_aggr.q to remove the rmr commands (because the only thing they really do is hide this problem) and give all the files/directories unique names. We could likely add a "unique external directory" variable that is generated per testcase and cleaned up after each one (or some other solution), but I think that is out of scope for this ticket.
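
For illustration only, the "unique external directory" idea could look roughly like the Python sketch below. The helper name unique_external_dir and its parameters are hypothetical; nothing like it exists in the Hive test framework as far as this ticket shows, and a real version would live in the Java qtest harness.

```python
import re
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def unique_external_dir(tmp_root, testcase_name):
    """Yield a fresh directory for one testcase and remove it afterwards.

    tmp_root stands in for the shared test tmp directory; the testcase name
    is only used as a readable prefix, uniqueness comes from mkdtemp.
    """
    safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", testcase_name)
    path = Path(tempfile.mkdtemp(prefix=safe_name + "_", dir=tmp_root))
    try:
        yield path
    finally:
        # Clean up even if the test body raised.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with unique_external_dir(root, "empty_skip_header_footer_aggr") as a:
            with unique_external_dir(root, "compressed_skip_header_footer_aggr") as b:
                print(a != b)  # the two tests never share a location
```

With something like this, neither test would need its own rmr commands, and a leftover from one test could not leak into another.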


> compressed_skip_header_footer_aggr.q is flaky
> ---------------------------------------------
>
>                 Key: HIVE-26584
>                 URL: https://issues.apache.org/jira/browse/HIVE-26584
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: John Sherman
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In one of my PRs, compressed_skip_header_footer_aggr.q was failing with an unexpected diff, such as:
> {code:java}
>  TestMiniLlapLocalCliDriver.testCliDriver:62 Client Execution succeeded but contained differences (error code = 1) after executing compressed_skip_header_footer_aggr.q
> 69,71c69,70
> < 1 2019-12-31
> < 2 2018-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 89d87
> < NULL  NULL
> 91c89
> < 2 2018-12-31
> ---
> > 2 2019-12-31
> 100c98
> < 1
> ---
> > 2
> 109c107
> < 1 2019-12-31
> ---
> > 2 2019-12-31
> 127,128c125,126
> < 1 2019-12-31
> < 3 2017-12-31
> ---
> > 2 2019-12-31
> > 3 2019-12-31
> 146a145
> > 2 2019-12-31
> 155c154
> < 1
> ---
> > 2 {code}
> When I investigated, it did not seem to fail when executed locally. Since I suspected test interference, I searched for the table names/directories used and discovered empty_skip_header_footer_aggr.q, which uses the same table names AND external directories.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)