Posted to issues@drill.apache.org by "Vova Vysotskyi (Jira)" <ji...@apache.org> on 2020/02/05 11:50:00 UTC
[jira] [Updated] (DRILL-7570) Fix unstable statistics tests

     [ https://issues.apache.org/jira/browse/DRILL-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vova Vysotskyi updated DRILL-7570:
----------------------------------
    Description: 
Drill contains tests that check whether statistics are applied; some of them also use sampling to calculate the statistics values.

Sampling adds a limit above the scan, but the tests check the exact value of the estimated row count to verify that statistics were applied. A limit without sorting doesn't guarantee consistent results, so the sampled rows, and therefore the estimate, can differ between runs and these tests may fail intermittently:
{noformat}
[ERROR]   TestMetastoreCommands.testAnalyzeWithSampleStatistics:2739 Did not find expected pattern in plan: Filter\(condition.*\).*rowcount = 96.25,
00-00    Screen : rowType = RecordType(ANY employee_id): rowcount = 105.0, cumulative cost = {2530.5 rows, 7570.5 cpu, 2310.0 io, 0.0 network, 0.0 memory}, id = 336738
00-01      Project(employee_id=[$1]) : rowType = RecordType(ANY employee_id): rowcount = 105.0, cumulative cost = {2520.0 rows, 7560.0 cpu, 2310.0 io, 0.0 network, 0.0 memory}, id = 336737
00-02        SelectionVectorRemover : rowType = RecordType(ANY department_id, ANY employee_id): rowcount = 105.0, cumulative cost = {2415.0 rows, 7455.0 cpu, 2310.0 io, 0.0 network, 0.0 memory}, id = 336736
00-03          Filter(condition=[=($0, 2)]) : rowType = RecordType(ANY department_id, ANY employee_id): rowcount = 105.0, cumulative cost = {2310.0 rows, 7350.0 cpu, 2310.0 io, 0.0 network, 0.0 memory}, id = 336735
00-04            Scan(table=[[dfs, tmp, employeeWithStatsFile]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/runner/work/drill/drill/exec/java-exec/target/org.apache.drill.exec.sql.TestMetastoreCommands/dfsTestTmp/1580901135676-0/employeeWithStatsFile/0_0_0.parquet]], selectionRoot=/home/runner/work/drill/drill/exec/java-exec/target/org.apache.drill.exec.sql.TestMetastoreCommands/dfsTestTmp/1580901135676-0/employeeWithStatsFile, numFiles=1, numRowGroups=1, usedMetadataFile=false, usedMetastore=true, filter=equal(`department_id`, 2) , columns=[`department_id`, `employee_id`]]]) : rowType = RecordType(ANY department_id, ANY employee_id): rowcount = 1155.0, cumulative cost = {1155.0 rows, 2310.0 cpu, 2310.0 io, 0.0 network, 0.0 memory}, id = 336734
 expected:<true> but was:<false> {noformat}
List of tests to fix:

- TestMetastoreCommands.testAnalyzeWithSampleStatistics;
- TestAnalyze.testHistogramWithSubsetColumnsAndSampling;
- TestAnalyze.basic3.
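
As an illustration of why the assertion above is sensitive to the sample, here is a minimal Java sketch; it is not code from Drill's test suite. The class PlanEstimates and its methods are hypothetical names introduced for this example, and the plan text passed in is assumed to have the same line format as the output above.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Drill's test framework.
final class PlanEstimates {

  // Same shape as the pattern in the failure message: the Filter line must
  // contain exactly the given estimate, so any other sampled value fails.
  static boolean matchesExactEstimate(String plan, String estimate) {
    Pattern exact = Pattern.compile(
        "Filter\\(condition.*\\).*rowcount = " + Pattern.quote(estimate) + ",");
    return exact.matcher(plan).find();
  }

  // Extracts the first "rowcount = N" that follows "<operator>(" on the same
  // plan line, which makes it easy to report the estimate actually produced.
  static OptionalDouble rowCountOf(String plan, String operator) {
    Pattern p = Pattern.compile(
        Pattern.quote(operator) + "\\(.*?rowcount = ([0-9]+(?:\\.[0-9]+)?)");
    Matcher m = p.matcher(plan);
    return m.find()
        ? OptionalDouble.of(Double.parseDouble(m.group(1)))
        : OptionalDouble.empty();
  }
}
```

With the plan from the failure above, matchesExactEstimate(plan, "96.25") returns false because the Filter line reports rowcount = 105.0, while rowCountOf(plan, "Filter") returns the estimate that was actually produced for that run.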


> Fix unstable statistics tests
> -----------------------------
>
>                 Key: DRILL-7570
>                 URL: https://issues.apache.org/jira/browse/DRILL-7570
>             Project: Apache Drill
>          Issue Type: Task
>    Affects Versions: 1.17.0
>            Reporter: Vova Vysotskyi
>            Assignee: Vova Vysotskyi
>            Priority: Major
>             Fix For: 1.18.0


