You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2016/09/01 19:44:50 UTC

[Impala-CR] IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks

Hello Marcel Kornacker, Internal Jenkins, Tim Armstrong,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4212

to review the following change.

Change subject: IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks
......................................................................

IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks

Refactor DataSink interface to be more generic. We need more flexibility
in setting up MemTrackers, so that memory is accounted against the right
ExecNode. Also removes some redundancy between DataSink subclasses in
setting up RuntimeProfiles, etc.

Remove the redundancy in the DataSink between passing eos to GetNext()
and FlushFinal(). This simplifies HdfsTableSink quite a bit and makes
handling empty batches simpler.

Partially refactor join nodes so that the control flow between
BlockingJoinNode::Open() and its subclasses is easier to follow.
BlockingJoinNode now only calls one virtual function on its
subclasses: ConstructJoinBuild(). Once we convert all join nodes
to use the DataSink interface, we will also be able to remove that
as well.

As a minor optimisation, avoid updating a timer that is ignored for
non-async builds.

As a proof of concept, this patch separates out the build side of
NestedLoopJoinNode into a class that implements the DataSink
interface. The main challenge here is that NestedLoopJoinNode
recycles row batches to avoid reallocations and copies of row
batches in subplans. The solution to this is:
- Retain the special-case optimisation for SingularRowSrc
- Use the row batch cache and RowBatch::AcquireState() to copy
  the state of row batches passed to Send(), while recycling the
  RowBatch objects.

Refactoring the partitioned hash join is left for Part 2.

Testing:
Ran exhaustive, core ASAN, and exhaustive non-partioned agg/join builds.
Also ran a local stress test.

Performance:
Ran TPC-H nested locally. The results show that the change is perf-neutral.

+------------------+-----------------------+---------+------------+------------+----------------+
| Workload         | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------------+-----------------------+---------+------------+------------+----------------+
| TPCH_NESTED(_20) | parquet / none / none | 7.75    | +0.19%     | 5.64       | -0.34%         |
+------------------+-----------------------+---------+------------+------------+----------------+

+------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload         | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TPCH_NESTED(_20) | TPCH-Q17 | parquet / none / none | 18.96  | 17.95       |   +5.61%   |   4.97%   |   0.71%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q14 | parquet / none / none | 3.61   | 3.56        |   +1.25%   |   0.97%   |   1.19%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q8  | parquet / none / none | 6.25   | 6.23        |   +0.44%   |   0.44%   |   0.90%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q10 | parquet / none / none | 5.84   | 5.83        |   +0.30%   |   1.21%   |   1.82%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q5  | parquet / none / none | 4.91   | 4.90        |   +0.11%   |   0.18%   |   0.78%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q21 | parquet / none / none | 17.82  | 17.81       |   +0.07%   |   0.66%   |   0.58%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q4  | parquet / none / none | 5.12   | 5.12        |   -0.02%   |   0.97%   |   1.23%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q9  | parquet / none / none | 23.85  | 23.88       |   -0.15%   |   0.72%   |   0.22%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q12 | parquet / none / none | 6.15   | 6.16        |   -0.16%   |   1.60%   |   1.72%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q3  | parquet / none / none | 5.46   | 5.47        |   -0.23%   |   1.28%   |   0.90%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q16 | parquet / none / none | 3.61   | 3.62        |   -0.26%   |   1.00%   |   1.36%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q19 | parquet / none / none | 20.19  | 20.31       |   -0.58%   |   1.67%   |   0.65%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q7  | parquet / none / none | 9.42   | 9.48        |   -0.68%   |   0.87%   |   0.71%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q18 | parquet / none / none | 12.94  | 13.06       |   -0.90%   |   0.59%   |   0.48%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q22 | parquet / none / none | 1.09   | 1.10        |   -0.92%   |   2.26%   |   2.22%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q13 | parquet / none / none | 3.75   | 3.78        |   -0.94%   |   2.04%   |   2.86%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q20 | parquet / none / none | 4.33   | 4.37        |   -1.10%   |   3.00%   |   2.43%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q2  | parquet / none / none | 2.39   | 2.42        |   -1.38%   |   1.54%   |   1.30%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q11 | parquet / none / none | 1.43   | 1.46        |   -1.78%   |   2.05%   |   2.77%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q6  | parquet / none / none | 2.29   | 2.33        |   -1.79%   |   0.56%   |   1.23%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q15 | parquet / none / none | 5.04   | 5.13        |   -1.84%   |   0.61%   |   2.01%        | 1           | 10    |
| TPCH_NESTED(_20) | TPCH-Q1  | parquet / none / none | 5.98   | 6.12        |   -2.30%   |   1.84%   |   3.19%        | 1           | 10    |
+------------------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475
Reviewed-on: http://gerrit.cloudera.org:8080/3842
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Reviewed-by: Marcel Kornacker <ma...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 3e2411f3078314c03bbe9c9b225770aa1580fdc4)
---
M be/src/exec/CMakeLists.txt
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/data-sink.h
M be/src/exec/hash-join-node.cc
M be/src/exec/hash-join-node.h
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-sink.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/kudu-table-sink-test.cc
M be/src/exec/kudu-table-sink.cc
M be/src/exec/kudu-table-sink.h
A be/src/exec/nested-loop-join-builder.cc
A be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/nested-loop-join-node.h
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/row-batch-cache.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/runtime/row-batch.cc
M be/src/runtime/row-batch.h
M be/src/util/stopwatch.h
29 files changed, 718 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/12/4212/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4212
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR] IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has abandoned this change.

Change subject: IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks
......................................................................


Abandoned

-- 
To view, visit http://gerrit.cloudera.org:8080/4212
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>