Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/08/12 20:27:00 UTC
[jira] [Commented] (IMPALA-9859) Milestone 4: Read updated tables

    [ https://issues.apache.org/jira/browse/IMPALA-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176587#comment-17176587 ] 

ASF subversion and git services commented on IMPALA-9859:
---------------------------------------------------------

Commit da34d34a42ad1bb77d6911708f1363c53ac79018 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=da34d34 ]

IMPALA-9859: Full ACID Milestone 4: Part 2 Reading modified tables (complex types)

This implements scanning full ACID tables that contain complex types.
We use the same technique as for primitive types, i.e. we add
a LEFT ANTI JOIN on top of the Hdfs scan node in order to subtract
the deleted rows from the inserted rows.
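
The anti-join idea can be illustrated with a minimal Python sketch. This is not Impala code: the RowId triple, the in-memory lists and the left_anti_join helper are all assumptions made for illustration. It only shows the semantics of keeping the inserted rows that have no matching delete event.

```python
from typing import NamedTuple


class RowId(NamedTuple):
    # Hypothetical row identifier, modelled as a (write ID, bucket, row ID)
    # triple. The field names are assumptions for this sketch.
    write_id: int
    bucket: int
    row_id: int


def left_anti_join(inserted, deleted_ids):
    """Return the inserted rows whose RowId matches no delete event."""
    deleted = set(deleted_ids)  # build side: hash set of deleted row IDs
    return [(rid, payload) for rid, payload in inserted if rid not in deleted]


# Rows found in the insert deltas, tagged with their row IDs.
inserted_rows = [
    (RowId(1, 0, 0), "a"),
    (RowId(1, 0, 1), "b"),
    (RowId(1, 0, 2), "c"),
]
# Row IDs found in the delete deltas.
delete_events = [RowId(1, 0, 1)]

visible = left_anti_join(inserted_rows, delete_events)
```

Here only the row with payload "b" is filtered out, because it is the only one with a matching delete event.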

However, this does not work for queries that scan nested collection
items directly.

E.g.: SELECT item FROM complextypestbl.int_array;

The above query creates only a single tuple descriptor, which holds the
collection items. Since this tuple descriptor is not at the table level,
we cannot add slot references to the hidden ACID columns, which are at
the top level of the table schema.

To resolve this I added a statement rewriter that rewrites the above
statement to the following:

  SELECT item FROM complextypestbl $a$1, $a$1.int_array;

Now in this example we'll have two tuple descriptors, one at the
table level and one for the collection items. So we can add the ACID
slot refs to the table-level tuple descriptor. The rewrite is
implemented by the new AcidRewriter class.
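
To make the shape of the rewrite concrete, here is a toy string-level sketch in Python. It is not how AcidRewriter works (the real rewrite presumably operates on the analyzed statement, not on text). The function name rewrite_collection_ref and its parameters are hypothetical. The sketch only reproduces the before/after form of the FROM clause shown above.

```python
def rewrite_collection_ref(table_ref: str, table_name: str,
                           alias: str = "$a$1") -> str:
    """Split a direct collection reference into a table-level reference
    plus a collection reference relative to the table alias.

    Hypothetical helper for illustration only.
    """
    prefix = table_name + "."
    if not table_ref.startswith(prefix):
        # Already a table-level reference: nothing to rewrite.
        return table_ref
    collection_path = table_ref[len(prefix):]
    return f"{table_name} {alias}, {alias}.{collection_path}"


from_clause = rewrite_collection_ref("complextypestbl.int_array",
                                     "complextypestbl")
query = f"SELECT item FROM {from_clause};"
```

With these inputs, query holds the rewritten statement shown above, and a deeper path such as customer_nested.c_orders.o_lineitems becomes customer_nested $a$1, $a$1.c_orders.o_lineitems.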

Performance

I executed the following query with num_nodes=1 on a non-transactional
table (without the rewrite), and on an ACID table (with the rewrite):

  select count(*) from customer_nested.c_orders.o_lineitems;

Without the rewrite:
Fetched 1 row(s) in 0.41s
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                            |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| F00:ROOT     | 1      | 1     | 13.61us  | 13.61us  |       |            | 0 B      | 0 B           |                                                   |
| 01:AGGREGATE | 1      | 1     | 3.68ms   | 3.68ms   | 1     | 1          | 16.00 KB | 10.00 MB      | FINALIZE                                          |
| 00:SCAN HDFS | 1      | 1     | 280.47ms | 280.47ms | 6.00M | 15.00M     | 56.98 MB | 8.00 MB       | tpch_nested_orc_def.customer.c_orders.o_lineitems |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+

With the rewrite:
Fetched 1 row(s) in 0.42s
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| Operator                  | #Hosts | #Inst | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| F00:ROOT                  | 1      | 1     | 25.16us  | 25.16us  |         |            | 0 B      | 0 B           |                                       |
| 05:AGGREGATE              | 1      | 1     | 3.44ms   | 3.44ms   | 1       | 1          | 63.00 KB | 10.00 MB      | FINALIZE                              |
| 01:SUBPLAN                | 1      | 1     | 16.52ms  | 16.52ms  | 6.00M   | 125.92M    | 47.00 KB | 0 B           |                                       |
| |--04:NESTED LOOP JOIN    | 1      | 1     | 188.47ms | 188.47ms | 0       | 10         | 24.00 KB | 12 B          | CROSS JOIN                            |
| |  |--02:SINGULAR ROW SRC | 1      | 1     | 0ns      | 0ns      | 0       | 1          | 0 B      | 0 B           |                                       |
| |  03:UNNEST              | 1      | 1     | 25.37ms  | 25.37ms  | 0       | 10         | 0 B      | 0 B           | $a$1.c_orders.o_lineitems o_lineitems |
| 00:SCAN HDFS              | 1      | 1     | 96.26ms  | 96.26ms  | 100.00K | 12.59M     | 38.19 MB | 72.00 MB      | default.customer_nested $a$1          |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+

So the overhead of the rewrite is small: the query took 0.42s with the
rewrite versus 0.41s without it.

Testing
* Added planner tests to PlannerTest/acid-scans.test
* Added E2E query tests to QueryTest/full-acid-complex-type-scans.test
* Added E2E tests for row ID generation to QueryTest/full-acid-rowid.test

Change-Id: I8b2c6cd3d87c452c5b96a913b14c90ada78d4c6f
Reviewed-on: http://gerrit.cloudera.org:8080/16228
Reviewed-by: Zoltan Borok-Nagy <bo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>


> Milestone 4: Read updated tables
> --------------------------------
>
>                 Key: IMPALA-9859
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9859
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>
> Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves this by assigning a unique row ID to each row and maintaining two sets of files in a table. The first set is in the delta directories and contains the INSERTed rows. The second set is in the delete-delta directories and contains the DELETEd rows.
> _Note: UPDATE operations are implemented via DELETE+INSERT._
> In the filesystem this looks like, e.g.:
> {noformat}
> full_acid/delta_0000001_0000001_0000/0000_0
> full_acid/delete_delta_0000002_0000002_0000/0000_0
> {noformat}
> During scanning we need to return INSERTed rows minus DELETEd rows. One way of doing that is to create an ANTI JOIN between INSERT and DELETE events.
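
As a rough illustration of the directory layout in the quoted description, the following Python sketch separates insert deltas from delete deltas by directory name. The regular expression assumes the naming pattern [delete_]delta_<minWriteId>_<maxWriteId>_<statementId>, which matches the two example paths. The classify helper is hypothetical and is not part of Impala.

```python
import re

# Assumed naming pattern: [delete_]delta_<minWriteId>_<maxWriteId>_<statementId>
DELTA_DIR = re.compile(r"^(delete_)?delta_(\d+)_(\d+)_(\d+)$")


def classify(path: str):
    """Return (kind, min_write_id, max_write_id) for a file under a delta
    directory, or None if no path component looks like a delta directory."""
    for part in path.split("/"):
        match = DELTA_DIR.match(part)
        if match:
            kind = "delete" if match.group(1) else "insert"
            return kind, int(match.group(2)), int(match.group(3))
    return None


paths = [
    "full_acid/delta_0000001_0000001_0000/0000_0",
    "full_acid/delete_delta_0000002_0000002_0000/0000_0",
]
classified = [classify(p) for p in paths]
```

A scanner could use such a classification to decide which files feed the insert side and which feed the delete side of the anti join.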


