You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@impala.apache.org by Franco VENTURI <fv...@comcast.net> on 2019/09/28 20:46:03 UTC
Joining two very large Kudu tables in Impala

At work we have two very large tables stored in Kudu (since we need to be able to perform row updates efficiently).
One of them (called table33 in the example SQL below) contains two columns called 'join_column1' and 'join_column2' that are used to join it with the other table (called table14 below).


A typical query would look like this one:


SELECT STRAIGHT_JOIN
       d.pk_date1_column,
       d.columnA,
       d.columnB,
       d.columnC
  FROM schema.table14 d
  JOIN (
SELECT join_column1, join_column2 FROM schema.table33 dt
WHERE dt.pk_column = 1234567890
   AND dt.pk_date2_column >= date_trunc('second',to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy HH:mm:ss'))
   AND dt.pk_date2_column <= seconds_add(date_trunc('second',to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy HH:mm:ss')),1)
   AND dt.column3 = 'SOME_VALUE_FOR_COLUMN_3'
   AND dt.column4 = 'SOME_VALUE_FOR_COLUMN_4') AS dtdt
WHERE dtdt.join_column1 = d.join_column1
   AND dtdt.join_column2 = d.join_column2
   AND d.pk_date1_column >= to_date(date_trunc('month',months_sub(to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy HH:mm:ss'), 3)))
   AND d.pk_date1_column <= to_date(date_trunc('month',months_add(to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy HH:mm:ss'), 6)))
ORDER BY d.pk_date1_column, d.date2_column;


The idea here is to use a very narrow subselect (using the columns 'pk_column', 'column3', and 'column4' from table33) to just select very few rows (a dozen or so), and then join with the other table (table14) to return the values of 'columnA', 'columnB', and 'columnC' for these few rows - if we 'knew' the values of the join columns for these two rows in advance, the whole join could be easily be replaced by an IN-list.


Some other useful information:
  - the cardinality of table14 is about 14 billion rows
  - the cardinality of table33 is about 33 billion rows
  - the purpose of the 'STRAIGHT_JOIN' hint is to avoid to have the Impala optimizer reorder the join (and hence get rid of any 'COMPUTE STATS' discussion)
  - the SQL query above is roughly modeled after the example here: https://impala.apache.org/docs/build/html/topics/impala_perf_joins.html (section: 'Overriding Join Reordering with STRAIGHT_JOIN')
  - the version of Impala we are using is 2.12 and the version of Kudu we are using is 1.7.0 (this is what comes with our vendor distribution)


Running EXPLAIN on the example query above shows the following plan:


PLAN-ROOT SINK
|
05:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: pk_date1_column ASC, date2_column ASC
|
03:SORT
|  order by: pk_date1_column ASC, date2_column ASC
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: d.join_column2 = join_column2, d.join_column1 = join_column1
|  runtime filters: RF002 <- join_column2, RF003 <- join_column1
|
|--04:EXCHANGE [BROADCAST]
|  |
|  01:SCAN KUDU [schema.table33 dt]
|     kudu predicates: dt.pk_column = 1234567890, dt.column3 = 'SOME_VALUE_FOR_COLUMN_3', dt.pk_date2_column <= TIMESTAMP '2019-06-10 18:13:49', dt.pk_date2_column >= TIMESTAMP '2019-06-10 18:13:48', dt.column4 = 'SOME_VALUE_FOR_COLUMN_4'
|
00:SCAN KUDU [schema.table14 d]
   kudu predicates: d.pk_date1_column <= TIMESTAMP '2019-12-01 00:00:00', d.pk_date1_column >= TIMESTAMP '2019-03-01 00:00:00'
   runtime filters: RF002 -> d.join_column2, RF003 -> d.join_column1


As you can see above, step '01:SCAN KUDU' uses the kudu predicates to filter down the results from table33 (and when the query runs, the query profile shows it takes only a few ms) and returns a dozen or so rows, however step '00:SCAN KUDU' doesn't use any of the results from the other part of the join and goes through billions of rows in table14 (taking about 10 minutes or more on our cluster) to finally return the columns requested in the main select.


On the other hand, if we manually run the very narrow subselect, cut&paste the join_column values for the few returned rows and replace the join with an IN-list, Impala takes advantage of this information and returns the requested columns in sub-second time.


This (joining two very large table to extract only a few rows) seems a very common use case for us and we were surprised to see this significant performance impact when using Impala on Kudu tables.


Going through the Impala JIRA we found a couple of interesting pieces of information:
  - a comment by Todd Lipcon (https://issues.apache.org/jira/browse/IMPALA-4252?focusedCommentId=15921058&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15921058) back in October 2016, suggesting the use of IN-lists, but it doesn't seem to have had any reply
  - a JIRA task about pushing Bloom filters to Kudu scanners (https://issues.apache.org/jira/browse/IMPALA-3741), created in June 2016, but it looks like it is still unresolved.


I was wondering if any of you had to face a similar issue joining two very large Kudu tables, and if you found an efficient way with Impala to run this kind of query (short of having to re-design those two tables, which would be a non-trivial task for us, given the table sizes, and given the number of other code changes it would require in all the other applications that populate and query those tables).


We could also tell all our end users that they can no longer have this kind of two table joins (which they have been used to for years in our current RDBMS), and that instead they have to run two selects and then add some code for the IN-list logic, but we were hoping first to be able to find an elegant SQL-only solution to this problem.


Thanks in advance,
Franco Venturi