You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Lincoln Lee <li...@gmail.com> on 2022/05/19 07:29:13 UTC

[DISCUSS] FLINK-27625: Add a new query hint for async lookup join

Hi everyone,

I would like to open a discussion on adding a new query hint for async
lookup join. Since the changes were relatively minor, no new flip was
created, and if needed, I will create one.

FLINK-27623 adds a global parameter 'table.exec.async-lookup.output-mode'
for table users so that all three control parameters related to async I/O
can be configured at the same job level.
As planned in the issue, we‘d like to go a step further to offer more
precise control for async join operation more than job level config,
to introduce a new join hint: ‘ASYNC_LOOKUP’.

For the hint option, for intuitive and user-friendly reasons, we want to
support both simple and kv forms, with all options except table name being
optional (use job level configuration if not set)

# 1. simple form: (ordered hint option list)
```
ASYNC_LOOKUP('tableName'[, 'output-mode', 'buffer-capacity', 'timeout'])
optional:
output-mode
buffer-capacity
timeout
```

Note: since Calcite currently does not support the mixed type hint options,
the table name here needs to be a string instead of an identifier. (For
`SqlHint`: The option format can not be mixed in, they should either be all
simple identifiers or all literals or all key value pairs.) We can improve
this after Calcite support.

# 2. kv form: (support unordered hint option list)
```
ASYNC_LOOKUP('table'='tableName'[, 'output-mode'='ordered|allow-unordered',
'capacity'='int', 'timeout'='duration'])

optional kvs:
'output-mode'='ordered|allow-unordered'
'capacity'='int'
'timeout'='duration'
```

e.g., if the job level configuration is:
```
table.exec.async-lookup.output-mode: ORDERED
table.exec.async-lookup.buffer-capacity: 100
table.exec.async-lookup.timeout: 180s
```

then the following hints:
```
1. ASYNC_LOOKUP('dim1', 'allow-unordered', '200', '300s')
2. ASYNC_LOOKUP('dim1', 'allow-unordered', '200')
3. ASYNC_LOOKUP('table'='dim1', 'output-mode'='allow-unordered')
4. ASYNC_LOOKUP('table'='dim1', 'timeout'='300s')
5. ASYNC_LOOKUP('table'='dim1', 'capacity'='300')
```

are equivalent to:
```
1. ASYNC_LOOKUP('dim1', 'allow-unordered', '200', '300s')
2. ASYNC_LOOKUP('dim1', 'allow-unordered', '200', '180s')
3. ASYNC_LOOKUP('table'='dim1', 'output-mode'='allow-unordered',
'capacity'='100', 'timeout'='180s')
4. ASYNC_LOOKUP('table'='dim1', 'output-mode'='ordered', 'capacity'='100',
'timeout'='300s')
5. ASYNC_LOOKUP('table'='dim1', 'output-mode'='ordered', 'capacity'='300',
'timeout'='180s')
```

In addition, if the lookup source implements both sync and async table
function, the planner prefers to choose the async function when the
'ASYNC_LOOKUP' hint is specified.

Looking forward to your feedback!

Best,
Lincoln Lee