You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Jark Wu <im...@gmail.com> on 2020/02/04 02:23:30 UTC

Re: Large intervaljoin related question

Hi Chen,

AFAIK, DataStream doesn't have too much operator level optimization for
large window and interval join.
The suggested best practices are
 - please do not slide a window with very small step
 - use rocksdb statebackend on SSD for large state.
 - increase parallelism for the operators of large state

The pane optmization which is mentioned in the FLINK-7001 has partially
implemented in Flink SQL blink planner's window operator. You can try it
there.

Regarding the rocksdb statebackend tuning, I'm not an expert on rocksdb
statebackend. Hence, I'm pulling in Yu Li who might help you.

Best,
Jark

On Sat, 14 Dec 2019 at 03:24, Chen Qin <qi...@gmail.com> wrote:

> Hi there,
>
> We had seen growing interest of using large window and interval join
> operation. What is recommended way of handling these use cases?(e.g
> DeltaLake in Spark)
> After some benchmark, we found performance seems a bottleneck (still) on
> support those use cases.
> How is performance improvement
> https://issues.apache.org/jira/browse/FLINK-7001 <
> https://issues.apache.org/jira/browse/FLINK-7001> going?
>
> In tuning side, we plan to test giving larger blob cache on rocskdb side
> ~4GB, will this help?
> Otherwise, we plan to write to external hive table (seems no partition
> supported yet) and run frequent ETL job there.
>
>
> Thanks,
> Chen