Posted to issues@flink.apache.org by "TisonKun (JIRA)" <ji...@apache.org> on 2018/10/17 07:40:01 UTC
[jira] [Issue Comment Deleted] (FLINK-10038) Parallel the creation of InputSplit if necessary

     [ https://issues.apache.org/jira/browse/FLINK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TisonKun updated FLINK-10038:
-----------------------------
    Comment: was deleted

(was: My original purpose in mentioning "parallelize the creation of InputSplit" was probably to parallelize the creation of ONE InputSplit. Take a look at {{FileInputFormat#createInputSplits}}: it creates InputSplits file by file. That is where I aim to parallelize. Hence the statement that "the interface for the creation of input splits is definitely InputSplitSource#createInputSplits". This could be done without modifying the interface, by changing the implementation of {{createInputSplits}}.
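
A minimal sketch of the per-file parallelization described above, in plain Java. It does not use Flink classes: {{SketchSplit}}, {{ParallelSplitSketch}}, {{splitsForFile}} and {{createSplits}} are hypothetical names, and file sizes are passed in directly instead of being read from a file system. It only illustrates that the per-file work can run on a thread pool while the method still returns one ordered list of splits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a file split; this is not Flink's FileInputSplit. */
final class SketchSplit {
    final String path;
    final long start;
    final long length;

    SketchSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

final class ParallelSplitSketch {

    /** Per-file work: cut one file of the given size into ranges of at most splitSize bytes. */
    static List<SketchSplit> splitsForFile(String path, long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SketchSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            splits.add(new SketchSplit(path, start, Math.min(splitSize, fileSize - start)));
        }
        return splits;
    }

    /** Submits one task per file, then joins the results in the original file order. */
    static List<SketchSplit> createSplits(List<String> paths, List<Long> sizes, long splitSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<SketchSplit>>> pending = new ArrayList<>();
            for (int i = 0; i < paths.size(); i++) {
                final String path = paths.get(i);
                final long size = sizes.get(i);
                Callable<List<SketchSplit>> task = () -> splitsForFile(path, size, splitSize);
                pending.add(pool.submit(task));
            }
            List<SketchSplit> all = new ArrayList<>();
            for (Future<List<SketchSplit>> f : pending) {
                all.addAll(f.get()); // waits for each file's task in submission order
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while creating splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("split creation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the caller still receives a single list from one method call, a change of this shape would stay behind the existing {{createInputSplits}} signature.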

However, your ideas here are also bright. A typical case that would benefit from them is a batch job with many files, which would prefer to use the RegionFailover strategy if possible.
Here I see 3 options. 1. Create InputSplits before the job runs. 2. Create InputSplits concurrently with scheduling the job. 3. Use a specific single task to generate the work.

Option 1 is easier to implement, as [~StephanEwen] said. Below are the concrete challenges for the remaining options.

My main concern is that in a batch job we prefer not to cancel all vertices and restart. What's worse, since we don't have batch checkpoints, the batch job has to restart completely. This is unacceptable for a large-scale batch job.
For option 2, what if the JM fails over after some input splits have been computed and sent off? We don't have a specific JM failover strategy now, so this causes the job to restart completely. Pursuing this option leads to a discussion of a JM failover strategy, that is, when the JM fails over and restarts, it can recover (reconcile) state from the previous one.
For option 3, there would be wider considerations about Source. Consider the two-input case (below). Currently we read from the source in a blocking manner. If we compute the input splits in a single task and still use the blocking approach, the downstream may get stuck waiting for one input while the other input is ready to be read.

Src1 ----\
Src2---->Join

One way to solve this issue is to read from the source in a non-blocking way. Assume we introduce a method {{boolean SourceFunction#next(Collector<T>)}}: when the downstream calls it, the source sends its data to the collector and returns true. If no more data remains, it returns false. This would also let a source read from files and produce data asynchronously.
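
A rough sketch of how such a pull-style contract could let a two-input consumer alternate between its sources, in plain Java. {{PollingSource}}, {{IteratorSource}} and {{TwoInputSketch}} are hypothetical names, and {{java.util.function.Consumer}} stands in for Flink's {{Collector}}. It follows the two-state contract described above (true after emitting, false when exhausted) and does not model a source that has no data yet but is not finished.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical pull-style source; not an existing Flink interface. */
interface PollingSource<T> {
    /** Emits the next record to out and returns true, or returns false once exhausted. */
    boolean next(Consumer<T> out);
}

/** Simple in-memory source backed by an iterator, for illustration only. */
final class IteratorSource<T> implements PollingSource<T> {
    private final Iterator<T> it;

    IteratorSource(Iterable<T> data) {
        this.it = data.iterator();
    }

    @Override
    public boolean next(Consumer<T> out) {
        if (!it.hasNext()) {
            return false;
        }
        out.accept(it.next());
        return true;
    }
}

final class TwoInputSketch {

    /** Polls both inputs in turn, so neither input is starved while the other still has data. */
    static <T> List<T> drainAlternating(PollingSource<T> left, PollingSource<T> right) {
        List<T> seen = new ArrayList<>();
        boolean leftOpen = true;
        boolean rightOpen = true;
        while (leftOpen || rightOpen) {
            if (leftOpen) {
                leftOpen = left.next(seen::add);
            }
            if (rightOpen) {
                rightOpen = right.next(seen::add);
            }
        }
        return seen;
    }
}
```

Here the downstream drives the loop, so it decides which input to poll next instead of being parked inside one source's blocking read.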

To sum up, focusing more on batch jobs, the main concerns would be JM failover for options 1 and 2 (plus batch checkpoints, which are external to this issue but significant), and a more flexible source for option 3.)

> Parallel the creation of InputSplit if necessary
> ------------------------------------------------
>
>                 Key: FLINK-10038
>                 URL: https://issues.apache.org/jira/browse/FLINK-10038
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: TisonKun
>            Priority: Major
>              Labels: improvement, inputformat, parallel, perfomance
>
> As a continuation of the discussion in the PR about parallelizing the creation of ExecutionJobVertex [here|https://github.com/apache/flink/pull/6353].
> [~StephanEwen] suggested that we could parallelize the creation of InputSplit, from which we would gain performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)