Posted to user@flink.apache.org by David Olsen <da...@gmail.com> on 2016/05/28 07:59:06 UTC

Parallel read text

After searching the internet (with keywords like 'apache flink parallel
read text') I still have not found the answer I am looking for, so I am
asking here before jumping into writing code ...

My problem is that I want to read a text file, or a set of split text
files, from the local file system. I want to read those files in parallel
and process them accordingly.

From what I have discovered so far:
- Use ExecutionEnvironment.readTextFile, but this seems to read with only 1
thread(?) (meaning it reads the file(s) from beginning to end)
- Use the streaming env with addSource[1], but it seems I would need to
implement my own source with RichParallelSourceFunction.

Are there any classes or implementations that can already read text in
parallel?

Thanks

[1].
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Reading-separate-files-in-parallel-tasks-as-input-td1623.html

Re: Parallel read text

Posted by Robert Metzger <rm...@apache.org>.
Hi David,

I guess you can verify it by adding custom log statements to the Flink
code (for that, you need to recompile Flink).
Maybe a debugger is also sufficient (if you are running Flink locally).
We are currently reworking the reading of static files for the streaming
environment. Maybe it's interesting to check out the new implementation [1]

[1] https://github.com/apache/flink/pull/2020


On Sat, May 28, 2016 at 1:49 PM, David Olsen <da...@gmail.com>
wrote:

> Thank you for the advice!
>
> Now I have a new question. I read in the source[1] that the streaming env
> uses FileSourceFunction, which extends RichParallelSourceFunction, to
> create input splits[2]. I know I can set the parallelism in the streaming
> env, but is there any way I can verify at runtime that the split files, or
> the file, are read in parallel?
>
> Thank you again for your help.
>
> [1].
> https://raw.githubusercontent.com/eBay/Flink/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java
>
> [2].
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/FileSourceFunction.java
>
>
>
> On 28 May 2016 at 17:52, Chesnay Schepler <ch...@apache.org> wrote:
>
>> ExecutionEnvironment.readTextFile will read the file in parallel.

Re: Parallel read text

Posted by David Olsen <da...@gmail.com>.
Thank you for the advice!

Now I have a new question. I read in the source[1] that the streaming env
uses FileSourceFunction, which extends RichParallelSourceFunction, to
create input splits[2]. I know I can set the parallelism in the streaming
env, but is there any way I can verify at runtime that the split files, or
the file, are read in parallel?

Thank you again for your help.

[1].
https://raw.githubusercontent.com/eBay/Flink/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java

[2].
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/FileSourceFunction.java



On 28 May 2016 at 17:52, Chesnay Schepler <ch...@apache.org> wrote:

> ExecutionEnvironment.readTextFile will read the file in parallel.

Re: Parallel read text

Posted by Chesnay Schepler <ch...@apache.org>.
ExecutionEnvironment.readTextFile will read the file in parallel.
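
As a rough illustration of what reading one text file "in parallel" can
mean, here is a plain-Java sketch. It is NOT Flink code and makes no claim
about how readTextFile is implemented internally. It cuts a file into byte
ranges ("splits"), lets one thread read each range, and tags every line
with the split that read it, which is also one simple way to check that
several readers were involved. The class SplitReadSketch, the TaggedLine
type and both method names are made up for this example, and it assumes
newline-terminated UTF-8 text on a local file system.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of Flink.
public class SplitReadSketch {

    /** A line together with the index of the split (reader) that read it. */
    public static final class TaggedLine {
        public final int split;
        public final String text;
        TaggedLine(int split, String text) { this.split = split; this.text = text; }
    }

    /** Reads every line that STARTS inside the byte range [start, end). */
    public static List<TaggedLine> readSplit(Path file, long start, long end, int split)
            throws IOException {
        List<TaggedLine> out = new ArrayList<>();
        // RandomAccessFile.readLine is unbuffered and slow; fine for a sketch.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                // Step back one byte and consume up to the next line break.
                // If we were mid-line, this skips the rest of a line that the
                // previous split owns; if not, it consumes only the line break.
                raf.seek(start - 1);
                raf.readLine();
            }
            while (raf.getFilePointer() < end) {
                String raw = raf.readLine();
                if (raw == null) break; // end of file
                // readLine maps each byte to one char, so re-decode as UTF-8.
                String text = new String(raw.getBytes(StandardCharsets.ISO_8859_1),
                                         StandardCharsets.UTF_8);
                out.add(new TaggedLine(split, text));
            }
        }
        return out;
    }

    /** Reads the whole file with one thread per split; result is in file order. */
    public static List<TaggedLine> readInParallel(Path file, int parallelism)
            throws Exception {
        long size = Files.size(file);
        long chunk = (size + parallelism - 1) / parallelism; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<TaggedLine>>> futures = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                final int split = i;
                final long start = Math.min(size, i * chunk);
                final long end = Math.min(size, start + chunk);
                futures.add(pool.submit(() -> readSplit(file, start, end, split)));
            }
            List<TaggedLine> all = new ArrayList<>();
            for (Future<List<TaggedLine>> f : futures) all.addAll(f.get());
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Write a small sample file, then count how many lines each split read.
        Path file = Files.createTempFile("split-read", ".txt");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1000; i++) lines.add("line " + i);
        Files.write(file, lines, StandardCharsets.UTF_8);

        int parallelism = 4;
        int[] perSplit = new int[parallelism];
        for (TaggedLine t : readInParallel(file, parallelism)) perSplit[t.split]++;
        for (int i = 0; i < parallelism; i++) {
            System.out.println("split " + i + " read " + perSplit[i] + " lines");
        }
        Files.delete(file);
    }
}
```

If more than one split reports a non-zero line count, more than one reader
took part. The same tagging idea should carry over to a Flink job, for
example by logging which parallel instance handled each record, but that
part is not shown here.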
