Posted to user@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2014/12/02 16:32:57 UTC

Input from nested directory structure

Hello all,

I want to run a Flink log processing job and my input is stored locally in
a nested directory structure, like the following:

logs_dir/
|-----/machine1/
|-----------/january.log
|-----------/february.log
...
|-----/machine2/
...

etc.

When providing "logs_dir" as the argument to readTextFile(), nothing is
read and no exception or error is returned.
Copying the nested individual files machine1/january.log,
machine1/february.log, ..., to the same directory works fine, but I was
wondering whether there is a better way to do this?

Thank you!
V.
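
A note on the copy workaround described above: if two machines have log
files with the same name (for example a january.log under both machine1 and
machine2), copying them into one directory under their original names would
overwrite one with the other. Below is a minimal sketch of a flattening step
that avoids this, using only JDK classes. LogFlattener and flatten() are
hypothetical names and not part of Flink, and joining the path components
with '_' is an assumption that breaks down if directory or file names
already contain '_' in conflicting ways.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink: copies a nested tree into one flat directory.
class LogFlattener {

    // Copies every regular file below sourceRoot into targetDir and
    // returns the number of files copied.
    static int flatten(Path sourceRoot, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);

        // Collect the files first so the copies never show up in the walk.
        List<Path> files;
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path file : files) {
            // machine1/january.log becomes machine1_january.log
            String flatName = sourceRoot.relativize(file).toString()
                    .replace(File.separatorChar, '_');
            Files.copy(file, targetDir.resolve(flatName),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return files.size();
    }
}
```

The flat target directory could then be given to readTextFile() as a single
input path.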

Re: Input from nested directory structure

Posted by Ufuk Celebi <uc...@apache.org>.
+1 I find this useful as well.

On 04 Dec 2014, at 22:02, Robert Metzger <rm...@apache.org> wrote:

> +1 for adding such a feature. It should be very easy to implement (basically extend the createInputSplits() method)


Re: Input from nested directory structure

Posted by Robert Metzger <rm...@apache.org>.
+1 for adding such a feature. It should be very easy to implement
(basically extend the createInputSplits() method)

On Tue, Dec 2, 2014 at 5:22 PM, Vasiliki Kalavri <va...@gmail.com>
wrote:

> Hi,
>
> thanks for replying!
>
> It would certainly be useful for my use case, but not absolutely
> necessary. If you think other people might find it useful too, I can open an
> issue.
> If not, I believe it would be nice to print a warning when a nested
> directory is given as input path,
> since right now the files in the base directory are processed normally,
> but the nested ones are simply ignored.
>
> Cheers,
> V.

Re: Input from nested directory structure

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi,

thanks for replying!

It would certainly be useful for my use case, but not absolutely necessary.
If you think other people might find it useful too, I can open an issue.
If not, I believe it would be nice to print a warning when a nested
directory is given as input path,
since right now the files in the base directory are processed normally,
but the nested ones are simply ignored.

Cheers,
V.
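
To make the suggested warning concrete, here is a rough sketch of such a
check, written against plain JDK classes rather than Flink's file system
API. NestedInputWarning and findSubdirectories() are hypothetical names, and
the warning text is only an example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical check, not part of Flink: finds sub-directories that a
// non-recursive enumeration of the input path would skip.
class NestedInputWarning {

    // Returns the immediate sub-directories of inputDir (empty if there are none).
    static List<Path> findSubdirectories(Path inputDir) throws IOException {
        try (Stream<Path> entries = Files.list(inputDir)) {
            return entries.filter(Files::isDirectory)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // "logs_dir" is the example directory from the original question.
        Path inputDir = Paths.get(args.length > 0 ? args[0] : "logs_dir");
        for (Path dir : findSubdirectories(inputDir)) {
            System.err.println("WARNING: nested directory " + dir + " will not be read");
        }
    }
}
```

Only the first level needs to be inspected for this purpose, since any
sub-directory at all means that some input may be skipped.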

On 2 December 2014 at 16:52, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> Not right now. The input formats do not recursively enumerate files. In
> that, we followed the way Hadoop did it.
>
> If that is something that is interesting, it should not be too hard to add
> to the FileInputFormat an option to do a complete recursive traversal of
> the directory structure.
>
> Greetings,
> Stephan

Re: Input from nested directory structure

Posted by Stephan Ewen <se...@apache.org>.
Hi!

Not right now. The input formats do not recursively enumerate files. In
that, we followed the way Hadoop did it.

If that is something that is interesting, it should not be too hard to add
to the FileInputFormat an option to do a complete recursive traversal of
the directory structure.

Greetings,
Stephan
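
For illustration, the difference between the flat enumeration and an
optional recursive traversal could look roughly like the sketch below. This
is not Flink's FileInputFormat code: FileEnumeration, enumerate() and the
"recursive" flag are hypothetical names, and the sketch works on the local
file system through JDK classes only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Flink's FileInputFormat.
class FileEnumeration {

    // Returns the files found in dir. Sub-directories are descended into
    // only when recursive is true.
    static List<Path> enumerate(Path dir, boolean recursive) throws IOException {
        List<Path> result = new ArrayList<>();
        enumerate(dir, recursive, result);
        return result;
    }

    private static void enumerate(Path dir, boolean recursive, List<Path> result)
            throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    if (recursive) {
                        enumerate(entry, true, result);
                    }
                    // Without the flag, nested directories are skipped silently.
                } else {
                    result.add(entry);
                }
            }
        }
    }
}
```

With recursive set to false, only files directly in the given directory are
returned; with true, files in nested directories are returned as well.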


On Tue, Dec 2, 2014 at 4:32 PM, Vasiliki Kalavri <va...@gmail.com>
wrote:

> Hello all,
>
> I want to run a Flink log processing job and my input is stored locally in
> a nested directory structure, like the following:
>
> logs_dir/
> |-----/machine1/
> |-----------/january.log
> |-----------/february.log
> ...
> |-----/machine2/
> ...
>
> etc.
>
> When providing "logs_dir" as the argument to readTextFile(), nothing is
> read and no exception or error is returned.
> Copying the nested individual files machine1/january.log,
> machine1/february.log, ..., to the same directory works fine, but I was
> wondering whether there is a better way to do this?
>
> Thank you!
> V.
>