Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2014/12/04 17:59:12 UTC

HDFS append

Hi guys,
how can I efficiently append data (as plain strings or as Avro records)
to HDFS using Flink?
Do I need to use Flume or can I avoid it?

Thanks in advance,
Flavio

Re: HDFS append

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks a lot Robert!
On Dec 15, 2014 12:54 PM, "Robert Metzger" <rm...@apache.org> wrote:


Re: HDFS append

Posted by Robert Metzger <rm...@apache.org>.
Hey Flavio,

this pull request got merged:
https://github.com/apache/incubator-flink/pull/260

With this, you can now simulate append behavior with Flink:

- You have a directory in HDFS where you put the files you want to append,
say hdfs:///data/appendjob/
- Each time you want to append something, you run your job and let it
create a new directory in hdfs:///data/appendjob/, let's
say hdfs:///data/appendjob/run-X/
- Now you can instruct the job to read the full output by letting it
recursively read hdfs:///data/appendjob/.
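The run-directory pattern above can be sketched with a local-filesystem analogue (plain Python file I/O stands in for HDFS and for the Flink job; the directory and file names are made up for illustration):

```python
import os
import tempfile

def append_run(base_dir, run_id, records):
    """Simulate an 'append': each batch goes into its own run-<id> directory."""
    run_dir = os.path.join(base_dir, "run-%d" % run_id)
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "part-0"), "w") as f:
        for rec in records:
            f.write(rec + "\n")

def read_all(base_dir):
    """Read the full 'appended' data set by walking the base directory recursively."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name)) as f:
                records.extend(line.rstrip("\n") for line in f)
    return records

base = tempfile.mkdtemp()
append_run(base, 1, ["a", "b"])   # first job run
append_run(base, 2, ["c"])        # second job run "appends" a new directory
print(sorted(read_all(base)))     # -> ['a', 'b', 'c']
```

Nothing is ever rewritten in place: each run only creates a new subdirectory, and the recursive read sees the union of all runs.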

I hope that helps.


Best,
Robert


On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

Re: HDFS append

Posted by Flavio Pompermaier <po...@okkam.it>.
I didn't know about that difference! Flink is very smart, then :)
Thanks for the explanation, Robert.

On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <rm...@apache.org> wrote:


Re: HDFS append

Posted by Robert Metzger <rm...@apache.org>.
Vasia is working on support for reading directories recursively. But I
thought that this would also allow you to simulate something like an append.

Did you notice an issue when reading many small files with Flink? Flink
handles the reading of files differently than Spark.

Spark basically starts a task for each file / file split. So if you have
millions of small files in your HDFS, Spark will start millions of tasks
(queued, however). You need to coalesce in Spark to reduce the number of
partitions; by default, it re-uses the partitioning of the preceding
operator.
Flink, on the other hand, starts a fixed number of tasks which read
multiple input splits; the splits are lazily assigned to these tasks once
they are ready to process new splits.
Flink will not create a partition for each (small) input file. I expect
Flink to handle that case a bit better than Spark (I haven't tested it,
though).
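As a toy model of the difference (not Flink's or Spark's actual scheduler code), compare creating one task per file with a fixed pool of tasks that pull splits from a shared queue as they become idle:

```python
from collections import deque

def one_task_per_split(splits):
    """Spark-style (simplified): one task is created per file / split."""
    return ["task-%d" % i for i, _ in enumerate(splits)]

def fixed_tasks_pull_splits(num_tasks, splits):
    """Flink-style (simplified): a fixed number of tasks exists; each idle
    task pulls the next pending split until the queue is empty."""
    pending = deque(splits)
    assigned = {t: [] for t in range(num_tasks)}
    t = 0
    while pending:
        assigned[t].append(pending.popleft())  # idle task requests a split
        t = (t + 1) % num_tasks
    return assigned

splits = ["file-%05d" % i for i in range(1000)]   # 1000 small files
print(len(one_task_per_split(splits)))            # -> 1000 tasks
assigned = fixed_tasks_pull_splits(4, splits)
print(len(assigned), len(assigned[0]))            # -> 4 250
```

In the pull model the number of tasks stays constant no matter how many small files exist; only the per-task workload grows.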



On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <po...@okkam.it>
wrote:


Re: HDFS append

Posted by Flavio Pompermaier <po...@okkam.it>.
Great! Appending data to HDFS will be a very useful feature!
I think you should then also think about how to read directories
containing a lot of small files efficiently. I know that this can be quite
inefficient; that's why Spark gives you a coalesce operation to be able to
deal with such cases.

On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <va...@gmail.com>
wrote:


Re: HDFS append

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi!

Yes, I took a look into this. I hope I'll be able to find some time to work
on it this week.
I'll keep you updated :)

Cheers,
V.

On 9 December 2014 at 14:03, Robert Metzger <rm...@apache.org> wrote:


Re: HDFS append

Posted by Robert Metzger <rm...@apache.org>.
It seems that Vasia started working on adding support for recursive
reading: https://issues.apache.org/jira/browse/FLINK-1307.
I'm still occupied with refactoring the YARN client; the HDFS refactoring
is next on my list.

On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <po...@okkam.it>
wrote:


Re: HDFS append

Posted by Flavio Pompermaier <po...@okkam.it>.
Any news about this Robert?

Thanks in advance,
Flavio

On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <rm...@apache.org> wrote:


Re: HDFS append

Posted by Robert Metzger <rm...@apache.org>.
Hi,

I think there is no support for appending to HDFS files in Flink yet.
HDFS supports it, but some adjustments in the system are required
(not deleting / creating the output directories before writing; exposing
the append() methods in the FS abstractions).

I'm planning to work on the FS abstractions in the next week; if I have
enough time, I can also look into adding support for append().
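A minimal local-file illustration of what is missing (ordinary Python file modes stand in for the HDFS / Flink FS abstraction; this is not Flink API code):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

def overwrite_sink(path, records):
    # Current sink behavior (simplified): (re)create the output,
    # discarding whatever a previous run wrote.
    with open(path, "w") as f:
        f.writelines(r + "\n" for r in records)

def append_sink(path, records):
    # What an exposed append() would allow: keep the existing data
    # and add new records at the end.
    with open(path, "a") as f:
        f.writelines(r + "\n" for r in records)

overwrite_sink(path, ["a"])
overwrite_sink(path, ["b"])       # the first run's data is gone
print(open(path).read())          # -> "b\n"
append_sink(path, ["c"])
print(open(path).read())          # -> "b\nc\n"
```

The needed adjustments amount to switching sinks from the first mode to the second, which is why not deleting the output before writing is part of the change.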

Another approach could be adding support for recursively reading
directories with the input formats. Vasia asked for this feature a few
days ago on the mailing list. If we had that feature, you could just
write to a directory and read the parent directory (with all the
directories for the appends).

Best,
Robert

On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <po...@okkam.it>
wrote:
