Posted to user@flume.apache.org by Nitin Kumar <ni...@gmail.com> on 2018/04/20 18:49:32 UTC

Append existing Avro file - HDFS Sink

Hi All,

I am using Flume v1.8, with an agent that comprises a Kafka Channel and an
HDFS Sink.
I am able to write data as Avro files on HDFS into an external Hive table,
but the problem is that whenever Flume restarts it closes the current file
and opens a new one, which leaves me with many small files. (Data is
partitioned by date.)

Can't Flume append to the existing file to avoid creating a new one?
Also, how can I solve this problem, which leads to the creation of too many
small files?
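
For reference, a minimal sketch of the layout described above (agent,
channel, and path names are placeholders; the hdfs.roll* settings are the
knobs that control when the Flume 1.8 HDFS sink starts a new file):

    agent.channels = kc
    agent.sinks = s1
    agent.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.kc.kafka.bootstrap.servers = broker1:9092
    agent.channels.kc.kafka.topic = events
    agent.sinks.s1.channel = kc
    agent.sinks.s1.type = hdfs
    agent.sinks.s1.hdfs.path = /warehouse/events/dt=%Y-%m-%d
    agent.sinks.s1.hdfs.useLocalTimeStamp = true
    agent.sinks.s1.hdfs.fileType = DataStream
    agent.sinks.s1.serializer = avro_event
    # a value of 0 disables that particular roll trigger
    agent.sinks.s1.hdfs.rollInterval = 3600
    agent.sinks.s1.hdfs.rollSize = 134217728
    agent.sinks.s1.hdfs.rollCount = 0

Even with large roll values like these, a restart still closes the open
file and starts a new one.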

Any help would be appreciated.

-- 

Regards,
Nitin Kumar

Re: Append existing Avro file - HDFS Sink

Posted by Nitin Kumar <ni...@gmail.com>.
Thanks Matt

On Sat, Apr 21, 2018 at 12:43 AM, Matt Sicker <bo...@gmail.com> wrote:

> It's not a Flume-native solution, but an alternative I used in the past
> was Kafka Connect with the HDFS connector plugin. That plugin provides
> configuration for how often to roll over Avro files.
>
> On 20 April 2018 at 13:49, Nitin Kumar <ni...@gmail.com> wrote:
>
>> [...]



-- 
Regards,
Nitin Kumar Choudhary

Re: Append existing Avro file - HDFS Sink

Posted by Matt Sicker <bo...@gmail.com>.
It's not a Flume-native solution, but an alternative I used in the past was
Kafka Connect with the HDFS connector plugin. That plugin provides
configuration for how often to roll over Avro files.
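
As a rough sketch, the rollover-related settings look something like this
(property names from Confluent's kafka-connect-hdfs as I remember them;
the topic, URL, and thresholds are made up):

    name=hdfs-avro-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    topics=events
    hdfs.url=hdfs://namenode:8020
    format.class=io.confluent.connect.hdfs.avro.AvroFormat
    partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
    path.format='dt'=YYYY-MM-dd
    partition.duration.ms=86400000
    locale=en-US
    timezone=UTC
    # roll a file after this many records or this much time, whichever comes first
    flush.size=100000
    rotate.interval.ms=600000

If I remember right, the connector also encodes Kafka offsets in the file
names, which is what lets it pick up cleanly after a restart.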

On 20 April 2018 at 13:49, Nitin Kumar <ni...@gmail.com> wrote:

> [...]



-- 
Matt Sicker <bo...@gmail.com>

Re: Append existing Avro file - HDFS Sink

Posted by Mike Percy <mp...@apache.org>.
Also consider setting up a Spark job or similar (Impala, Hive) to
periodically read the Avro files and write them out in a columnar format
(Parquet or ORC). That would give you small-files compaction (assuming you
delete the source files afterward) and better analytical read performance
on the columnar files.
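
A minimal PySpark sketch of such a compaction job (paths and the partition
layout are made up; reading Avro needs a spark-avro package on the
classpath, e.g. --packages org.apache.spark:spark-avro_2.11:2.4.0):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-compaction").getOrCreate()

    day = "2018-04-20"  # the date partition to compact
    src = "hdfs:///warehouse/events_avro/dt=" + day
    dst = "hdfs:///warehouse/events_parquet/dt=" + day

    # read all of the day's small Avro files at once
    df = spark.read.format("avro").load(src)

    # coalesce to a handful of large files and write columnar output
    df.coalesce(4).write.mode("overwrite").parquet(dst)

    spark.stop()

Only delete the Avro source files once the Parquet partition checks out.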

Mike

On Fri, Oct 12, 2018 at 12:20 AM Rickard Cardell <ri...@klarna.com>
wrote:

> [...]

Re: Append existing Avro file - HDFS Sink

Posted by Rickard Cardell <ri...@klarna.com>.
On Fri, 20 Apr 2018 at 20:49, Nitin Kumar <ni...@gmail.com> wrote:

> [...]
>
> Can't Flume append to the existing file to avoid creating a new one?
>
Hi,
No, not with the HDFS sink at least.

> Also, how can I solve this problem, which leads to the creation of too
> many small files?
>


We also used the HDFS sink, but because of the high maintenance we went
with the HBase sink instead, which also gave us deduplication. The major
drawback is that it requires an extra step, an HBase-to-HDFS export job.

Your many-small-files problem might be solved with an extra step, e.g. an
Oozie job that merges the smaller files into larger ones (a rough sketch of
such a merge follows below).

That would also solve the problem of the leftover temp files that Flume
doesn't clean up in some circumstances.
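
For illustration, the merge step could use avro-tools' concat command,
which concatenates Avro container files without re-serializing them (paths
and the version are made up; all inputs must share the same schema and
codec):

    # pull one day's small files down, concatenate, push the result back
    mkdir -p /tmp/merge
    hdfs dfs -get '/warehouse/events/dt=2018-04-20/*.avro' /tmp/merge/
    java -jar avro-tools-1.8.2.jar concat /tmp/merge/*.avro /tmp/merged.avro
    hdfs dfs -put /tmp/merged.avro /warehouse/events/dt=2018-04-20/
    # remove the original small files only after verifying the merged file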

/Rickard

