Posted to user@flume.apache.org by no jihun <je...@gmail.com> on 2016/03/21 09:10:27 UTC

About Avro file writing progress on HDFS via Flume.

Hello developers.

I am using Flume to write Avro files to HDFS.

I am curious why the size of the .avro.tmp file on HDFS does not grow
gradually, but only jumps at the very last moment (closing time), even
though many batches/transactions are committed in between.


For example:

There is continuous, high event traffic (~5k events/s) coming into the
channel, which is drained to HDFS by the HDFS sink.

When the '~~9320.avro.tmp' file is created, it starts at 895 KB.
[image: screenshot 2]

But even though the channel keeps draining to HDFS, the ~~9320.avro.tmp
file never grows in size.
- 5 minutes later
[image: screenshot 6]
- 5 minutes later
[image: screenshot 7]


Finally, a roll happens because of rollSize. At that moment the 9320.avro
file jumps to 113 MB.

[image: screenshot 3]

Likewise, the next file, 9321.avro.tmp, does not grow until it is rolled.
- 5 minutes later
[image: screenshot 8]



I thought the Avro file might be buffered on the Flume agent's machine and
the whole file flushed only at the last moment, on close/roll.

So I checked the network traffic at the rolling moment, but it does not
spike then.
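Another check I can think of is to read the open .tmp file directly and
count how many bytes are actually readable, compared to the listed length.
This is only a minimal sketch against the plain Hadoop client API (not
Flume code), and the path argument is just an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountReadableBytes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Path of a file that is still open for write, e.g. the .avro.tmp file
        Path p = new Path(args[0]);
        long readable = 0;
        byte[] buf = new byte[8192];
        try (FSDataInputStream in = fs.open(p)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                readable += n;
            }
        }
        // The listed length can lag far behind what is actually readable.
        System.out.println("readable bytes: " + readable);
        System.out.println("listed length : " + fs.getFileStatus(p).getLen());
    }
}

If the readable byte count keeps growing while the listed length stays at
895 KB, the data is reaching HDFS and only the reported size is stale.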

So now I suspect the HDFS sink does flush each transaction batch to HDFS,
but HDFS holds the stream somewhere inside Hadoop and does not write it to
disk (or expose it) until the file is closed.

Does anybody know the details of how an Avro file is created, flushed, and
closed on HDFS?



This is the configuration of the HDFS sink:

hadoop1.sinks.hdfsSk.type = hdfs
hadoop1.sinks.hdfsSk.channel = fileCh1
hadoop1.sinks.hdfsSk.hdfs.fileType = DataStream
hadoop1.sinks.hdfsSk.serializer = avro_event
hadoop1.sinks.hdfsSk.serializer.compressionCodec = snappy
hadoop1.sinks.hdfsSk.hdfs.path =
xxxxxxx/data/flume/%{category}/%{type}/%Y/%m/%d/%{partition}/%{hour}
hadoop1.sinks.hdfsSk.hdfs.filePrefix = %{type}_%Y-%m-%d_%H_%{host}
hadoop1.sinks.hdfsSk.hdfs.fileSuffix = .avro
hadoop1.sinks.hdfsSk.hdfs.rollInterval = 3700
hadoop1.sinks.hdfsSk.hdfs.rollSize = 67000000
hadoop1.sinks.hdfsSk.hdfs.rollCount = 0
hadoop1.sinks.hdfsSk.hdfs.batchSize = 10000
hadoop1.sinks.hdfsSk.hdfs.idleTimeout = 300
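
If I read the HDFS sink correctly, with hdfs.batchSize = 10000 it flushes
the open stream after each batch. The sketch below (plain Hadoop client
API, not the actual sink code; path and sizes are made up) illustrates
what such a flush does and does not guarantee:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushLengthDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/hflush-demo.tmp"); // made-up path
        byte[] chunk = new byte[1024 * 1024];
        try (FSDataOutputStream out = fs.create(p)) {
            for (int i = 0; i < 10; i++) {
                out.write(chunk);
                // hflush() makes the data readable by other clients,
                // but the NameNode's recorded length is not necessarily
                // updated, so directory listings can still show the old size.
                out.hflush();
                System.out.println("listed length after hflush #" + i + ": "
                        + fs.getFileStatus(p).getLen());
            }
        }
        // Only after close() is the final length guaranteed to be reported.
        System.out.println("listed length after close: "
                + fs.getFileStatus(p).getLen());
    }
}

That would match what I see: the data arrives continuously, but the listed
size only catches up when the file is closed.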


Thanks!

Re: About Avro file writing progress on HDFS via Flume.

Posted by no jihun <je...@gmail.com>.
Thanks Gonzalo.

For other people: this is the detailed explanation I found.

https://community.hortonworks.com/questions/6251/practical-limits-on-number-of-simultaneous-open-hd.html

Re: About Avro file writing progress on HDFS via Flume.

Posted by Gonzalo Herreros <gh...@gmail.com>.
HDFS doesn't work exactly like a regular filesystem. It works in blocks,
128 MB by default if I remember right.
It could be that the NameNode doesn't update the size until a block is closed.
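
If you want to confirm the block size on your cluster, here is a quick
sketch with the client API (the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Example path; the default block size can vary per path/filesystem.
        Path p = new Path("/data/flume");
        System.out.println("default block size: "
                + fs.getDefaultBlockSize(p) + " bytes");
    }
}

If that is what's happening, then with your rollSize of 67000000 the file
probably closes before the first block ever fills, so the listed size would
only update at close.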