Posted to user@flume.apache.org by "Wan Yi (武汉_技术部_搜索与精准化_万毅)" <wa...@yhd.com> on 2014/09/04 05:39:59 UTC

why lots of tmp files in hdfs

Hi all,
         I am using the HDFS sink to store logs, and I see lots of .tmp files (more than 10) in HDFS. Does anybody know why?

Below is my HDFS sink configuration.

Our Hadoop version is: Hadoop 2.3.0-cdh5.0.2
Our Flume version is: 1.4.0

a1.sinks.sinks1.type = hdfs
a1.sinks.sinks1.channel = ch1
a1.sinks.sinks1.hdfs.path = hdfs://xxxxxxx
a1.sinks.sinks1.hdfs.filePrefix = events
a1.sinks.sinks1.hdfs.batchSize = 1000
a1.sinks.sinks1.hdfs.rollCount = 0
a1.sinks.sinks1.hdfs.rollSize = 0
a1.sinks.sinks1.hdfs.rollInterval = 300
a1.sinks.sinks1.hdfs.idleTimeout = 1800000
a1.sinks.sinks1.hdfs.callTimeout = 180000
a1.sinks.sinks1.hdfs.threadsPoolSize = 250
a1.sinks.sinks1.hdfs.writeFormat = Text
a1.sinks.sinks1.hdfs.fileType = DataStream




Best Regards

Wayne Wan




Re: why lots of tmp files in hdfs

Posted by "Wan Yi (武汉_技术部_搜索与精准化_万毅)" <wa...@yhd.com>.
@Anandkumar Lakshmanan

Thanks for your reply.

I had assumed that idleTimeout was in milliseconds, like the callTimeout property.
I will try changing the idleTimeout value.





Best Regards

Wayne Wan


From: Anandkumar Lakshmanan [mailto:anand@orzota.com]
Sent: September 4, 2014 12:59
To: user@flume.apache.org
Subject: Re: why lots of tmp files in hdfs

Hi,

You can control the size of the files stored in HDFS with the following properties:

* hdfs.rollInterval ---> Number of seconds to wait before rolling the current file (0 = never roll based on time interval). The default value is 30 seconds.

* hdfs.rollSize ---> File size that triggers a roll, in bytes (0 = never roll based on file size). The default value is 1024 bytes.

* hdfs.rollCount ---> Number of events written to a file before it is rolled (0 = never roll based on number of events). The default value is 10.

Rolling can be based on file size, on the number of events in a file, or on the number of seconds to wait before rolling the file.

In your configuration you specified "rollInterval = 300", i.e. the sink waits 300 seconds (5 minutes) before rolling the current file.


* hdfs.idleTimeout ---> Timeout, in seconds, after which inactive files get closed (0 = disable automatic closing of idle files).

Also, you specified "idleTimeout = 1800000". That is 1800000 seconds, i.e. 30000 minutes (500 hours), so an idle file will be closed only after 30000 minutes of inactivity. This is the reason why you are getting all the files in the .tmp state.
Reduce this value to 30 or 60 seconds and it should work well.

Thanks
Anand.




Re: why lots of tmp files in hdfs

Posted by Anandkumar Lakshmanan <an...@orzota.com>.
Hi,

You can control the size of the files stored in HDFS with the following
properties:

* hdfs.rollInterval ---> Number of seconds to wait before rolling the
current file (0 = never roll based on time interval). The default value
is 30 seconds.

* hdfs.rollSize ---> File size that triggers a roll, in bytes (0 = never
roll based on file size). The default value is 1024 bytes.

* hdfs.rollCount ---> Number of events written to a file before it is
rolled (0 = never roll based on number of events). The default value is
10.

Rolling can be based on file size, on the number of events in a file, or
on the number of seconds to wait before rolling the file.

In your configuration you specified "rollInterval = 300", i.e. the sink
waits 300 seconds (5 minutes) before rolling the current file.
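
As a rough illustration of the three roll triggers, here is a sketch of a
sink that rolls on time only, as in your configuration. The agent and sink
names (a1, sinks1) and the values are taken from your post; the comments
only restate the units and defaults listed above.

```properties
# Roll on time only: close the current file every 300 seconds (5 minutes).
a1.sinks.sinks1.hdfs.rollInterval = 300
# 0 disables size-based rolling (the default would be 1024 bytes).
a1.sinks.sinks1.hdfs.rollSize = 0
# 0 disables count-based rolling (the default would be 10 events).
a1.sinks.sinks1.hdfs.rollCount = 0
```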


* hdfs.idleTimeout ---> Timeout, in seconds, after which inactive files
get closed (0 = disable automatic closing of idle files).

Also, you specified "idleTimeout = 1800000". That is 1800000 seconds,
i.e. 30000 minutes (500 hours), so an idle file will be closed only
after 30000 minutes of inactivity. This is the reason why you are
getting all the files in the .tmp state.
Reduce this value to 30 or 60 seconds and it should work well.
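
For instance, a sketch of the adjusted settings, assuming 60 seconds is an
acceptable idle period for your traffic (the value is only illustrative,
and the a1/sinks1 names come from your post). Note the different units:
hdfs.idleTimeout is given in seconds, while hdfs.callTimeout is given in
milliseconds.

```properties
# Close a file after 60 seconds without writes (unit: seconds).
a1.sinks.sinks1.hdfs.idleTimeout = 60
# Unchanged from the original configuration (unit: milliseconds, i.e. 180 seconds).
a1.sinks.sinks1.hdfs.callTimeout = 180000
```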

Thanks
Anand.



