Posted to user@flume.apache.org by "Wan Yi (武汉_技术部_搜索与精准化_万毅)" <wa...@yhd.com> on 2014/09/04 05:39:59 UTC
why lots of tmp files in hdfs
Hi all,
I am using the HDFS sink to store logs, and I see lots of .tmp files (more than 10) in HDFS. Does anybody know why?
Below is my HDFS sink configuration.
Our Hadoop version is Hadoop 2.3.0-cdh5.0.2.
Our Flume version is 1.4.0.
a1.sinks.sinks1.type = hdfs
a1.sinks.sinks1.channel = ch1
a1.sinks.sinks1.hdfs.path = hdfs://xxxxxxx
a1.sinks.sinks1.hdfs.filePrefix = events
a1.sinks.sinks1.hdfs.batchSize = 1000
a1.sinks.sinks1.hdfs.rollCount = 0
a1.sinks.sinks1.hdfs.rollSize = 0
a1.sinks.sinks1.hdfs.rollInterval = 300
a1.sinks.sinks1.hdfs.idleTimeout = 1800000
a1.sinks.sinks1.hdfs.callTimeout = 180000
a1.sinks.sinks1.hdfs.threadsPoolSize = 250
a1.sinks.sinks1.hdfs.writeFormat = Text
a1.sinks.sinks1.hdfs.fileType = DataStream
Best Regards
万毅 (Wayne Wan)
Dev@Personalization & Wireless Dept
________________________________
* Email: wanyi@yhd.com
* Cell: +86.1387.1388.731
* Addr: 8/F, Building F6, Optics Valley Software Park, Guanshan Avenue, Wuhan, China. 430074
________________________________
Re: why lots of tmp files in hdfs
Posted by "Wan Yi (武汉_技术部_搜索与精准化_万毅)" <wa...@yhd.com>.
@Anandkumar Lakshmanan
Thanks for your reply.
I originally thought idleTimeout was in milliseconds, like the callTimeout property.
I will try changing the idleTimeout.
Best Regards
Wayne Wan
From: Anandkumar Lakshmanan [mailto:anand@orzota.com]
Sent: September 4, 2014, 12:59
To: user@flume.apache.org
Subject: Re: why lots of tmp files in hdfs
Hi,
You can control when files stored in HDFS are rolled with the following properties:
* hdfs.rollInterval ---> Number of seconds to wait before rolling the current file (0 = never roll based on time interval). Default: 30 seconds.
* hdfs.rollSize ---> File size in bytes that triggers a roll (0 = never roll based on file size). Default: 1024 bytes.
* hdfs.rollCount ---> Number of events written to a file before it is rolled (0 = never roll based on number of events). Default: 10.
You choose whether to roll based on file size, number of events in a file, or number of seconds to wait before rolling the file.
In your configuration you specified "rollInterval = 300", i.e. wait 300 seconds (5 minutes) before rolling the current file.
* hdfs.idleTimeout ---> Timeout in seconds after which inactive files are closed (0 = disable automatic closing of idle files).
You also specified "idleTimeout = 1800000". This value is in seconds, not milliseconds, so that is 1,800,000 seconds (about 30,000 minutes, or roughly 20 days): an idle file is only closed after that long. This is why you are getting all the files stuck in the .tmp state.
Reduce this value to 30 or 60 seconds and it should work well.
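As a sketch, your sink section with only idleTimeout changed might look like this (same agent and sink names as your config; the 60-second value is illustrative, tune it to your traffic):

```properties
a1.sinks.sinks1.type = hdfs
a1.sinks.sinks1.channel = ch1
a1.sinks.sinks1.hdfs.path = hdfs://xxxxxxx
a1.sinks.sinks1.hdfs.filePrefix = events
a1.sinks.sinks1.hdfs.batchSize = 1000
# Roll purely on time: close and rename the file every 300 seconds (5 minutes)
a1.sinks.sinks1.hdfs.rollCount = 0
a1.sinks.sinks1.hdfs.rollSize = 0
a1.sinks.sinks1.hdfs.rollInterval = 300
# idleTimeout is in SECONDS: close a file after 60 seconds of inactivity
a1.sinks.sinks1.hdfs.idleTimeout = 60
# callTimeout, by contrast, is in milliseconds (180000 ms = 3 minutes)
a1.sinks.sinks1.hdfs.callTimeout = 180000
a1.sinks.sinks1.hdfs.writeFormat = Text
a1.sinks.sinks1.hdfs.fileType = DataStream
```

With this change, a bucket that receives no events for 60 seconds gets its file closed, which removes the .tmp suffix.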
Thanks
Anand.