Posted to user@flume.apache.org by Christian Schroer <cs...@autoscout24.com> on 2012/08/01 14:41:55 UTC

Problems with HDFS Sink (file rolling)

Hi,

I have some trouble setting up the HDFS sink in Flume-NG (CDH3U4, 1.1.0):

Here's my sink configuration:

agent.sinks.hdfsSinkSMP.type = hdfs
agent.sinks.hdfsSinkSMP.channel = memoryChannel
agent.sinks.hdfsSinkSMP.hdfs.filePrefix = flumenode1
agent.sinks.hdfsSinkSMP.hdfs.fileType = SequenceFile
agent.sinks.hdfsSinkSMP.hdfs.codeC = gzip
agent.sinks.hdfsSinkSMP.hdfs.rollCount = 0
agent.sinks.hdfsSinkSMP.hdfs.batchSize = 1
agent.sinks.hdfsSinkSMP.hdfs.rollInterval = 15
agent.sinks.hdfsSinkSMP.hdfs.rollSize = 0
agent.sinks.hdfsSinkSMP.hdfs.path = hdfs://namenode/user/hive/warehouse/someDatabase.db/someTable/%Y-%m-%d/%H00/%M/somePartition

Events are generated by a SyslogTcp source. We write the data into Hive partitions. This works, but it keeps a lot of .tmp files open. I disabled event-count- and size-based file rolling and enabled only the interval, so that files are closed after 15 seconds. But Flume keeps files open much longer than 15 seconds (sometimes for hours, or even never closing them). Stopping Flume also leaves .tmp files in those directories. Sometimes it opens new files in partitions without having any data for them. Maybe I'm doing the file rolling completely wrong?
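
For reference, here is my reading of the roll-related settings (the comments are just my own interpretation of the docs, so please correct me if it's wrong):

# 0 disables rolling based on the number of events written
agent.sinks.hdfsSinkSMP.hdfs.rollCount = 0
# 0 disables rolling based on file size
agent.sinks.hdfsSinkSMP.hdfs.rollSize = 0
# close and rename the current file 15 seconds after it was opened
agent.sinks.hdfsSinkSMP.hdfs.rollInterval = 15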

Some Hive jobs use data that is 5 minutes old, but if Flume renames a file after the job has started, the job fails. That's why I want to close the files after 15 seconds. New files are no problem.

Does anyone have an idea?

Best regards,
Christian

Re: Problems with HDFS Sink (file rolling)

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
This was fixed in one of the patches for 1.2. We have files separated by hour and the interval set to a bit over an hour, and everything is closed properly. I'm not entirely sure what happens when you restart Flume, though.
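
In case it helps, a rough sketch of the kind of setup we use (the sink name and path are placeholders and the values are from memory, so treat this as an illustration rather than our exact config):

# one bucket, and thus one open .tmp file, per hour instead of per minute
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%Y-%m-%d/%H
# roll a bit over an hour after a file is opened, so each hourly bucket
# gets closed shortly after the hour rolls over
agent.sinks.hdfsSink.hdfs.rollInterval = 3900
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollSize = 0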



Re: Problems with HDFS Sink (file rolling)

Posted by "Wang, Yongkun | Yongkun | BDD" <yo...@mail.rakuten.com>.
I remember having a similar experience with 1.1.0. I suggest downloading 1.2.0 and trying again.

Regards,
Yongkun Wang

