Posted to user@flume.apache.org by R P <ha...@outlook.com> on 2016/02/09 20:17:12 UTC

Flume HDFS sink memory requirement.

Hello All,

  Hope you are all having a great time. Thanks for reading my question; I appreciate any suggestion or reply.


I am evaluating Flume for writing to HDFS. We get sparse data that is bucketed into thousands of different logs. Because this data arrives sporadically throughout the day, we run into the HDFS small-files problem.


One way to address this problem is to use file size as the only condition for closing a file, via hdfs.rollSize (sketched below). Since we might then have thousands of files open for hours, I have the following questions.


1. Will Flume keep thousands of files open until the hdfs.rollSize condition is met?

2. How much memory does the HDFS sink use when thousands of files are open at a time?

3. Is the memory used for the HDFS event buffer equal to the data written to HDFS? E.g., if the thousands of files to be written have a total size of 500 GB, will the Flume sink need 500 GB of memory?
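
For reference, the size-only roll configuration I have in mind looks roughly like this; the agent and sink names, the HDFS path, the "logname" header, and the 128 MB threshold are just placeholders (source and channel definitions omitted):

    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.channel = fileChannel
    # Bucket events into per-log directories using an event header
    agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{logname}
    # Roll only on file size (~128 MB) ...
    agent.sinks.hdfs1.hdfs.rollSize = 134217728
    # ... and disable time-based and event-count-based rolling
    agent.sinks.hdfs1.hdfs.rollInterval = 0
    agent.sinks.hdfs1.hdfs.rollCount = 0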


Thanks again for your input.


-R

Re: Flume HDFS sink memory requirement.

Posted by Roshan Naik <ro...@hortonworks.com>.
  1.  The number of open files is configurable; see hdfs.maxOpenFiles. It may be better to set up your Flume source so that the HDFS sink only has to work on a smaller set of files (see the sketch after this list).
  2.  If I recall correctly, it writes the events out immediately and doesn't buffer. Some buffering surely happens in the HDFS client libs. Beyond that it should mostly be book-keeping info (open file handles etc.) and any memory used for compression (like a block per open file, if using block compression). Best to measure it with a test setup: see how the in-use memory consumption differs with 1 file open, 100 files open, 1000 files open.
  3.  No. Assuming you are using the file channel, you can try starting from, say, 8 GB as the max heap size for the agent and go from there (see the flume-env.sh note after this list). Memory consumption of the Memory/Spillable channels depends on their memory capacity settings.
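
To make 1 and 3 concrete, a rough sketch (the sink name and the numbers are illustrative starting points, not recommendations):

    # Cap the number of simultaneously open HDFS files; the oldest open
    # file is closed once the cap is exceeded
    agent.sinks.hdfs1.hdfs.maxOpenFiles = 500
    # Optionally close files that have been idle for this many seconds,
    # so handles don't stay open all day
    agent.sinks.hdfs1.hdfs.idleTimeout = 3600

The agent heap can be set in conf/flume-env.sh, e.g. export JAVA_OPTS="-Xms2g -Xmx8g", and then tuned based on the measurements suggested in 2.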

-roshan

From: R P <ha...@outlook.com>
Reply-To: "user@flume.apache.org" <us...@flume.apache.org>
Date: Tuesday, February 9, 2016 at 11:17 AM
To: "user@flume.apache.org" <us...@flume.apache.org>
Subject: Flume HDFS sink memory requirement.

