You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flume.apache.org by Roberto Coluccio <ro...@eng.it> on 2016/11/16 10:35:58 UTC

Understand JMS source + HDFS sink batch management

Hello folks,

I'm testing a Flume agent defined by a topology made of :

*JMS source* (Tibco implementation) -> *memory channel* -> *hdfs sink*

The *JMS source* has:

  * my_agent.sources.my_source.batchSize = 100

The *memory channel* has:

  * my_agent.channels.my_channel.capacity = 100

The *HDFS sink* has:

  * my_agent.sinks.my_sink.hdfs.batchSize = 100
  * my_agent.sinks.my_sink.hdfs.rollCount = 0
  * my_agent.sinks.my_sink.hdfs.rollInterval = 0
  * my_agent.sinks.my_sink.hdfs.idleTimeout = 0

I don't understand how/why new files on HDFS are created/closed. In 
fact, when I:

 1. launch the agent (JMS queue empty)
 2. push a new text message on the JMS queue

It happens that a new file is created by the HDFS, but not yet closed 
(as I expect). BUT, when I

     3. push again a new text message on the JMS queue

regardles how much time I waited to perform step 3, the HDFS sink closes 
the previously open file, then open a new one for the new incoming 
message consumed from the queue and processed through the channel.

This way, files will always have 1 and only 1 message inside them. I was 
expecting that number to be 100, according to the configuration 
mentioned above.

Any hints?

Best regards,

Roberto

Re: Understand JMS source + HDFS sink batch management

Posted by Chris Horrocks <ch...@hor.rocks>.

Hi Roberto,

Setting the roll intervals to 0 will stop the sink rolling the files in HDFS. Try setting hdfs.rollCount to the number of messages you want to roll the file on (I.e. The number of messages per file). Bare in mind setting this low will result in higher HDFS overhead.

--
Chris Horrocks

On Wed, Nov 16, 2016 at 10:35 am, Roberto Coluccio <'roberto.coluccio@eng.it'> wrote:

Hello folks,

I'm testing a Flume agent defined by a topology made of :

JMS source (Tibco implementation) -> memory channel -> hdfs sink

The JMS source has:

- my_agent.sources.my_source.batchSize = 100

The memory channel has:

- my_agent.channels.my_channel.capacity = 100

The HDFS sink has:

- my_agent.sinks.my_sink.hdfs.batchSize = 100
- my_agent.sinks.my_sink.hdfs.rollCount = 0
- my_agent.sinks.my_sink.hdfs.rollInterval = 0
- my_agent.sinks.my_sink.hdfs.idleTimeout = 0

I don't understand how/why new files on HDFS are created/closed. In fact, when I:

- launch the agent (JMS queue empty)
- push a new text message on the JMS queue

It happens that a new file is created by the HDFS, but not yet closed (as I expect). BUT, when I

3. push again a new text message on the JMS queue

regardles how much time I waited to perform step 3, the HDFS sink closes the previously open file, then open a new one for the new incoming message consumed from the queue and processed through the channel.

This way, files will always have 1 and only 1 message inside them. I was expecting that number to be 100, according to the configuration mentioned above.

Any hints?

Best regards,

Roberto

Re: Understand JMS source + HDFS sink batch management

Posted by Bessenyei Balázs Donát <be...@apache.org>.

Hi Roberto,

Do you happen to have any information about the messages themselves?
(Looking at https://flume.apache.org/FlumeUserGuide.html#hdfs-sink ,
there is the hdfs.rollSize setting, that might also be interesting.)

Have you tried using different setup for the source and channel? (It's
easy to try different things with the netcat source for example:
message frequency and size are easily simulated.)


Thank you,

Donat


2016-11-16 11:35 GMT+01:00 Roberto Coluccio <ro...@eng.it>:
> Hello folks,
>
> I'm testing a Flume agent defined by a topology made of :
>
> JMS source (Tibco implementation) -> memory channel -> hdfs sink
>
> The JMS source has:
>
> my_agent.sources.my_source.batchSize = 100
>
> The memory channel has:
>
> my_agent.channels.my_channel.capacity = 100
>
> The HDFS sink has:
>
> my_agent.sinks.my_sink.hdfs.batchSize = 100
> my_agent.sinks.my_sink.hdfs.rollCount = 0
> my_agent.sinks.my_sink.hdfs.rollInterval = 0
> my_agent.sinks.my_sink.hdfs.idleTimeout = 0
>
> I don't understand how/why new files on HDFS are created/closed. In fact,
> when I:
>
> launch the agent (JMS queue empty)
> push a new text message on the JMS queue
>
> It happens that a new file is created by the HDFS, but not yet closed (as I
> expect). BUT, when I
>
>     3. push again a new text message on the JMS queue
>
> regardles how much time I waited to perform step 3, the HDFS sink closes the
> previously open file, then open a new one for the new incoming message
> consumed from the queue and processed through the channel.
>
> This way, files will always have 1 and only 1 message inside them. I was
> expecting that number to be 100, according to the configuration mentioned
> above.
>
> Any hints?
>
> Best regards,
>
> Roberto
>
>
>
>
>
>
>