Posted to user@flume.apache.org by Gary Malouf <ma...@gmail.com> on 2013/05/14 20:26:13 UTC

What does HDFSSink batch size actually effect?

I've previously posted something similar to this on StackOverflow:
http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ

My understanding of batch size from looking at the code in flume-ng 1.3.x
is that batch size determines at what point data is written to hdfs.  With
my configuration below, I am not seeing any data written to file until the
rollInterval has passed.

imp-agent.channels.imp-ch1.type = memory
imp-agent.channels.imp-ch1.capacity = 40000
imp-agent.channels.imp-ch1.transactionCapacity = 1000

imp-agent.sources.avro-imp-source1.channels = imp-ch1
imp-agent.sources.avro-imp-source1.type = avro
imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
imp-agent.sources.avro-imp-source1.port = 41414

imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp

imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
imp-agent.sinks.hdfs-imp-sink1.type = hdfs
imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576

imp-agent.channels = imp-ch1
imp-agent.sources = avro-imp-source1
imp-agent.sinks = hdfs-imp-sink1

I bring this up because I want to know that once 'batchSize' messages
have been sent to Flume, they have been written to HDFS rather than held
until the log roll time.  My strong preference, if possible, is for data
to be written to the '.tmp' file throughout the hour and then rolled once
the 'rollInterval' amount of time has passed.

Re: What does HDFSSink batch size actually effect?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.

HDFS batch size determines the number of events to take from the channel 
and send in one go.

These will be split up into multiple files if bucketed, which is worth 
considering (how many events will get written to each file? If it's 
only a handful, a higher batch size or fewer files may be desirable).
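
As an illustration of the bucketing described above, here is a minimal
Python sketch of how a path template like the hdfs.path in the original
configuration could expand per event. It is not Flume's code: expand_path
is a hypothetical helper, it handles only %{header} and strftime-style
date escapes, it assumes the 'timestamp' header holds epoch milliseconds
(as set by the timestamp interceptor), and it uses UTC for simplicity
where a real agent may use local time.

```python
import re
from datetime import datetime, timezone


def expand_path(template, headers):
    """Expand %{name} from event headers, then date escapes such as %Y.

    Hypothetical helper for illustration only, not part of Flume.
    """
    # Assumption: 'timestamp' is epoch milliseconds, interpreted as UTC.
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    # Substitute header references first, e.g. %{host} -> web01.
    with_headers = re.sub(r"%\{(\w+)\}", lambda m: headers[m.group(1)], template)
    # The remaining %Y, %m, %d escapes are filled in from the timestamp.
    return ts.strftime(with_headers)


template = "/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/"
for host in ("web01", "web02"):
    event_headers = {"timestamp": "1368532800000", "host": host}
    print(expand_path(template, event_headers))
```

Each distinct expanded path is a separate bucket with its own open file,
so a batch of 10 events spread over several hosts puts only a few events
in each file.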

The size from hdfs -ls will display as 0, but if you actually download 
the file it should contain everything. Each batch invokes a sync() 
operation on every BucketWriter. I'm not entirely sure how not having 
append activated might affect this.
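
The batch behaviour described above can be sketched roughly as follows in
Python. This is a simplified model under stated assumptions, not the
actual sink implementation: the BucketWriter class and process_batch
function here are hypothetical stand-ins, the channel is a plain deque,
and the sketch syncs only the writers touched by the current batch.

```python
from collections import deque


class BucketWriter:
    """Hypothetical stand-in for one open HDFS file (one bucket)."""

    def __init__(self, path):
        self.path = path
        self.events = []     # events appended so far
        self.sync_count = 0  # number of sync() calls received

    def append(self, event):
        self.events.append(event)

    def sync(self):
        self.sync_count += 1


def process_batch(channel, writers, batch_size, bucket_path):
    """Take up to batch_size events, append each to its bucket's writer,
    then sync every writer touched by this batch. Returns events taken."""
    touched = []
    taken = 0
    while taken < batch_size and channel:
        event = channel.popleft()
        path = bucket_path(event)
        writer = writers.get(path)
        if writer is None:
            writer = writers[path] = BucketWriter(path)
        writer.append(event)
        if writer not in touched:
            touched.append(writer)
        taken += 1
    for writer in touched:
        writer.sync()
    return taken


# 25 events from two hosts, with batch size 10 as in the configuration.
channel = deque({"host": "web%d" % (i % 2), "body": i} for i in range(25))
writers = {}
batches = 0
while process_batch(channel, writers, 10, lambda e: "logger=" + e["host"]):
    batches += 1

for w in writers.values():
    print(w.path, len(w.events), w.sync_count)
```

In this model the events reach the open file after every batch, well
before any roll, which matches the point that the data should be there
even while the listed size still shows 0.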


Re: What does HDFSSink batch size actually effect?

Posted by Gary Malouf <ma...@gmail.com>.

Okay, so the display of 0 bytes in the file is misleading in all
likelihood.  This is a bit unfortunate, as for our use case we then need
to wait an hour to find out how much data is actually in each file.

My understanding is that by default, HDFS appends are NOT active.  I guess
the remaining question is what, if anything, having append turned on will
affect?


Re: What does HDFSSink batch size actually effect?

Posted by Gary Malouf <ma...@gmail.com>.

To add to my previous post, I have also looked into activating the HDFS
append setting, but the documentation on it is limited and it is tricky to
understand what effect it will have on my logging.

