Posted to user@flume.apache.org by Chris Neal <cw...@gmail.com> on 2013/03/12 21:43:34 UTC

Best way to increase throughput of Exec->Memory->Avro agent.

Hi all.

I've been working on this for quite some time, and need some advice from
the experts.  I have a two-tiered Flume architecture:

App Tier (all on one server):
 124 ExecSources -> MemoryChannel -> AvroSinks

HDFS Tier (on two servers):
  AvroSource -> FileChannel -> HDFSSinks

When I run the agents, the HDFS tier keeps up fine with the App Tier.
Queue sizes stay between 0 and 10000 (I have a batch size of 10000).  All is
good.

On the App Tier, when I view the JMX data through jconsole, I watch the
size of the MemoryChannel grow steadily until it reaches the max, then it
starts throwing exceptions about not being able to put the batch on the
channel as expected.

There seem to be two basic ways to increase the throughput of the App Tier:
1.  Increase the MemoryChannel's transactionCapacity and the corresponding
AvroSink's batch-size.  Both are set to 10000 for me.
2.  Increase the number of AvroSinks to drain the MemoryChannel.  I'm up to
64 Sinks now, which round-robin between the two Flume Agents on the HDFS
tier.

Both of those values seem quite high to me (batch size and number of
sinks).

Am I missing something as far as tuning?
Which would allow a greater increase in throughput: more Sinks or a larger
batch size?

I'm stumped here.  I still think I can get this to work. :)

Any suggestions are most welcome.
Thanks for your time.
Chris

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Roshan Naik <ro...@hortonworks.com>.

There would be less contention if you could reduce the sharing... so
maybe divide them into 31 per channel. 31 still looks like a
huge number. Best if you can consolidate 31 down to just 1 or 2?

Keep in mind there is one thread per sink and one per source (unless
you are spawning more inside your source / sink). A rule of thumb
(actually more like guidance) is 2 to 4 threads per core. So keep
an eye out for overloading your box with too many threads.
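
The thread guidance above can be checked with some quick arithmetic. This
is only a rough sketch: it assumes exactly one thread per source and per
sink, ignores every other JVM thread, and the core count of 8 is a made-up
value, not something stated in this thread.

```python
def thread_budget(num_sources, num_sinks, cores, threads_per_core=4):
    """Return (threads_needed, threads_allowed, over_budget).

    Assumes one thread per source and one per sink, and uses the
    2-to-4-threads-per-core guidance as the allowance.
    """
    needed = num_sources + num_sinks
    allowed = cores * threads_per_core
    return needed, allowed, needed > allowed


# 124 sources and 16 sinks are the counts from this thread;
# cores=8 is a hypothetical machine size.
needed, allowed, over = thread_budget(124, 16, cores=8)
print(f"need {needed} threads, guideline allows {allowed}, over budget: {over}")
```

On that hypothetical 8-core box the agent would want 140 threads against
a guideline of 32, which is why consolidating sources matters more than
how they are spread over channels.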

On Tue, Mar 12, 2013 at 2:55 PM, Chris Neal <cw...@gmail.com> wrote:
> So, in a 4 channel setup, would I bind each of the 124 sources to all of the
> 4 channels, or divide them up and put 31 sources on each individual channel?
> :)

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Chris Neal <cw...@gmail.com>.

So, in a 4 channel setup, would I bind each of the 124 sources to all of
the 4 channels, or divide them up and put 31 sources on each individual
channel? :)


On Tue, Mar 12, 2013 at 4:40 PM, Chris Neal <cw...@gmail.com> wrote:

> Beautiful.  Will try 4 channels in one Agent first.
> Thanks!

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Chris Neal <cw...@gmail.com>.

Beautiful.  Will try 4 channels in one Agent first.
Thanks!


On Tue, Mar 12, 2013 at 4:35 PM, Roshan Naik <ro...@hortonworks.com> wrote:

> Even 16 on a single channel might be on the higher side IMHO.
>
> Try instead splitting into four channels with 4 sinks each... or even
> four agents with one channel and 4 sinks each... it will reduce
> contention. Be careful to ensure the capacity of each channel is not
> too high, since you now have many channels.
> -roshan

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Roshan Naik <ro...@hortonworks.com>.

Even 16 on a single channel might be on the higher side IMHO.

Try instead splitting into four channels with 4 sinks each... or even
four agents with one channel and 4 sinks each... it will reduce
contention. Be careful to ensure the capacity of each channel is not
too high, since you now have many channels.
-roshan
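
One way to picture the four-channel layout is to deal the sources and
sinks out evenly. The sketch below assumes each source is bound to exactly
one channel (rather than to all four), and the names src1, sink1, ch1 and
so on are hypothetical labels, not taken from any real configuration.

```python
def split_evenly(items, groups):
    """Split items into `groups` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), groups)
    chunks, start = [], 0
    for g in range(groups):
        size = base + (1 if g < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


sources = [f"src{i}" for i in range(1, 125)]  # 124 hypothetical source names
sinks = [f"sink{i}" for i in range(1, 17)]    # 16 hypothetical sink names

# Pair the n-th chunk of sources with the n-th chunk of sinks on channel ch<n>.
layout = {
    f"ch{n}": {"sources": srcs, "sinks": snks}
    for n, (srcs, snks) in enumerate(
        zip(split_evenly(sources, 4), split_evenly(sinks, 4)), start=1
    )
}

for name, members in layout.items():
    print(name, len(members["sources"]), "sources,", len(members["sinks"]), "sinks")
```

With 124 sources and 16 sinks this gives 31 sources and 4 sinks per
channel, so each channel sees a quarter of the contention of the single
shared one.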

On Tue, Mar 12, 2013 at 2:24 PM, Chris Neal <cw...@gmail.com> wrote:
> Thanks for the reply.  You're definitely on to something with the
> ever-increasing number of sinks.  :)
>
> I scaled it back to 16 AvroSinks, and used a
> MemoryChannel.transactionCapacity of 1000, and AvroSink.batch-size of 1000.
> My ExecSource.batchSize is 100 (I chose this smaller number because there
> are so many of them (124), I didn't want 10s of thousands of events getting
> dropped on the MemoryChannel at once, rather just 1000s).  With those
> settings, things are keeping the MemoryChannel drained.  Finally getting
> somewhere! :)
>
> Much appreciate the prompt response.  If anything else comes to mind, please
> do let me know.
>
> Thanks again.
> Chris

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Chris Neal <cw...@gmail.com>.

Thanks for the reply.  You're definitely on to something with the
ever-increasing number of sinks.  :)

I scaled it back to 16 AvroSinks, and used a
MemoryChannel.transactionCapacity of 1000 and an AvroSink.batch-size of 1000.
My ExecSource.batchSize is 100. I chose this smaller number because there
are so many of them (124): I didn't want 10s of thousands of events getting
dropped on the MemoryChannel at once, rather just 1000s.  With those
settings, things are keeping the MemoryChannel drained.  Finally getting
somewhere! :)
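
For a feel of what those settings mean in event counts, here is a small
worked calculation. It is a sketch only, and it assumes the worst case
where every source has a full batch pending and every sink holds a full
batch at the same instant, which a real agent may never hit.

```python
def worst_case_in_flight(num_sources, source_batch, num_sinks, sink_batch):
    """Upper bound on events sitting in open transactions at one instant.

    Returns (events_being_put, events_being_taken), assuming every
    source and every sink has a completely full batch open.
    """
    puts = num_sources * source_batch
    takes = num_sinks * sink_batch
    return puts, takes


# Settings described in this message: 124 sources at batchSize 100,
# 16 sinks at batch-size 1000.
puts, takes = worst_case_in_flight(124, 100, 16, 1000)
print(f"sources may put up to {puts:,} events at once")
print(f"sinks may hold up to {takes:,} events at once")
```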

I much appreciate the prompt response.  If anything else comes to mind,
please do let me know.

Thanks again.
Chris



On Tue, Mar 12, 2013 at 4:12 PM, Roshan Naik <ro...@hortonworks.com> wrote:

> I meant 640,000, not 64,000.

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Roshan Naik <ro...@hortonworks.com>.

I meant 640,000, not 64,000.

On Tue, Mar 12, 2013 at 2:10 PM, Roshan Naik <ro...@hortonworks.com> wrote:
> Beyond a certain number of sinks, adding more won't help. My suspicion is
> that you may have gone way overboard.
>
> If your sink-side batch size is that large and you have 64 sinks in
> the round-robin, it will take a lot of events (64,000) pumped
> in by the source before the first event can start trickling out
> of any sink.  Also, memory consumption will be quite high: each sink
> will open a transaction and hold on to 10000 events. This is the cause
> of the memory channel filling up. Until the sink-side transaction is
> committed (i.e. 10k events are pulled), the memory reservation on the
> channel is not relinquished. So your memory channel size will have to be
> really high to support so many sinks, each with such a big batch size.
>
> My gut feel is that your source-side batch size is not much of an
> issue and can be smaller. Increasing the number of sinks will only
> help if the sink is indeed the bottleneck.

Re: Best way to increase throughput of Exec->Memory->Avro agent.

Posted by Roshan Naik <ro...@hortonworks.com>.

Beyond a certain number of sinks, adding more won't help. My suspicion is
that you may have gone way overboard.

If your sink-side batch size is that large and you have 64 sinks in
the round-robin, it will take a lot of events (64,000) pumped
in by the source before the first event can start trickling out
of any sink.  Also, memory consumption will be quite high: each sink
will open a transaction and hold on to 10000 events. This is the cause
of the memory channel filling up. Until the sink-side transaction is
committed (i.e. 10k events are pulled), the memory reservation on the
channel is not relinquished. So your memory channel size will have to be
really high to support so many sinks, each with such a big batch size.

My gut feel is that your source-side batch size is not much of an
issue and can be smaller. Increasing the number of sinks will only
help if the sink is indeed the bottleneck.
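
The reservation argument comes down to one multiplication. The sketch
below follows the reasoning in this message (each sink holding a full
batch in an open transaction); the channel capacity of 100,000 is a
made-up figure for illustration, since the real capacity is not stated
in the thread.

```python
def events_held_by_sinks(num_sinks, sink_batch_size):
    """Events reserved if every sink holds one full batch in an open transaction."""
    return num_sinks * sink_batch_size


def fits_in_channel(channel_capacity, num_sinks, sink_batch_size):
    """True if the channel can hold a full batch for every sink at once."""
    return events_held_by_sinks(num_sinks, sink_batch_size) <= channel_capacity


HYPOTHETICAL_CAPACITY = 100_000  # assumed value, not from the thread

# 64 sinks at batch size 10000 is the setup from the original question;
# 16 sinks at batch size 1000 is the scaled-back setup reported elsewhere
# in this thread.
for sinks, batch in [(64, 10_000), (16, 1_000)]:
    held = events_held_by_sinks(sinks, batch)
    ok = fits_in_channel(HYPOTHETICAL_CAPACITY, sinks, batch)
    print(f"{sinks} sinks x {batch:,} batch = {held:,} events, fits: {ok}")
```

The first case works out to 640,000 events, far more than the assumed
capacity, while the scaled-back case needs only 16,000.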
