Posted to user@flume.apache.org by Devin Suiter RDX <ds...@rdx.com> on 2013/12/17 17:30:16 UTC

File Channel Best Practice

Hi,

There has been a lot of discussion about file channel speed today, and I
have had a dilemma I was hoping for some feedback on, since the topic is
hot.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a
single disk for both checkpoint and data directories therefore throughput
is going to be extremely slow."

How do you solve in a practical sense the requirement for file channel to
have a range of disks for best R/W speed, yet still have network visibility
to source data sources and the Hadoop cluster at the same time?
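
To make the question concrete, the sort of layout I have in mind for the
channel is roughly the following (directory names are just placeholders,
and I'm going from the file channel docs, so correct me if the parameters
are off):

agent.channels = fc1
agent.channels.fc1.type = file
# checkpoint directory on its own disk...
agent.channels.fc1.checkpointDir = /disk1/flume/checkpoint
# ...and data directories spread across several more disks
agent.channels.fc1.dataDirs = /disk2/flume/data,/disk3/flume/data,/disk4/flume/data

That is, the channel wants several independent spindles, but the box also
has to sit somewhere it can see both the log sources and the cluster.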

It seems like for production file channel implementation, the best solution
is to give Flume a dedicated server somewhere near the edge with a JBOD
pile properly mounted and partitioned. But that adds to implementation
cost.

The alternative seems to be to run Flume on a physical Cloudera Manager
(SCM) server that has some extra disks, or to run Flume agents concurrently
with datanode processes on worker nodes, but neither of those seems like a
good idea - especially piggybacking on worker nodes, where a file channel >
HDFS flow will compound the disk contention...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host
with a fast SAN/NAS connection and get by OK, but that will collapse your
disk parallelism into shared storage at some point...

Any ideas on the subject?

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

Re: File Channel Best Practice

Posted by Devin Suiter RDX <ds...@rdx.com>.
Thanks Paul, that's good to know.

My cluster is sort of a combination of test and production, so we don't
tinker with the real cluster config often; my dev pseudo-cluster on a VM
doesn't really lend itself to file-channel testing, and my source is only
producing about 1.1 events/minute anyway, so this has been tough for me to
examine closely.

I appreciate the time you took to share.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 17, 2013 at 12:43 PM, Paul Chavez <
pchavez@verticalsearchworks.com> wrote:

> We co-locate our flume agents on our data nodes in order to have access to
> many ‘spindles’ for the file channels. We have a small cluster (10 nodes)
> so these are also our task tracker nodes and we haven’t seen any huge
> performance issues.
>
>
>
> For reference, our typical event ingestion rate is between 2k and 5k
> events per second under ‘normal’ production load. I recently had to
> backfill a couple of weeks of web logs though, and took the opportunity to
> examine max throughput rates and how heavy MR load affected things. During
> that ‘test’ we stabilized at about 12K events per second written to HDFS,
> that was a single agent using 2 HDFS sinks taking from one file channel. As
> far as I could tell my bottleneck was in the Avro hop between my collector
> and writer agents, not in the HDFS sinks. When we had all MR slots used by
> large batch jobs for extended amounts of time the event throughput degraded
> to about 3500 events/sec.
>
>
>
> I know these are just anecdotal data points, but wanted to share my
> experience with flume agents located on the actual data/task nodes
> themselves. I have done very little optimization aside from separating the
> file channel data/log directories onto separate drives.
>
>
>
> -Paul Chavez
>
>
>
> *From:* Devin Suiter RDX [mailto:dsuiter@rdx.com]
> *Sent:* Tuesday, December 17, 2013 8:30 AM
> *To:* user@flume.apache.org
> *Subject:* File Channel Best Practice
>
>
>
> Hi,
>
>
>
> There has been a lot of discussion about file channel speed today, and I
> have had a dilemma I was hoping for some feedback on, since the topic is
> hot.
>
>
>
> Regarding this:
>
> "Hi,
>
>
>
> 1) You are only using a single disk for file channel and it looks like a
> single disk for both checkpoint and data directories therefore throughput
> is going to be extremely slow."
>
>
>
> How do you solve in a practical sense the requirement for file channel to
> have a range of disks for best R/W speed, yet still have network visibility
> to source data sources and the Hadoop cluster at the same time?
>
>
>
> It seems like for production file channel implementation, the best
> solution is to give Flume a dedicated server somewhere near the edge with a
> JBOD pile properly mounted and partitioned. But that adds to implementation
> cost.
>
>
>
> The alternative seems to be to run Flume on a  physical Cloudera Manager
> SCM server that has some extra disks, or run Flume agents concurrent with
> datanode processes on worker nodes, but those don't seem good to do,
> especially piggybacking on worker nodes, and file channel > HDFS will
> compound the issue...
>
>
>
> I know the namenode should definitely not be involved.
>
>
>
> I suppose you could virtualize a few servers on a properly networked host
> and a fast SANS/NAS connection and get by ok, but that will merge your
> parallelization at some point...
>
>
>
> Any ideas on the subject?
>
>
> *Devin Suiter*
>
> Jr. Data Solutions Software Engineer
>
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>

RE: File Channel Best Practice

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
We co-locate our Flume agents on our data nodes in order to have access to many 'spindles' for the file channels. We have a small cluster (10 nodes), so these are also our task tracker nodes, and we haven't seen any huge performance issues.

For reference, our typical event ingestion rate is between 2k and 5k events per second under 'normal' production load. I recently had to backfill a couple of weeks of web logs though, and took the opportunity to examine max throughput rates and how heavy MR load affected things. During that 'test' we stabilized at about 12K events per second written to HDFS; that was a single agent using 2 HDFS sinks taking from one file channel. As far as I could tell my bottleneck was in the Avro hop between my collector and writer agents, not in the HDFS sinks. When we had all MR slots used by large batch jobs for extended amounts of time the event throughput degraded to about 3500 events/sec.

I know these are just anecdotal data points, but I wanted to share my experience with Flume agents located on the actual data/task nodes themselves. I have done very little optimization aside from separating the file channel data/log directories onto separate drives.
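
For what it's worth, a stripped-down sketch of that layout (agent names
and paths here are illustrative, not our actual config) looks roughly like:

# (sources omitted for brevity)
collector.channels = fc1
collector.sinks = hdfs1 hdfs2

collector.channels.fc1.type = file
# checkpoint and data directories on separate drives
collector.channels.fc1.checkpointDir = /disk1/flume/checkpoint
collector.channels.fc1.dataDirs = /disk2/flume/data,/disk3/flume/data

# two HDFS sinks draining the same file channel
collector.sinks.hdfs1.type = hdfs
collector.sinks.hdfs1.channel = fc1
collector.sinks.hdfs1.hdfs.path = /flume/weblogs
collector.sinks.hdfs1.hdfs.filePrefix = events-a
collector.sinks.hdfs1.hdfs.batchSize = 1000

collector.sinks.hdfs2.type = hdfs
collector.sinks.hdfs2.channel = fc1
collector.sinks.hdfs2.hdfs.path = /flume/weblogs
collector.sinks.hdfs2.hdfs.filePrefix = events-b
collector.sinks.hdfs2.hdfs.batchSize = 1000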

-Paul Chavez

From: Devin Suiter RDX [mailto:dsuiter@rdx.com]
Sent: Tuesday, December 17, 2013 8:30 AM
To: user@flume.apache.org
Subject: File Channel Best Practice

Hi,

There has been a lot of discussion about file channel speed today, and I have had a dilemma I was hoping for some feedback on, since the topic is hot.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a single disk for both checkpoint and data directories therefore throughput is going to be extremely slow."

How do you solve in a practical sense the requirement for file channel to have a range of disks for best R/W speed, yet still have network visibility to source data sources and the Hadoop cluster at the same time?

It seems like for production file channel implementation, the best solution is to give Flume a dedicated server somewhere near the edge with a JBOD pile properly mounted and partitioned. But that adds to implementation cost.

The alternative seems to be to run Flume on a  physical Cloudera Manager SCM server that has some extra disks, or run Flume agents concurrent with datanode processes on worker nodes, but those don't seem good to do, especially piggybacking on worker nodes, and file channel > HDFS will compound the issue...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host and a fast SANS/NAS connection and get by ok, but that will merge your parallelization at some point...

Any ideas on the subject?

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com<http://www.rdx.com/>

Re: File Channel Best Practice

Posted by Brock Noland <br...@cloudera.com>.
No problem. I am glad you started this email discussion and as I said
earlier, thank you for using our software! :)


On Wed, Dec 18, 2013 at 2:02 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

> Yes, excellent - I was a little muddy on some of the finer points, and I
> am glad you clarified for the sake of other mailing list users - I forgot I
> have the whole context in my head, but other readers might not.
>
> Thanks again!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <br...@cloudera.com> wrote:
>
>> Hi Devin,
>>
>> Please find my response below.
>>
>> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <ds...@rdx.com>wrote:
>>
>>>
>>> So, if I understand your position on sizing the source properly, you are
>>> saying that the "fsync" operation is the costly part - it locks the device
>>> it is flushing to until the operation completes, and takes some time, so if
>>> you are committing small batches to the channel frequently, you are
>>> monopolizing the device frequently
>>>
>>
>> Correct, when using file channel, small batches spend most of the time
>> actually performing fsyncs.
>>
>>
>>> , but if you set the batch size at the source large enough,
>>>
>>
>> The language here is troublesome because "source" is overloaded. The term
>> "source" could refer to the flume source or the "source of events" for a
>> tiered architecture. Additionally some flume sources cannot control batch
>> size (avro source, http source, syslog) and some have a batch size + a
>> configured timeout (exec source) that still results in small batches most
>> of the time.
>>
>> When using file channel the upstream "source" should send large batches
>> of events. This might be the source connected directly to the file channel
>> or in a tiered architecture with say n application servers each running a
>> local agent which uses memory channel and then forwards events to a
>> "collector" tier which uses file channel. In either case the upstream
>> "sources" should use a large batch size.
>>
>>
>>> you will "take" from the source less frequently, with more data
>>> committed in every operation.
>>>
>>
>> The concept here is correct - larger batch sizes result in large number
>> of I/O's per fsync, thus increasing throughput of the system.
>>
>> Reading goes much faster, and HDFS will manage disk scheduling through
>>> RecordWriter in the HDFS sink, so those are not as problematic - is that
>>> accurate?
>>>
>>
>> Just to level set for anyone reading this, File Channel doesn't use HDFS,
>> HDFS is not aware of File Channel, and the disks we are referring to are
>> disks used by the File Channel not HDFS.
>>
>>
>>> So, if you are using a syslog source, that doesn't really offer a batch
>>> size parameter, would you set up a tiered flow with an Avro hop in the
>>> middle to aggregate log streams?
>>>
>>
>> Yes, that is a common and recommended configuration. Large setups will
>> have a local agent using memory channel, a first tier using memory channel
>> and then a second tier using file channel.
>>
>>
>>> Something like syslog source>--memory channel-->Avro sink > Avro source
>>> (large batch) >--file channel-->HDFS sink(s) for example?
>>>
>>
>> Avro Source doesn't have a batch size parameter....here you need to set a
>> large batch at the Avro Sink layer.
>>
>>
>>> I appreciate the help you've given on this topic. It's also good to know
>>> that the best practices are going into the doc, that will push everything
>>> forward. I've read the Packt publishing book on Flume but it didn't get
>>> into as much detail as I would like. The Cloudera blogs have been really
>>> helpful too.
>>>
>>> Thanks so much!
>>>
>>
>> No problem!  Thank you for using our software!
>>
>> Brock
>>
>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: File Channel Best Practice

Posted by Devin Suiter RDX <ds...@rdx.com>.
Yes, excellent - I was a little muddy on some of the finer points, and I am
glad you clarified for the sake of other mailing list users - I forgot I
have the whole context in my head, but other readers might not.

Thanks again!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <br...@cloudera.com> wrote:

> Hi Devin,
>
> Please find my response below.
>
> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <ds...@rdx.com>wrote:
>
>>
>> So, if I understand your position on sizing the source properly, you are
>> saying that the "fsync" operation is the costly part - it locks the device
>> it is flushing to until the operation completes, and takes some time, so if
>> you are committing small batches to the channel frequently, you are
>> monopolizing the device frequently
>>
>
> Correct, when using file channel, small batches spend most of the time
> actually performing fsyncs.
>
>
>> , but if you set the batch size at the source large enough,
>>
>
> The language here is troublesome because "source" is overloaded. The term
> "source" could refer to the flume source or the "source of events" for a
> tiered architecture. Additionally some flume sources cannot control batch
> size (avro source, http source, syslog) and some have a batch size + a
> configured timeout (exec source) that still results in small batches most
> of the time.
>
> When using file channel the upstream "source" should send large batches of
> events. This might be the source connected directly to the file channel or
> in a tiered architecture with say n application servers each running a
> local agent which uses memory channel and then forwards events to a
> "collector" tier which uses file channel. In either case the upstream
> "sources" should use a large batch size.
>
>
>> you will "take" from the source less frequently, with more data committed
>> in every operation.
>>
>
> The concept here is correct - larger batch sizes result in large number of
> I/O's per fsync, thus increasing throughput of the system.
>
> Reading goes much faster, and HDFS will manage disk scheduling through
>> RecordWriter in the HDFS sink, so those are not as problematic - is that
>> accurate?
>>
>
> Just to level set for anyone reading this, File Channel doesn't use HDFS,
> HDFS is not aware of File Channel, and the disks we are referring to are
> disks used by the File Channel not HDFS.
>
>
>> So, if you are using a syslog source, that doesn't really offer a batch
>> size parameter, would you set up a tiered flow with an Avro hop in the
>> middle to aggregate log streams?
>>
>
> Yes, that is a common and recommended configuration. Large setups will
> have a local agent using memory channel, a first tier using memory channel
> and then a second tier using file channel.
>
>
>> Something like syslog source>--memory channel-->Avro sink > Avro source
>> (large batch) >--file channel-->HDFS sink(s) for example?
>>
>
> Avro Source doesn't have a batch size parameter....here you need to set a
> large batch at the Avro Sink layer.
>
>
>> I appreciate the help you've given on this topic. It's also good to know
>> that the best practices are going into the doc, that will push everything
>> forward. I've read the Packt publishing book on Flume but it didn't get
>> into as much detail as I would like. The Cloudera blogs have been really
>> helpful too.
>>
>> Thanks so much!
>>
>
> No problem!  Thank you for using our software!
>
> Brock
>

Re: File Channel Best Practice

Posted by Brock Noland <br...@cloudera.com>.
Hi Devin,

Please find my response below.

On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

>
> So, if I understand your position on sizing the source properly, you are
> saying that the "fsync" operation is the costly part - it locks the device
> it is flushing to until the operation completes, and takes some time, so if
> you are committing small batches to the channel frequently, you are
> monopolizing the device frequently
>

Correct, when using file channel, small batches spend most of the time
actually performing fsyncs.


> , but if you set the batch size at the source large enough,
>

The language here is troublesome because "source" is overloaded. The term
"source" could refer to the Flume source or the "source of events" in a
tiered architecture. Additionally, some Flume sources cannot control batch
size (avro source, http source, syslog) and some have a batch size plus a
configured timeout (exec source) that still results in small batches most
of the time.

When using file channel, the upstream "source" should send large batches of
events. That might be the source connected directly to the file channel,
or, in a tiered architecture with say n application servers each running a
local agent that uses memory channel and forwards events to a "collector"
tier that uses file channel, it might be those local agents. In either case
the upstream "sources" should use a large batch size.


> you will "take" from the source less frequently, with more data committed
> in every operation.
>

The concept here is correct - larger batch sizes result in a larger number
of I/Os per fsync, thus increasing the throughput of the system.

Reading goes much faster, and HDFS will manage disk scheduling through
> RecordWriter in the HDFS sink, so those are not as problematic - is that
> accurate?
>

Just to level set for anyone reading this, File Channel doesn't use HDFS,
HDFS is not aware of File Channel, and the disks we are referring to are
disks used by the File Channel not HDFS.


> So, if you are using a syslog source, that doesn't really offer a batch
> size parameter, would you set up a tiered flow with an Avro hop in the
> middle to aggregate log streams?
>

Yes, that is a common and recommended configuration. Large setups will have
a local agent using memory channel, a first tier using memory channel and
then a second tier using file channel.


> Something like syslog source>--memory channel-->Avro sink > Avro source
> (large batch) >--file channel-->HDFS sink(s) for example?
>

Avro Source doesn't have a batch size parameter; here you need to set a
large batch size at the Avro Sink layer.
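
As a rough sketch of that flow, with placeholder names and ports (check
the exact parameter names against the user guide for your Flume version):

# tier 1 (local agent): syslog -> memory channel -> avro sink
tier1.sources = sys1
tier1.channels = mem1
tier1.sinks = avro1

tier1.sources.sys1.type = syslogtcp
tier1.sources.sys1.host = 0.0.0.0
tier1.sources.sys1.port = 5140
tier1.sources.sys1.channels = mem1

tier1.channels.mem1.type = memory
tier1.channels.mem1.capacity = 100000
tier1.channels.mem1.transactionCapacity = 1000

tier1.sinks.avro1.type = avro
tier1.sinks.avro1.channel = mem1
tier1.sinks.avro1.hostname = collector01
tier1.sinks.avro1.port = 4545
# this is where the large batch lives
tier1.sinks.avro1.batch-size = 1000

# tier 2 (collector): avro source -> file channel -> HDFS sink
tier2.sources = avro1
tier2.channels = fc1
tier2.sinks = hdfs1

tier2.sources.avro1.type = avro
tier2.sources.avro1.bind = 0.0.0.0
tier2.sources.avro1.port = 4545
tier2.sources.avro1.channels = fc1

tier2.channels.fc1.type = file
tier2.channels.fc1.checkpointDir = /disk1/flume/checkpoint
tier2.channels.fc1.dataDirs = /disk2/flume/data,/disk3/flume/data

tier2.sinks.hdfs1.type = hdfs
tier2.sinks.hdfs1.channel = fc1
tier2.sinks.hdfs1.hdfs.path = /flume/syslog
tier2.sinks.hdfs1.hdfs.batchSize = 1000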


> I appreciate the help you've given on this topic. It's also good to know
> that the best practices are going into the doc, that will push everything
> forward. I've read the Packt publishing book on Flume but it didn't get
> into as much detail as I would like. The Cloudera blogs have been really
> helpful too.
>
> Thanks so much!
>

No problem!  Thank you for using our software!

Brock

Re: File Channel Best Practice

Posted by Devin Suiter RDX <ds...@rdx.com>.
Brock,

I saw your reply on this come through the other day, and meant to respond
but my day got away from me.

So, if I understand your position on sizing the source properly, you are
saying that the "fsync" operation is the costly part - it locks the device
it is flushing to until the operation completes, and takes some time - so
if you are committing small batches to the channel frequently, you are
monopolizing the device frequently. But if you set the batch size at the
source large enough, you will "take" from the source less frequently, with
more data committed in every operation. Reading goes much faster, and HDFS
will manage disk scheduling through RecordWriter in the HDFS sink, so those
are not as problematic - is that accurate?

So, if you are using a syslog source, that doesn't really offer a batch
size parameter, would you set up a tiered flow with an Avro hop in the
middle to aggregate log streams? Something like syslog source>--memory
channel-->Avro sink > Avro source (large batch) >--file channel-->HDFS
sink(s) for example?

I appreciate the help you've given on this topic. It's also good to know
that the best practices are going into the docs; that will push everything
forward. I've read the Packt Publishing book on Flume, but it didn't get
into as much detail as I would like. The Cloudera blogs have been really
helpful too.

Thanks so much!


*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 12:51 PM, Brock Noland <br...@cloudera.com> wrote:

> FYI I am trying to capture some of the best practices in the Flume doc
> itself:
>
> https://issues.apache.org/jira/browse/FLUME-2277
>
>
> On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <br...@cloudera.com> wrote:
>
>> Hi,
>>
>> I'd also add the biggest issue I see with the file channel is batch size
>> at the source. Long story short is that file channel was written to
>> guarantee no data loss. In order to do that when a transaction is committed
>> we need to perform a "fsync" on the disk the transaction was written to.
>> fsync's are very expensive so in order to obtain good performance, the
>> source must have written a large batch of data. Here is some more
>> information on this topic:
>>
>> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>>
>> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Brock
>>
>>
>> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <ia...@gmail.com> wrote:
>>
>>> Ive been meaning to try ZFS with an SSD based SLOG/ZIL (intent log) for
>>> this as it seems like a good use case.
>>>
>>> something like:
>>>
>>> pool
>>>   sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
>>> buffers in event of sudden power loss)
>>>   mirror
>>>     sda1
>>>     sda2
>>>   mirror
>>>     sda3
>>>     sda4
>>>
>>> theres probably further tuning that can be done as well within ZFS, but
>>> i believe the ZIL will allow for immediate responses to flumes
>>> checkpoint/data fsync's while the "actual data" is flushed asynchronously
>>> to the spindles.
>>>
>>> Haven't tried this and YMMV. Some good reading available here:
>>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>>
>>> Cheers
>>>
>>>
>>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <ds...@rdx.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> There has been a lot of discussion about file channel speed today, and
>>>> I have had a dilemma I was hoping for some feedback on, since the topic is
>>>> hot.
>>>>
>>>>  Regarding this:
>>>> "Hi,
>>>>
>>>> 1) You are only using a single disk for file channel and it looks like
>>>> a single disk for both checkpoint and data directories therefore throughput
>>>> is going to be extremely slow."
>>>>
>>>> How do you solve in a practical sense the requirement for file channel
>>>> to have a range of disks for best R/W speed, yet still have network
>>>> visibility to source data sources and the Hadoop cluster at the same time?
>>>>
>>>> It seems like for production file channel implementation, the best
>>>> solution is to give Flume a dedicated server somewhere near the edge with a
>>>> JBOD pile properly mounted and partitioned. But that adds to implementation
>>>> cost.
>>>>
>>>> The alternative seems to be to run Flume on a  physical Cloudera
>>>> Manager SCM server that has some extra disks, or run Flume agents
>>>> concurrent with datanode processes on worker nodes, but those don't seem
>>>> good to do, especially piggybacking on worker nodes, and file channel >
>>>> HDFS will compound the issue...
>>>>
>>>> I know the namenode should definitely not be involved.
>>>>
>>>> I suppose you could virtualize a few servers on a properly networked
>>>> host and a fast SANS/NAS connection and get by ok, but that will merge your
>>>> parallelization at some point...
>>>>
>>>> Any ideas on the subject?
>>>>
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>
>>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>

Re: File Channel Best Practice

Posted by iain wright <ia...@gmail.com>.
Hi Brock,

Just curious here and please forgive my ignorance :)

In terms of batching at the source for a file channel, is there a
combination of time and quota for the source polling?

For instance, does there have to be 1000 new events before anything is
loaded into the channel when using a 1k batch size, or, if say 5 seconds
pass and only 250 new events are available in the source, will it grab
those on some time-based interval?
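
E.g. with an exec source I am picturing something like this (my reading of
the docs - parameter names may be off, so please correct me):

agent.sources.tail1.type = exec
agent.sources.tail1.command = tail -F /var/log/app.log
agent.sources.tail1.channels = fc1
# push a batch once 1000 events are buffered...
agent.sources.tail1.batchSize = 1000
# ...or after 5 seconds, whichever comes first (if batchTimeout applies)
agent.sources.tail1.batchTimeout = 5000

i.e. would the 250 events get flushed by that kind of timeout rather than
waiting around for the full 1000?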

Thank you,

-- 
Iain Wright
Cell: (562) 852-5916

<http://www.labctsi.org/>
This email message is confidential, intended only for the recipient(s)
named above and may contain information that is privileged, exempt from
disclosure under applicable law. If you are not the intended recipient, do
not disclose or disseminate the message to anyone except the intended
recipient. If you have received this message in error, or are not the named
recipient(s), please immediately notify the sender by return email, and
delete all copies of this message.


On Wed, Dec 18, 2013 at 9:51 AM, Brock Noland <br...@cloudera.com> wrote:

> FYI I am trying to capture some of the best practices in the Flume doc
> itself:
>
> https://issues.apache.org/jira/browse/FLUME-2277
>
>
> On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <br...@cloudera.com> wrote:
>
>> Hi,
>>
>> I'd also add the biggest issue I see with the file channel is batch size
>> at the source. Long story short is that file channel was written to
>> guarantee no data loss. In order to do that when a transaction is committed
>> we need to perform a "fsync" on the disk the transaction was written to.
>> fsync's are very expensive so in order to obtain good performance, the
>> source must have written a large batch of data. Here is some more
>> information on this topic:
>>
>> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>>
>> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Brock
>>
>>
>> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <ia...@gmail.com> wrote:
>>
>>> Ive been meaning to try ZFS with an SSD based SLOG/ZIL (intent log) for
>>> this as it seems like a good use case.
>>>
>>> something like:
>>>
>>> pool
>>>   sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
>>> buffers in event of sudden power loss)
>>>   mirror
>>>     sda1
>>>     sda2
>>>   mirror
>>>     sda3
>>>     sda4
>>>
>>> theres probably further tuning that can be done as well within ZFS, but
>>> i believe the ZIL will allow for immediate responses to flumes
>>> checkpoint/data fsync's while the "actual data" is flushed asynchronously
>>> to the spindles.
>>>
>>> Haven't tried this and YMMV. Some good reading available here:
>>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>>
>>> Cheers
>>>
>>>
>>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <ds...@rdx.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> There has been a lot of discussion about file channel speed today, and
>>>> I have had a dilemma I was hoping for some feedback on, since the topic is
>>>> hot.
>>>>
>>>>  Regarding this:
>>>> "Hi,
>>>>
>>>> 1) You are only using a single disk for file channel and it looks like
>>>> a single disk for both checkpoint and data directories therefore throughput
>>>> is going to be extremely slow."
>>>>
>>>> How do you solve in a practical sense the requirement for file channel
>>>> to have a range of disks for best R/W speed, yet still have network
>>>> visibility to source data sources and the Hadoop cluster at the same time?
>>>>
>>>> It seems like for production file channel implementation, the best
>>>> solution is to give Flume a dedicated server somewhere near the edge with a
>>>> JBOD pile properly mounted and partitioned. But that adds to implementation
>>>> cost.
>>>>
>>>> The alternative seems to be to run Flume on a  physical Cloudera
>>>> Manager SCM server that has some extra disks, or run Flume agents
>>>> concurrent with datanode processes on worker nodes, but those don't seem
>>>> good to do, especially piggybacking on worker nodes, and file channel >
>>>> HDFS will compound the issue...
>>>>
>>>> I know the namenode should definitely not be involved.
>>>>
>>>> I suppose you could virtualize a few servers on a properly networked
>>>> host and a fast SANS/NAS connection and get by ok, but that will merge your
>>>> parallelization at some point...
>>>>
>>>> Any ideas on the subject?
>>>>
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>
>>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>

Re: File Channel Best Practice

Posted by Brock Noland <br...@cloudera.com>.
FYI I am trying to capture some of the best practices in the Flume doc
itself:

https://issues.apache.org/jira/browse/FLUME-2277


On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <br...@cloudera.com> wrote:

> Hi,
>
> I'd also add the biggest issue I see with the file channel is batch size
> at the source. Long story short is that file channel was written to
> guarantee no data loss. In order to do that when a transaction is committed
> we need to perform a "fsync" on the disk the transaction was written to.
> fsync's are very expensive so in order to obtain good performance, the
> source must have written a large batch of data. Here is some more
> information on this topic:
>
> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>
> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>
> Brock
>
>
> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <ia...@gmail.com> wrote:
>
>> Ive been meaning to try ZFS with an SSD based SLOG/ZIL (intent log) for
>> this as it seems like a good use case.
>>
>> something like:
>>
>> pool
>>   sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
>> buffers in event of sudden power loss)
>>   mirror
>>     sda1
>>     sda2
>>   mirror
>>     sda3
>>     sda4
>>
>> theres probably further tuning that can be done as well within ZFS, but i
>> believe the ZIL will allow for immediate responses to flumes
>> checkpoint/data fsync's while the "actual data" is flushed asynchronously
>> to the spindles.
>>
>> Haven't tried this and YMMV. Some good reading available here:
>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>
>> Cheers
>>
>>
>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <ds...@rdx.com> wrote:
>>
>>> Hi,
>>>
>>> There has been a lot of discussion about file channel speed today, and I
>>> have had a dilemma I was hoping for some feedback on, since the topic is
>>> hot.
>>>
>>>  Regarding this:
>>> "Hi,
>>>
>>> 1) You are only using a single disk for file channel and it looks like a
>>> single disk for both checkpoint and data directories therefore throughput
>>> is going to be extremely slow."
>>>
>>> How do you solve in a practical sense the requirement for file channel
>>> to have a range of disks for best R/W speed, yet still have network
>>> visibility to source data sources and the Hadoop cluster at the same time?
>>>
>>> It seems like for production file channel implementation, the best
>>> solution is to give Flume a dedicated server somewhere near the edge with a
>>> JBOD pile properly mounted and partitioned. But that adds to implementation
>>> cost.
>>>
>>> The alternative seems to be to run Flume on a  physical Cloudera Manager
>>> SCM server that has some extra disks, or run Flume agents concurrent with
>>> datanode processes on worker nodes, but those don't seem good to do,
>>> especially piggybacking on worker nodes, and file channel > HDFS will
>>> compound the issue...
>>>
>>> I know the namenode should definitely not be involved.
>>>
>>> I suppose you could virtualize a few servers on a properly networked
>>> host and a fast SANS/NAS connection and get by ok, but that will merge your
>>> parallelization at some point...
>>>
>>> Any ideas on the subject?
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>



-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: File Channel Best Practice

Posted by Brock Noland <br...@cloudera.com>.
Hi,

I'd also add that the biggest issue I see with the file channel is batch
size at the source. Long story short: file channel was written to guarantee
no data loss. In order to do that, when a transaction is committed we need
to perform an "fsync" on the disk the transaction was written to. fsyncs
are very expensive, so in order to obtain good performance the source must
have written a large batch of data. Here is some more information on this
topic:

http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
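
As a minimal illustration (paths and names are placeholders), the knob I
mean is the batch size on the source feeding the file channel, e.g. with a
spooling directory source:

agent.sources = spool1
agent.channels = fc1
agent.sinks = hdfs1

agent.sources.spool1.type = spooldir
agent.sources.spool1.spoolDir = /var/spool/flume
agent.sources.spool1.channels = fc1
# commit 1000 events per transaction instead of the small default,
# so each fsync on the file channel covers far more data
agent.sources.spool1.batchSize = 1000

agent.channels.fc1.type = file
agent.channels.fc1.checkpointDir = /disk1/flume/checkpoint
agent.channels.fc1.dataDirs = /disk2/flume/data

# a large batch on the take side helps for the same reason
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = fc1
agent.sinks.hdfs1.hdfs.path = /flume/events
agent.sinks.hdfs1.hdfs.batchSize = 1000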

Brock


On Tue, Dec 17, 2013 at 11:50 AM, iain wright <ia...@gmail.com> wrote:

> Ive been meaning to try ZFS with an SSD based SLOG/ZIL (intent log) for
> this as it seems like a good use case.
>
> something like:
>
> pool
>   sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
> buffers in event of sudden power loss)
>   mirror
>     sda1
>     sda2
>   mirror
>     sda3
>     sda4
>
> theres probably further tuning that can be done as well within ZFS, but i
> believe the ZIL will allow for immediate responses to flumes
> checkpoint/data fsync's while the "actual data" is flushed asynchronously
> to the spindles.
>
> Haven't tried this and YMMV. Some good reading available here:
> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>
> Cheers
>
>
> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <ds...@rdx.com> wrote:
>
>> Hi,
>>
>> There has been a lot of discussion about file channel speed today, and I
>> have had a dilemma I was hoping for some feedback on, since the topic is
>> hot.
>>
>>  Regarding this:
>> "Hi,
>>
>> 1) You are only using a single disk for file channel and it looks like a
>> single disk for both checkpoint and data directories therefore throughput
>> is going to be extremely slow."
>>
>> How do you solve in a practical sense the requirement for file channel to
>> have a range of disks for best R/W speed, yet still have network visibility
>> to source data sources and the Hadoop cluster at the same time?
>>
>> It seems like for production file channel implementation, the best
>> solution is to give Flume a dedicated server somewhere near the edge with a
>> JBOD pile properly mounted and partitioned. But that adds to implementation
>> cost.
>>
>> The alternative seems to be to run Flume on a  physical Cloudera Manager
>> SCM server that has some extra disks, or run Flume agents concurrent with
>> datanode processes on worker nodes, but those don't seem good to do,
>> especially piggybacking on worker nodes, and file channel > HDFS will
>> compound the issue...
>>
>> I know the namenode should definitely not be involved.
>>
>> I suppose you could virtualize a few servers on a properly networked host
>> and a fast SANS/NAS connection and get by ok, but that will merge your
>> parallelization at some point...
>>
>> Any ideas on the subject?
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: File Channel Best Practice

Posted by iain wright <ia...@gmail.com>.
I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
this, as it seems like a good use case.

something like:

pool
  sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
buffers in event of sudden power loss)
  mirror
    sda1
    sda2
  mirror
    sda3
    sda4

There's probably further tuning that can be done within ZFS as well, but I
believe the ZIL will allow for immediate responses to Flume's
checkpoint/data fsyncs while the "actual data" is flushed asynchronously
to the spindles.

Haven't tried this and YMMV. Some good reading available here:
https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/

Cheers


On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <ds...@rdx.com> wrote:

> Hi,
>
> There has been a lot of discussion about file channel speed today, and I
> have had a dilemma I was hoping for some feedback on, since the topic is
> hot.
>
>  Regarding this:
> "Hi,
>
> 1) You are only using a single disk for file channel and it looks like a
> single disk for both checkpoint and data directories therefore throughput
> is going to be extremely slow."
>
> How do you solve in a practical sense the requirement for file channel to
> have a range of disks for best R/W speed, yet still have network visibility
> to source data sources and the Hadoop cluster at the same time?
>
> It seems like for production file channel implementation, the best
> solution is to give Flume a dedicated server somewhere near the edge with a
> JBOD pile properly mounted and partitioned. But that adds to implementation
> cost.
>
> The alternative seems to be to run Flume on a  physical Cloudera Manager
> SCM server that has some extra disks, or run Flume agents concurrent with
> datanode processes on worker nodes, but those don't seem good to do,
> especially piggybacking on worker nodes, and file channel > HDFS will
> compound the issue...
>
> I know the namenode should definitely not be involved.
>
> I suppose you could virtualize a few servers on a properly networked host
> and a fast SANS/NAS connection and get by ok, but that will merge your
> parallelization at some point...
>
> Any ideas on the subject?
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>