Posted to user@flume.apache.org by Sebastiano Di Paola <se...@gmail.com> on 2014/09/03 08:53:53 UTC

performances tuning...

Hi there,
I'm a complete newbie with Flume, so I have probably made a mistake in my
configuration, but I cannot pin it down.
I want to achieve maximum transfer performance.
My Flume machine has 16 GB of RAM and 8 cores.
I'm using a very simple Flume architecture:
Source -> Memory Channel -> Sink
The source is of type netcat and the sink is hdfs.
The machine has a 1 Gb Ethernet link directly connected to the switch of the
Hadoop cluster.
The point is that Flume is very slow at loading the data into my HDFS
filesystem.
(i.e. using hdfs dfs -copyFromLocal myfile /flume/events/myfile from the same
machine I reach approx 250 Mb/s, while transferring the same file through
this Flume architecture runs at about 2-3 Mb/s.)
(The cluster is composed of 10 machines and was totally idle while I ran this
test, so it was not under stress; the traffic rate was measured on the Flume
machine's output interface in both experiments.)
(myfile has 10 million lines with an average size of 150 bytes each.)

From what I understand so far it doesn't seem to be a source issue: the
memory channel tends to fill up if I decrease the channel capacity (and even
making it very big does not affect sink performance), so it seems to me that
the problem is related to the sink.
To test this point I also tried changing the source to the "exec" type,
simply executing "cat myfile", but the result didn't change...


Here's the config I used...

 # list the sources, sinks and channels for the agent
test.sources = r1
test.channels = c1
test.sinks = s1

# exec attempt
test.sources.r1.type = exec
test.sources.r1.command = cat /tmp/myfile

# my netcat attempt
#test.sources.r1.type = netcat
#test.sources.r1.bind = localhost
#test.sources.r1.port = 6666

# my file channel attempt
#test.channels.c1.type = file

#my memory channel attempt
test.channels.c1.type = memory
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 10000

# How do I properly set these parameters? Even if I enable them, nothing changes
# in my performance. (What is the buffer percentage used for?)
#test.channels.c1.byteCapacityBufferPercentage = 50
#test.channels.c1.byteCapacity = 100000000

# set channel for source
test.sources.r1.channels = c1
# set channel for sink
test.sinks.s1.channel = c1

test.sinks.s1.type = hdfs
test.sinks.s1.hdfs.useLocalTimeStamp = true

test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data
test.sinks.s1.hdfs.inUseSuffix = .dat

# How should I set this parameter? (I basically want to send as much data as I can.)
test.sinks.s1.hdfs.batchSize = 10000

#test.sinks.s1.hdfs.round = true
#test.sinks.s1.hdfs.roundValue = 5
#test.sinks.s1.hdfs.roundUnit = minute

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0

# compression attempt
#test.sinks.s1.hdfs.fileType = CompressedStream
#test.sinks.s1.hdfs.codeC=gzip
#test.sinks.s1.hdfs.codeC=BZip2Codec
#test.sinks.s1.hdfs.callTimeout = 120000

Can someone show me how to find this bottleneck / configuration mistake? (I
can't believe that this is Flume's real performance on my machine.)

Thanks a lot if you can help me
Regards.
Sebastiano

Re: performances tuning...

Posted by Guillermo Ortiz <ko...@gmail.com>.
What was your final configuration? What speed did you get?


Re: performances tuning...

Posted by Sebastiano Di Paola <se...@gmail.com>.
I raised batchSize by a factor of 100, added more heap space, and the speed
increased...
I still haven't reached the same speed as using "hdfs dfs -copyFromLocal",
but I'm pretty sure it's a tuning problem.
Thanks a lot for your hint.
Regards
Seba
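[Editorial note: a minimal sketch of that kind of tuning, with illustrative
values that are not the poster's actual settings. The sink batch is enlarged,
the channel's transactionCapacity is kept at least as large as the batch so
each take fits in one transaction, and the agent heap is raised via
flume-env.sh.]

# illustrative values only
test.channels.c1.capacity = 1000000
test.channels.c1.transactionCapacity = 1000000
test.sinks.s1.hdfs.batchSize = 1000000

# conf/flume-env.sh -- give the agent more heap for the larger memory channel
export JAVA_OPTS="-Xms1g -Xmx4g"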



Re: performances tuning...

Posted by Sandeep Khurana <sk...@gmail.com>.
Since you mentioned an "average size of 150 bytes each" for your records, I
would try increasing the batch size to a higher value.


"HDFS batch size determines the number of events to take from the channel and
send in one go."

So in one shot you are currently sending 10000 * 150 = 1,500,000 bytes (about
1.5 MB) to HDFS.
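[Editorial note: a sketch of what a larger batch could look like; the values
are illustrative, not from the thread. The channel's transactionCapacity has
to grow with the batch, since the sink takes one batch per transaction.]

# illustrative values only
test.channels.c1.transactionCapacity = 100000
test.sinks.s1.hdfs.batchSize = 100000    # ~100000 events * ~150 bytes = ~15 MB per batch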




-- 
Thanks and regards
Sandeep Khurana

Re: performances tuning...

Posted by Sebastiano Di Paola <se...@gmail.com>.
In my experiment I just want to transfer a single file, to test what
performance I can achieve, so rolling the file on HDFS is not vital at this
point.
Anyway, I did some tests rolling the file every 300 seconds.
What I can't explain to myself is the "slow" output from the sink: the
memory channel overflows if it's not big enough, so it seems that the source
produces data at a higher rate than the sink is able to process and send to
HDFS.
I'm not sure whether it helps to pinpoint my "configuration mistake", but
I'm using Flume 1.5.0.1 (I also tried Flume 1.5.0).
Regards.
Seba
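[Editorial note: one way to confirm where events pile up is Flume's built-in
JSON monitoring; this is a sketch, not something run in the thread, and the
config file name and port are assumptions. A ChannelFillPercentage that keeps
climbing while the sink's EventDrainSuccessCount grows slowly points at the
sink side.]

# start the agent with HTTP/JSON monitoring enabled (port is arbitrary)
flume-ng agent -n test -c conf -f test.conf \
    -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# then inspect the per-component counters
curl http://localhost:34545/metrics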



Re: performances tuning...

Posted by Sandeep Khurana <sk...@gmail.com>.
I see that you have the settings below set to zero. Don't you want rolling to
HDFS to happen based on any of size, count, or time interval?

test.sinks.s1.hdfs.rollSize = 0
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0
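[Editorial note: a sketch of the opposite setup, with illustrative values not
taken from the thread: roll by size only, at roughly one HDFS block per file,
with the count and time triggers disabled.]

test.sinks.s1.hdfs.rollSize = 134217728   # ~128 MB per file
test.sinks.s1.hdfs.rollCount = 0
test.sinks.s1.hdfs.rollInterval = 0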




-- 
Thanks and regards
Sandeep Khurana

Re: performances tuning...

Posted by Sebastiano Di Paola <se...@gmail.com>.
Hi Paul,
thanks for your answer.
As I'm a Flume newbie, how can I attach multiple sinks to the same channel?
(Do they read data from the memory channel in round-robin fashion?)
(Does this create multiple files on HDFS? Because that is not what I expect:
I have a 500 MB data file at the source and I would like to end up with only
one file on HDFS.)

I can't believe that I cannot achieve such performance with a single sink.
I'm pretty sure it's a configuration issue!
Besides this, how should I tune the batchSize parameter? (Of course I have
already tried setting it to about 10 times the number in my config, but with
no relevant improvement.)
Regards.
Seba



Re: performances tuning...

Posted by Paul Chavez <pc...@ntent.com>.
Start adding additional HDFS sinks attached to the same channel. You can also tune the batch size when writing to HDFS to increase per-sink performance.
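[Editorial note: a sketch of what that could look like; names and values are
illustrative, not from the thread. Two HDFS sinks drain the same memory
channel; each sink runs in its own thread, so they empty the channel
concurrently rather than in strict round-robin, and each writes its own
files, so distinct filePrefix values keep them from colliding.]

test.sinks = s1 s2

test.sinks.s1.type = hdfs
test.sinks.s1.channel = c1
test.sinks.s1.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s1.hdfs.filePrefix = log-data-1
test.sinks.s1.hdfs.batchSize = 10000

test.sinks.s2.type = hdfs
test.sinks.s2.channel = c1
test.sinks.s2.hdfs.path = hdfs://mynodemanager:9000/flume/events/
test.sinks.s2.hdfs.filePrefix = log-data-2
test.sinks.s2.hdfs.batchSize = 10000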
