Posted to user@flume.apache.org by Shara Shi <sh...@dhgate.com> on 2012/08/28 05:19:04 UTC

Re: Re: HDFS SINK Performance

Hi Denny,

It is 20 MB/min; I have confirmed it.

I sent data with avro-client from my local machine to the Flume agent, and I
really got only 20 MB/min.

So I am trying to find out why.

Regards,

Shara

From: Denny Ye [mailto:dennyy99@gmail.com]
Sent: August 28, 2012 11:02
To: user@flume.apache.org
Subject: Re: Re: HDFS SINK Performance

20MB/min or 20MB/sec?

I suspect there may be a mistake in how the figure was written. Can you confirm it?

-Regards

Denny Ye

2012/8/28 Shara Shi <sh...@dhgate.com>

Hi Denny,

A throughput of 45 MB/sec would be fine for me.

But I am getting only 20 MB per minute.

What's wrong with my configuration?

Regards,

Shara

From: Denny Ye [mailto:dennyy99@gmail.com]
Sent: August 27, 2012 20:05
To: user@flume.apache.org
Subject: Re: HDFS SINK Performance

Hi Shara,

    You are using MemoryChannel as the repository. I tested it and got
45 MB/sec without full GC, using locally updated code. Is that your goal, or
do you need higher throughput?

-Regards

Denny Ye

2012/8/27 Shara Shi <sh...@dhgate.com>

Hi All,

However I tune the parameters of the HDFS sink, I cannot get throughput
higher than 20 MB per minute.

Is that normal? It seems odd to me.

How can I improve it?

Regards,

Ruihong Shi

==========================================

 

# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# Define a memory channel called ch1 on collector1
collector2.channels.ch2.type = memory
collector2.channels.ch2.capacity=500000
collector2.channels.ch2.keep-alive=1

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
collector2.sources.avro-source1.channels = ch2
collector2.sources.avro-source1.type = avro
collector2.sources.avro-source1.bind = 0.0.0.0
collector2.sources.avro-source1.port = 41415
collector2.sources.avro-soruce1.threads = 10

# Define a hdfs sink
collector2.sinks.hdfs.channel = ch2
collector2.sinks.hdfs.type= hdfs
collector2.sinks.hdfs.hdfs.path=hdfs://namenode:8020/user/root/flume/webdata/exec/%Y/%m/%d/%H
collector2.sinks.hdfs.batchsize=50000
collector2.sinks.hdfs.runner.type=polling
collector2.sinks.hdfs.runner.polling.interval = 1
collector2.sinks.hdfs.hdfs.rollInterval = 120
collector2.sinks.hdfs.hdfs.rollSize =0
collector2.sinks.hdfs.hdfs.rollCount = 300000
collector2.sinks.hdfs.hdfs.fileType=DataStream
collector2.sinks.hdfs.hdfs.round =true
collector2.sinks.hdfs.hdfs.roundValue = 10
collector2.sinks.hdfs.hdfs.roundUnit = minute
collector2.sinks.hdfs.hdfs.threadsPoolSize = 10
collector2.sinks.hdfs.hdfs.rollTimerPoolSize = 10

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
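
One thing that may be worth checking in the configuration above is the spelling of the batch-size key. The fragment below is only a sketch, not a verified fix. It assumes the property names given in the Flume 1.x user guide: `hdfs.batchSize` for the HDFS sink and `transactionCapacity` for the memory channel. It also assumes the keys are case-sensitive, in which case `collector2.sinks.hdfs.batchsize` would not be picked up and the sink would fall back to its default batch size. The values 1000 are illustrative and have not been measured on this cluster.

```properties
# Sketch only: key names follow the Flume 1.x user guide; values are illustrative.

# Memory channel. transactionCapacity (an assumption that it is available in
# the installed version) limits how many events one transaction may hold, so
# it should be at least as large as the sink's batch size.
collector2.channels.ch2.type = memory
collector2.channels.ch2.capacity = 500000
collector2.channels.ch2.transactionCapacity = 1000

# Avro source. The posted file sets "threads" on "avro-soruce1", which does
# not match the source name "avro-source1" used on the other lines.
collector2.sources.avro-source1.threads = 10

# HDFS sink. Sink-specific keys carry the "hdfs." prefix, so the posted
# "batchsize" key is written here as "hdfs.batchSize".
collector2.sinks.hdfs.hdfs.batchSize = 1000
```

If throughput changes after this, comparing a few batch sizes (for example 100, 1000 and 10000) while keeping transactionCapacity at or above each value would show whether batching is really the bottleneck.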

Re: Re: Re: Re: HDFS SINK Performance

Posted by Brock Noland <br...@cloudera.com>.

Do you have a batch size configured for HDFSSink?

On Tue, Aug 28, 2012 at 12:42 AM, Shara Shi <sh...@dhgate.com> wrote:

> Hi Patrick,
>
> I am trying to send a data file of more than 200 MB via the Flume
> avro-client to a Flume agent with an HDFS sink.
> I think most of the events are in the (memory) channel, but flushing them
> to HDFS (disk) is very slow.
> If I use hadoop fs -put xxx xxx, the performance is fine; it takes only
> several seconds.
>
> My events are large, more than 1 KB each.
> I use flume-1.2.0 and my Hadoop cluster is CDH4.
>
> Regards
> Shara


-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Re: Re: HDFS SINK Performance

Posted by Shara Shi <sh...@dhgate.com>.

Hi Patrick,

I am trying to send a data file of more than 200 MB via the Flume avro-client
to a Flume agent with an HDFS sink.
I think most of the events are in the (memory) channel, but flushing them to
HDFS (disk) is very slow.
If I use hadoop fs -put xxx xxx, the performance is fine; it takes only
several seconds.

My events are large, more than 1 KB each.
I use flume-1.2.0 and my Hadoop cluster is CDH4.

Regards
Shara

-----Original Message-----
From: Patrick Wendell [mailto:pwendell@gmail.com]
Sent: August 28, 2012 13:11
To: user@flume.apache.org
Subject: Re: Re: Re: HDFS SINK Performance

Hey,

Can you let us know at what rate data is arriving at collector2? How many
events/second and bytes/second, roughly?

Also, why is your batch size so large? I'm not sure, but I think it may wait
until it has received batchSize events before it decides to flush them to
HDFS, so this may create strange results depending on how many
events/second you have.

- Patrick


Re: Re: Re: HDFS SINK Performance

Posted by Patrick Wendell <pw...@gmail.com>.

Hey,

Can you let us know at what rate data is arriving at collector2? How
many events/second and bytes/second, roughly?

Also, why is your batch size so large? I'm not sure, but I think it
may wait until it has received batchSize events before it decides to
flush them to HDFS, so this may create strange results depending on
how many events/second you have.

- Patrick

On Mon, Aug 27, 2012 at 9:48 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> Do you get better performance when you directly write to the cluster? Can
> you perform some tests writing to cluster directly and compare?

Re: Re: Re: HDFS SINK Performance

Posted by Shara Shi <sh...@dhgate.com>.

Hi Anchlia,

If I use hadoop fs -put xxx xxx, the performance is fine, much faster than
Flume's.

Regards

Shara

From: Mohit Anchlia [mailto:mohitanchlia@gmail.com]
Sent: August 28, 2012 12:49
To: user@flume.apache.org
Subject: Re: Re: Re: HDFS SINK Performance

 

Do you get better performance when you directly write to the cluster? Can
you perform some tests writing to cluster directly and compare?


Re: Re: Re: HDFS SINK Performance

Posted by Mohit Anchlia <mo...@gmail.com>.

Do you get better performance when you directly write to the cluster? Can
you perform some tests writing to cluster directly and compare?

On Mon, Aug 27, 2012 at 8:19 PM, Shara Shi <sh...@dhgate.com> wrote:

> Hi Denny,
>
> It is 20 MB/min; I have confirmed it.
>
> I sent data with avro-client from my local machine to the Flume agent, and I
> really got only 20 MB/min.
>
> So I am trying to find out why.
>
> Regards,
>
> Shara