Posted to hdfs-user@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2010/01/13 22:31:53 UTC

Re: Exponential performance decay when inserting large number of blocks

Hi Zlatin,

This is a very interesting test you've run, and certainly not expected
results. I know of many clusters happily chugging along with millions of
blocks, so problems at 400K are very strange. By any chance were you able to
collect profiling information from the NameNode while running this test?

That said, I hope you've set the block size to 1KB for the purpose of this
test and not because you expect to run that in production. Recommended block
sizes are at least 64MB and often 128MB or 256MB for larger clusters.
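
For illustration, the block size does not have to be changed cluster-wide for a
test like this; it can be set per file at create time. A minimal sketch using the
FileSystem API (the path, payload, and sizes below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 128L * 1024 * 1024;           // 128 MB; 1 KB only makes sense as a stress test
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out =
        fs.create(new Path("/test/bigData"), true, 64 * 1024, (short) 3, blockSize);
    out.write(new byte[1024 * 1024]);              // whatever payload the test calls for
    out.close();
  }
}

Roughly the same override should be possible from the shell, e.g.
"hadoop fs -D dfs.block.size=134217728 -put bigData /test/bigData" (property name
as in 0.20; check hdfs-default.xml for your release).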

Thanks
-Todd

On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com> wrote:

> Greetings,
>
> I am testing how HDFS scales with very large number of blocks.  I did
> the following setup:
>
> Set the default blocks size to 1KB
> Started 8 insert processes, each inserting a 16MB file
> Repeated the insert 3 times, keeping the already inserted files in HDFS
> Repeated the entire experiment on one cluster with 4 and another with 11
> identical datanodes (allocated through HOD)
>
> Results:
> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The second
> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
> cluster was marginally faster.
>
> Throughout this I was storing all available metrics.  There were no
> signs of insufficient memory on any of the nodes; CPU usage and garbage
> collections were constant throughout.  If anyone is interested I can
> provide the recorded metrics.  I've attached a chart that looks clearly
> logarithmic.
>
> Can anyone please point to what could be the bottleneck here?  I'm
> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
> blocks.
>
> <<insertion_rate_4_and_11_datanodes.JPG>>
> Best Regards,
> Zlatin Balevsky
>

Re: Exponential performance decay - mystery solved

Posted by Eli Collins <el...@cloudera.com>.
Hey Zlatin,

That makes sense. No apologies necessary, was a very useful exercise.

Thanks,
Eli


On Thu, Jan 21, 2010 at 11:41 AM,  <Zl...@barclayscapital.com> wrote:
>
>
> Alright, the problem was caused by me setting the frequency of a block
> report to 30 seconds.  The idea behind that was to create more load on
> the Namenode, but I didn't notice that those block reports were taking
> increasing amounts of time to generate.  During that time, a lock was
> held which I'm guessing didn't allow the reporting datanode to perform
> its functions.
>
> On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
> every datanode was unavailable for 7 out of every 30 seconds.  Changing
> the interval to a more reasonable value restored the insertion speed to
> linear.
>
> Apologies for creating this confusion, nevertheless it was a useful
> thing to learn.
>
> Regards,
> Zlatin
>
> -----Original Message-----
> From: Eli Collins [mailto:eli@cloudera.com]
> Sent: Thursday, January 21, 2010 2:02 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: Exponential performance decay - possible lead
>
>>
>> The messages are of the following:
>>
>> 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange:
>> BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request
>> received for blk_-5804440919363539694_1026 on ip.removed:port.removed
>> size 1024
>
> This is odd; you shouldn't be getting this warning, and I don't see it when
> running your benchmark on my cluster. Are there other relevant warnings or
> errors in the NN or DN logs?
>
> Thanks,
> Eli

RE: Exponential performance decay - mystery solved

Posted by Zl...@barclayscapital.com.
Happy to report this doesn't happen with 0.21, even with a block report interval of 30 seconds.

Zlatin

________________________________
From: Raghu Angadi [mailto:rangadi@apache.org]
Sent: Thursday, January 21, 2010 7:19 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay - mystery solved


http://issues.apache.org/jira/browse/HADOOP-4584 is supposed to fix this exact problem with the block reports. Were you running 0.21 or 0.20?

Raghu.

On Thu, Jan 21, 2010 at 11:41 AM, <Zl...@barclayscapital.com> wrote:


Alright, the problem was caused by me setting the frequency of a block
report to 30 seconds.  The idea behind that was to create more load on
the Namenode, but I didn't notice that those block reports were taking
increasing amounts of time to generate.  During that time, a lock was
held which I'm guessing didn't allow the reporting datanode to perform
its functions.

On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
every datanode was unavailable for 7 out of every 30 seconds.  Changing
the interval to a more reasonable value restored the insertion speed to
linear.

Apologies for creating this confusion, nevertheless it was a useful
thing to learn.

Regards,
Zlatin

-----Original Message-----
From: Eli Collins [mailto:eli@cloudera.com]
Sent: Thursday, January 21, 2010 2:02 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay - possible lead

>
> The messages are of the following:
>
> 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange:
> BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request
> received for blk_-5804440919363539694_1026 on ip.removed:port.removed
> size 1024

This is odd; you shouldn't be getting this warning, and I don't see it when
running your benchmark on my cluster. Are there other relevant warnings or
errors in the NN or DN logs?

Thanks,
Eli


Re: Exponential performance decay - mystery solved

Posted by Raghu Angadi <ra...@apache.org>.
http://issues.apache.org/jira/browse/HADOOP-4584 is supposed to fix this
exact problem with the block reports. Were you running 0.21 or 0.20?

Raghu.

On Thu, Jan 21, 2010 at 11:41 AM, <Zl...@barclayscapital.com> wrote:

>
>
> Alright, the problem was caused by me setting the frequency of a block
> report to 30 seconds.  The idea behind that was to create more load on
> the Namenode, but I didn't notice that those block reports were taking
> increasing amounts of time to generate.  During that time, a lock was
> held which I'm guessing didn't allow the reporting datanode to perform
> its functions.
>
> On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
> every datanode was unavailable for 7 out of every 30 seconds.  Changing
> the interval to a more reasonable value restored the insertion speed to
> linear.
>
> Apologies for creating this confusion, nevertheless it was a useful
> thing to learn.
>
> Regards,
> Zlatin
>
> -----Original Message-----
> From: Eli Collins [mailto:eli@cloudera.com]
> Sent: Thursday, January 21, 2010 2:02 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: Exponential performance decay - possible lead
>
> >
> > The messages are of the following:
> >
> > 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request
> > received for blk_-5804440919363539694_1026 on ip.removed:port.removed
> > size 1024
>
> This is odd; you shouldn't be getting this warning, and I don't see it when
> running your benchmark on my cluster. Are there other relevant warnings or
> errors in the NN or DN logs?
>
> Thanks,
> Eli

RE: Exponential performance decay - mystery solved

Posted by Zl...@barclayscapital.com.
My 2c: if it is not possible to move the i/o operations listFiles() and
length() outside the lock on FSVolumeSet, maybe set a flag that a block
report is in progress so that the rest of the datanode doesn't just
hang. 
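
A rough sketch of that idea (hypothetical class and method names, not the actual
DataNode code) could look like the following: do the listFiles()/length() style
I/O without holding the volume lock, take the lock only to hand over the
finished snapshot, and expose a flag so the rest of the datanode can tell that a
report is in progress.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

class BlockReportSketch {
  private final Object volumeLock = new Object();     // stand-in for the FSVolumeSet lock
  private volatile boolean reportInProgress = false;  // the "flag" idea
  private List<long[]> lastReport = new ArrayList<long[]>();

  List<long[]> buildReport(File blockDir) {
    reportInProgress = true;
    try {
      List<long[]> snapshot = new ArrayList<long[]>();
      File[] files = blockDir.listFiles();             // expensive I/O, done outside the lock
      if (files != null) {
        for (File f : files) {
          snapshot.add(new long[] { f.getName().hashCode(), f.length() });
        }
      }
      synchronized (volumeLock) {                      // lock held only for the cheap hand-off
        lastReport = snapshot;
      }
      return snapshot;
    } finally {
      reportInProgress = false;
    }
  }

  boolean isReportInProgress() {
    return reportInProgress;
  }
}

The hand-off under the lock is O(1), so other datanode work would only block for
that swap rather than for the whole directory scan.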
 
Thanks,
Zlatin

________________________________

From: Dhruba Borthakur [mailto:dhruba@gmail.com] 
Sent: Thursday, January 21, 2010 3:38 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay - mystery solved


Some of this delay in generating block reports might be mitigated via
http://issues.apache.org/jira/browse/HDFS-854 

thanks,
dhruba


On Thu, Jan 21, 2010 at 11:41 AM, <Zl...@barclayscapital.com> wrote:


	Alright, the problem was caused by me setting the frequency of a block
	report to 30 seconds.  The idea behind that was to create more load on
	the Namenode, but I didn't notice that those block reports were taking
	increasing amounts of time to generate.  During that time, a lock was
	held which I'm guessing didn't allow the reporting datanode to perform
	its functions.

	On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
	every datanode was unavailable for 7 out of every 30 seconds.  Changing
	the interval to a more reasonable value restored the insertion speed to
	linear.

	Apologies for creating this confusion, nevertheless it was a useful
	thing to learn.

	Regards,
	Zlatin

	-----Original Message-----
	From: Eli Collins [mailto:eli@cloudera.com]
	Sent: Thursday, January 21, 2010 2:02 PM
	To: hdfs-user@hadoop.apache.org
	Subject: Re: Exponential performance decay - possible lead

	>
	> The messages are of the following form:
	>
	> 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange:
	> BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request
	> received for blk_-5804440919363539694_1026 on ip.removed:port.removed
	> size 1024

	This is odd; you shouldn't be getting this warning, and I don't see it
	when running your benchmark on my cluster. Are there other relevant
	warnings or errors in the NN or DN logs?

	Thanks,
	Eli




-- 
Connect to me at http://www.facebook.com/dhruba



Re: Exponential performance decay - mystery solved

Posted by Dhruba Borthakur <dh...@gmail.com>.
Some of this delay in generating block reports might be mitigated via
http://issues.apache.org/jira/browse/HDFS-854

thanks,
dhruba


On Thu, Jan 21, 2010 at 11:41 AM, <Zl...@barclayscapital.com> wrote:

>
>
> Alright, the problem was caused by me setting the frequency of a block
> report to 30 seconds.  The idea behind that was to create more load on
> the Namenode, but I didn't notice that those block reports were taking
> increasing amounts of time to generate.  During that time, a lock was
> held which I'm guessing didn't allow the reporting datanode to perform
> its functions.
>
> On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
> every datanode was unavailable for 7 out of every 30 seconds.  Changing
> the interval to a more reasonable value restored the insertion speed to
> linear.
>
> Apologies for creating this confusion, nevertheless it was a useful
> thing to learn.
>
> Regards,
> Zlatin
>
> -----Original Message-----
> From: Eli Collins [mailto:eli@cloudera.com]
> Sent: Thursday, January 21, 2010 2:02 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: Exponential performance decay - possible lead
>
> >
> > The messages are of the following:
> >
> > 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange:
> > BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request
> > received for blk_-5804440919363539694_1026 on ip.removed:port.removed
> > size 1024
>
> This is odd; you shouldn't be getting this warning, and I don't see it when
> running your benchmark on my cluster. Are there other relevant warnings or
> errors in the NN or DN logs?
>
> Thanks,
> Eli



-- 
Connect to me at http://www.facebook.com/dhruba

RE: Exponential performance decay - mystery solved

Posted by Zl...@barclayscapital.com.

Alright, the problem was caused by me setting the frequency of a block
report to 30 seconds.  The idea behind that was to create more load on
the Namenode, but I didn't notice that those block reports were taking
increasing amounts of time to generate.  During that time, a lock was
held which I'm guessing didn't allow the reporting datanode to perform
its functions.

On my hardware, with 100,000 blocks the report takes over 7 seconds.  So
every datanode was unavailable for 7 out of every 30 seconds.  Changing
the interval to a more reasonable value restored the insertion speed to
linear.
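
For reference, the interval in question is the datanode setting
dfs.blockreport.intervalMsec (milliseconds) in hdfs-site.xml; the stock default
in 0.20 is one hour (check hdfs-default.xml for your release). A minimal
excerpt, with the value shown purely as an example:

<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>3600000</value>
</property>

The test above had effectively set this to 30000, which is what produced the
7-seconds-out-of-every-30 stalls.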

Apologies for creating this confusion, nevertheless it was a useful
thing to learn.

Regards,
Zlatin

-----Original Message-----
From: Eli Collins [mailto:eli@cloudera.com] 
Sent: Thursday, January 21, 2010 2:02 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay - possible lead

>
> The messages are of the following:
>
> 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange: 
> BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request 
> received for blk_-5804440919363539694_1026 on ip.removed:port.removed 
> size 1024

This is odd; you shouldn't be getting this warning, and I don't see it when
running your benchmark on my cluster. Are there other relevant warnings or
errors in the NN or DN logs?

Thanks,
Eli

Re: Exponential performance decay - possible lead

Posted by Eli Collins <el...@cloudera.com>.
>
> The messages are of the following:
>
> 2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_-5804440919363539694_1026 on ip.removed:port.removed size 1024

This is odd; you shouldn't be getting this warning, and I don't see it when
running your benchmark on my cluster. Are there other relevant warnings or
errors in the NN or DN logs?

Thanks,
Eli

RE: Exponential performance decay - possible lead

Posted by Zl...@barclayscapital.com.
Hi everyone,

I think I have a lead on this issue.  I was watching the namenode metrics in the ganglia interface and noticed a discrepancy between the rate of log messages at INFO level and those at WARN level.  I'm attaching the two graphs.

The graphs for log messages at INFO level at the Namenode and the Datanodes have the same logarithmic shape, just like the graphs for the number of blocks inserted, block insertion ops, etc.  However, the graph for messages at WARN level at the Namenode has a linear shape.  So even though the insertion rate is slowing down, the rate of these warning messages is constant.

The messages are of the following form:

2010-01-18 14:51:25,694 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_-5804440919363539694_1026 on ip.removed:port.removed size 1024

Let me know if I can help any further.
Zlatin Balevsky

-----Original Message-----
From: Balevsky, Zlatin: IT (NYK) 
Sent: Friday, January 15, 2010 8:26 AM
To: hdfs-user@hadoop.apache.org
Subject: RE: Exponential performance decay when inserting large number of blocks

In the first few tests I created a total of 24 files.  In the test with the last graph, 88 files before the break and 16 after.  The files were created with:

#!/bin/bash
# Kick off 8 parallel "hadoop fs -put" uploads of the same local file under
# distinct HDFS names; $1 is the namenode URI, $2 the run number.
HDFS_URI=$1
INSERT_NO=$2
for i in $(seq 1 8)
do
 path/to/hadoop/bin/hadoop fs -put bigData hdfs://$HDFS_URI/bigData$INSERT_NO-$i > put$i.log 2>&1 &
done

And I invoke it manually, increasing $2 each run.  
-----Original Message-----
From: Eli Collins [mailto:eli@cloudera.com]
Sent: Thursday, January 14, 2010 9:13 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks

On Thu, Jan 14, 2010 at 3:53 PM,  <Zl...@barclayscapital.com> wrote:
> And the results after several hour wait are the same.  The cluster was 
> absolutely idle between the insertions.  Attached is a graph.

How many files are you creating?  Can you post the script you're using to drive the test?

Thanks,
Eli

RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
In the first few tests I created a total of 24 files.  In the test with the last graph, 88 files before the break and 16 after.  The files were created with:

#!/bin/bash
# Kick off 8 parallel "hadoop fs -put" uploads of the same local file under
# distinct HDFS names; $1 is the namenode URI, $2 the run number.
HDFS_URI=$1
INSERT_NO=$2
for i in $(seq 1 8)
do
 path/to/hadoop/bin/hadoop fs -put bigData hdfs://$HDFS_URI/bigData$INSERT_NO-$i > put$i.log 2>&1 &
done

And I invoke it manually, increasing $2 each run.  
-----Original Message-----
From: Eli Collins [mailto:eli@cloudera.com] 
Sent: Thursday, January 14, 2010 9:13 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks

On Thu, Jan 14, 2010 at 3:53 PM,  <Zl...@barclayscapital.com> wrote:
> And the results after several hour wait are the same.  The cluster was 
> absolutely idle between the insertions.  Attached is a graph.

How many files are you creating?  Can you post the script you're using to drive the test?

Thanks,
Eli

Re: Exponential performance decay when inserting large number of blocks

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Jan 14, 2010 at 3:53 PM,  <Zl...@barclayscapital.com> wrote:
> And the results after several hour wait are the same.  The cluster was
> absolutely idle between the insertions.  Attached is a graph.

How many files are you creating?  Can you post the script you're using
to drive the test?

Thanks,
Eli

RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
And the results after several hour wait are the same.  The cluster was
absolutely idle between the insertions.  Attached is a graph.
 
Zlatin

________________________________

From: Balevsky, Zlatin: IT (NYK) 
Sent: Thursday, January 14, 2010 12:59 PM
To: hdfs-user@hadoop.apache.org
Subject: RE: Exponential performance decay when inserting large number of blocks


Alex,
 
that would have to be some very low-level issue like the NIC not
properly cleaning up buffers or something at the switch level.  If that
were the case it would have become visible between tests of different
clusters.  It is not disk access issue as I'm currently inserting the
same 512MB file with 128kb block size, and it is not insertion client
issue as that gets restarted often.
 
Zlatin

________________________________

From: alex kamil [mailto:alex.kamil@gmail.com] 
Sent: Thursday, January 14, 2010 12:52 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks


Todd, but the "exponential decay" in insert rate may be caused by a
problem on the source as well as a problem on the "target". I think it
is worth exploring the source option as well.


On Thu, Jan 14, 2010 at 12:47 PM, Todd Lipcon <to...@cloudera.com> wrote:


	On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <alex.kamil@gmail.com> wrote:

		>I'm doing the insert from a node on the same rack as the cluster but it is not part of it.

		So it looks like you are copying from a single node, I'd try to run the
		inserts from multiple nodes in parallel, to avoid IO and/or CPU and/or
		network bottleneck on the "source" node. Try to upload from multiple
		locations.

	You're still missing the point here, Alex: it's not about the performance,
	it's about the scaling curve.

	-Todd

		On Thu, Jan 14, 2010 at 12:13 PM, <Zlatin.Balevsky@barclayscapital.com> wrote:

			More general info:

			I'm doing the insert from a node on the same rack as the cluster but
			it is not part of it.  Data is being read from a local disk and the
			datanodes store to the local partitions as well.  The filesystem is
			ext3, but if it were an inode issue the 4-node cluster would perform
			much worse than the 11-node.  No MR jobs or any other activity is
			present - these are test clusters that I create and remove with HOD.
			I'm using revision 897952 (the hdfs ui reports 897347 for some
			reason), checked out the branch-0.20 a few days ago.

			Todd:

			I will repeat the test, waiting several hours after the first round
			of inserts.  Unless the balancer daemon starts by default, I have not
			started it.  The datablocks seemed uniformly spread amongst the
			datanodes.  I've added two additional metrics to be recorded by the
			datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount().
			These are polled every 10 seconds.  If anyone wants me to add
			additional metrics at any component let me know.

			> The test case is making files with a ton of blocks. Appending a
			> block to an end of a file might be O(n) - usually this isn't a
			> problem since even large files are almost always <100 blocks, and
			> the majority <10.  In the test, there are files with 50,000+
			> blocks, so O(n) runtime anywhere in the blocklist for a file is
			> pretty bad.

			The files in the first test are 16k blocks each.  I am inserting the
			same file under different filenames in consecutive runs.  If that
			were the reason, the first insert should take the same amount of time
			as the last.  Nevertheless, I will run the next test with 4k blocks
			per file and increase the number of consecutive insertions.

			Dhruba:
			Unless there is a very high number of collisions, the hashmap should
			perform in constant time.  Even if there were collisions, I would be
			seeing much higher CPU usage on the NameNode.  According to the
			metrics I've already sent, towards the end of the test the capacity
			of the BlockMap was 512k and the load approaching 0.66.

			Best Regards,
			Zlatin Balevsky

			P.S. I could not find contact info for the HOD developers.  I'd like
			to ask them to document the "walltime" and "idleness-limit"
			parameters!

________________________________

			From: Dhruba Borthakur [mailto:dhruba@gmail.com]
			Sent: Thursday, January 14, 2010 9:04 AM
			To: hdfs-user@hadoop.apache.org
			Subject: Re: Exponential performance decay when inserting large number of blocks

			Here is another thing that came to my mind.

			The Namenode has a hash map in memory where it inserts all blocks.
			When a new block needs to be allocated, the namenode first generates
			a random number and checks to see if it exists in the hashmap. If it
			does not exist in the hash map, then that number is the block id of
			the to-be-allocated block. The namenode then inserts this number into
			the hash map and sends it to the client. The client receives it as
			the blockid and uses it to write data to the datanode(s).

			One possibility is that the time to do a hash-lookup varies depending
			on the number of blocks in the hash.

			dhruba

			On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.kamil@gmail.com> wrote:

				>launched 8 instances of the bin/hadoop fs -put utility
				Zlatin, may be a silly question, are you running dfs -put locally on
				each datanode, or from a single box?
				Also, where are you copying the data from? Do you have local copies
				on each node before the insert, or do all your files reside on a
				single server, or maybe on NFS?
				i would also check the network stats on datanodes and namenode and
				see if the nics are not saturated, i guess you have enough bandwidth
				but maybe there is some issue with the NIC on the namenode or
				something, i saw strange things happening. you can probably monitor
				the number of connections/sockets, bandwidth, IO waits, # of threads
				if you are writing to dfs from a single location maybe there is a
				problem on a single node to handle all this outbound traffic, if you
				are distributing files in parallel from multiple nodes, then maybe
				there is an inbound congestion on namenode or something like that

				if it's not the case, i'd explore using the distcp utility for
				copying data in parallel (it comes with the distro)
				also if you really hit a wall, and have some time, i'd take a look
				at alternatives to the FileSystem API, maybe something like Fuse-DFS
				and other packages supported by libhdfs
				(http://wiki.apache.org/hadoop/LibHDFS)

				On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:

				Err, ignore that attachment - attached the wrong graph with the
				right labels!

				Here's the right graph.

				-Todd

				On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:

				On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:

				On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
				> Alex, Dhruba
				>
				> I repeated the experiment increasing the block size to 32k.  Still
				> doing 8 inserts in parallel, file size now is 512 MB; 11 datanodes.
				> I was also running iostat on one of the datanodes.  Did not notice
				> anything that would explain an exponential slowdown.  There was
				> more activity while the inserts were active but far from the limits
				> of the disk system.

				While creating many blocks, could it be that the replication pipe
				lining is eating up the available handler threads on the data nodes?
				By increasing the block size you would see better performance
				because the system spends more time writing data to local disk and
				less time dealing with things like replication "overhead." At a
				small block size, I could imagine you're artificially creating a
				situation where you saturate the default size configured thread
				pools or something weird like that.

				If you're doing 8 inserts in parallel from one machine with 11 nodes
				this seems unlikely, but it might be worth looking into. The
				question is if testing with an artificially small block size like
				this is even a viable test. At some point the overhead of talking to
				the name node, selecting data nodes for a block, and setting up
				replication pipe lines could become some abnormally high percentage
				of the run time.

				The concern isn't why the insertion is slow, but rather why the
				scaling curve looks the way it does. Looking at the data, it looks
				like the insertion rate (blocks per second) is actually related as
				1/n where N is the number of blocks. Attaching another graph of the
				same data which I think is a little clearer to read.

				Also, I wonder if the cluster is trying to rebalance blocks toward
				the end of your runtime (if the balancer daemon is running) and this
				is causing additional shuffling of data.

				That's certainly one possibility.

				Zlatin: here's a test to try: after the FS is full with 400,000
				blocks, let the cluster sit for a few hours, then come back and
				start another insertion. Is the rate slow, or does it return to the
				fast starting speed?

				-Todd

			-- 
			Connect to me at http://www.facebook.com/dhruba
			







RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
Alex,
 
that would have to be some very low-level issue like the NIC not
properly cleaning up buffers or something at the switch level.  If that
were the case it would have become visible between tests of different
clusters.  It is not disk access issue as I'm currently inserting the
same 512MB file with 128kb block size, and it is not insertion client
issue as that gets restarted often.
 
Zlatin

________________________________

From: alex kamil [mailto:alex.kamil@gmail.com] 
Sent: Thursday, January 14, 2010 12:52 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number
of blocks


Todd, but the "exponential decay" in insert rate may be caused by a
problem on the source as well as a problem on the "target". I think it
is worth exploring the source option as well.


On Thu, Jan 14, 2010 at 12:47 PM, Todd Lipcon <to...@cloudera.com> wrote:


	On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <al...@gmail.com> wrote:

		>I'm doing the insert from a node on the same rack as the cluster but it is not part of it.

		So it looks like you are copying from a single node, I'd try to run the
		inserts from multiple nodes in parallel, to avoid IO and/or CPU and/or
		network bottleneck on the "source" node. Try to upload from multiple
		locations.

	You're still missing the point here, Alex: it's not about the performance,
	it's about the scaling curve.

	-Todd

		On Thu, Jan 14, 2010 at 12:13 PM, <Zl...@barclayscapital.com> wrote:

			More general info:

			I'm doing the insert from a node on the same rack as the cluster but
			it is not part of it.  Data is being read from a local disk and the
			datanodes store to the local partitions as well.  The filesystem is
			ext3, but if it were an inode issue the 4-node cluster would perform
			much worse than the 11-node.  No MR jobs or any other activity is
			present - these are test clusters that I create and remove with HOD.
			I'm using revision 897952 (the hdfs ui reports 897347 for some
			reason), checked out the branch-0.20 a few days ago.

			Todd:

			I will repeat the test, waiting several hours after the first round
			of inserts.  Unless the balancer daemon starts by default, I have not
			started it.  The datablocks seemed uniformly spread amongst the
			datanodes.  I've added two additional metrics to be recorded by the
			datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount().
			These are polled every 10 seconds.  If anyone wants me to add
			additional metrics at any component let me know.

			> The test case is making files with a ton of blocks. Appending a
			> block to an end of a file might be O(n) - usually this isn't a
			> problem since even large files are almost always <100 blocks, and
			> the majority <10.  In the test, there are files with 50,000+
			> blocks, so O(n) runtime anywhere in the blocklist for a file is
			> pretty bad.

			The files in the first test are 16k blocks each.  I am inserting the
			same file under different filenames in consecutive runs.  If that
			were the reason, the first insert should take the same amount of time
			as the last.  Nevertheless, I will run the next test with 4k blocks
			per file and increase the number of consecutive insertions.

			Dhruba:
			Unless there is a very high number of collisions, the hashmap should
			perform in constant time.  Even if there were collisions, I would be
			seeing much higher CPU usage on the NameNode.  According to the
			metrics I've already sent, towards the end of the test the capacity
			of the BlockMap was 512k and the load approaching 0.66.

			Best Regards,
			Zlatin Balevsky

			P.S. I could not find contact info for the HOD developers.  I'd like
			to ask them to document the "walltime" and "idleness-limit"
			parameters!

________________________________

			From: Dhruba Borthakur [mailto:dhruba@gmail.com]
			Sent: Thursday, January 14, 2010 9:04 AM
			To: hdfs-user@hadoop.apache.org
			Subject: Re: Exponential performance decay when inserting large number of blocks

			Here is another thing that came to my mind.

			The Namenode has a hash map in memory where it inserts all blocks.
			When a new block needs to be allocated, the namenode first generates
			a random number and checks to see if it exists in the hashmap. If it
			does not exist in the hash map, then that number is the block id of
			the to-be-allocated block. The namenode then inserts this number into
			the hash map and sends it to the client. The client receives it as
			the blockid and uses it to write data to the datanode(s).

			One possibility is that the time to do a hash-lookup varies depending
			on the number of blocks in the hash.

			dhruba

			On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com> wrote:

				>launched 8 instances of the bin/hadoop fs -put utility
				Zlatin, may be a silly question, are you running dfs -put locally on
				each datanode, or from a single box?
				Also, where are you copying the data from? Do you have local copies
				on each node before the insert, or do all your files reside on a
				single server, or maybe on NFS?
				i would also check the network stats on datanodes and namenode and
				see if the nics are not saturated, i guess you have enough bandwidth
				but maybe there is some issue with the NIC on the namenode or
				something, i saw strange things happening. you can probably monitor
				the number of connections/sockets, bandwidth, IO waits, # of threads
				if you are writing to dfs from a single location maybe there is a
				problem on a single node to handle all this outbound traffic, if you
				are distributing files in parallel from multiple nodes, then maybe
				there is an inbound congestion on namenode or something like that

				if it's not the case, i'd explore using the distcp utility for
				copying data in parallel (it comes with the distro)
				also if you really hit a wall, and have some time, i'd take a look
				at alternatives to the FileSystem API, maybe something like Fuse-DFS
				and other packages supported by libhdfs
				(http://wiki.apache.org/hadoop/LibHDFS)

				On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:

				Err, ignore that attachment - attached the wrong graph with the
				right labels!

				Here's the right graph.

				-Todd

				On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:

				On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:

				On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
				> Alex, Dhruba
				>
				> I repeated the experiment increasing the block size to 32k.  Still
				> doing 8 inserts in parallel, file size now is 512 MB; 11 datanodes.
				> I was also running iostat on one of the datanodes.  Did not notice
				> anything that would explain an exponential slowdown.  There was
				> more activity while the inserts were active but far from the limits
				> of the disk system.

				While creating many blocks, could it be that the replication pipe
				lining is eating up the available handler threads on the data nodes?
				By increasing the block size you would see better performance
				because the system spends more time writing data to local disk and
				less time dealing with things like replication "overhead." At a
				small block size, I could imagine you're artificially creating a
				situation where you saturate the default size configured thread
				pools or something weird like that.

				If you're doing 8 inserts in parallel from one machine with 11 nodes
				this seems unlikely, but it might be worth looking into. The
				question is if testing with an artificially small block size like
				this is even a viable test. At some point the overhead of talking to
				the name node, selecting data nodes for a block, and setting up
				replication pipe lines could become some abnormally high percentage
				of the run time.

				The concern isn't why the insertion is slow, but rather why the
				scaling curve looks the way it does. Looking at the data, it looks
				like the insertion rate (blocks per second) is actually related as
				1/n where N is the number of blocks. Attaching another graph of the
				same data which I think is a little clearer to read.

				Also, I wonder if the cluster is trying to rebalance blocks toward
				the end of your runtime (if the balancer daemon is running) and this
				is causing additional shuffling of data.

				That's certainly one possibility.

				Zlatin: here's a test to try: after the FS is full with 400,000
				blocks, let the cluster sit for a few hours, then come back and
				start another insertion. Is the rate slow, or does it return to the
				fast starting speed?

				-Todd

			-- 
			Connect to me at http://www.facebook.com/dhruba
			






Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
Todd, but the "exponential decay" in insert rate may be caused by a problem
on the source as well as a problem on the "target". I think it is worth
exploring the source option as well.

On Thu, Jan 14, 2010 at 12:47 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <al...@gmail.com> wrote:
>
>> >I'm doing the insert from a node on the same rack as the cluster but it
>> is not part of it.
>> So it looks like you are copying from a single node, I'd try to run the
>> inserts from multiple nodes in parallel, to avoid IO and/or CPU and/or
>> network bottleneck on the "source" node. Try to upload from multiple
>> locations.
>>
>>
> You're still missing the point here, Alex: it's not about the performance,
> it's about the scaling curve.
>
> -Todd
>
>
>>
>> On Thu, Jan 14, 2010 at 12:13 PM, <Zl...@barclayscapital.com> wrote:
>>
>>>   More general info:
>>>
>>> I'm doing the insert from a node on the same rack as the cluster but it
>>> is not part of it.  Data is being read from a local disk and the datanodes
>>> store to the local partitions as well.  The filesystem is ext3, but if it
>>> were an inode issue the 4-node cluster would perform much worse than the
>>> 11-node.  No MR jobs or any other activity is present - these are test
>>> clusters that I create and remove with HOD.  I'm using revision 897952 (the
>>> hdfs ui reports 897347 for some reason) , checked out the branch-0.20 a few
>>> days ago.
>>>
>>> Todd:
>>>
>>> I will repeat the test, waiting several hours after the first round of
>>> inserts.  Unless the balancer daemon starts by default, I have not started
>>> it.  The datablocks seemed uniformly spread amongst the datanodes.  I've
>>> added two additional metrics to be recorded by the datanode -
>>> DataNode.xmitsInProgress and DataNode.getXCeiverCount().  These are polled
>>> every 10 seconds.  If anyone wants me to add additional metrics at any
>>> component let me know.
>>>
>>>  > The test case is making files with a ton of blocks. Appending a block
>>> to an end of a file might be O(n) -
>>> > usually this isn't a problem since even large files are almost always
>>> <100 blocks, and the majority <10.
>>> > In the test, there are files with 50,000+ blocks, so O(n) runtime
>>> anywhere in the blocklist for a file is pretty bad.
>>>
>>>  The files in the first test are 16k blocks each.  I am inserting the
>>> same file under different filenames in consecutive runs.  If that were the
>>> reason, the first insert should take the same amount of time as the last.
>>> Nevertheless, I will run the next test with 4k blocks per file and increase
>>> the number of consecutive insertions.
>>>
>>> Dhruba:
>>> Unless there is a very high number of collisions, the hashmap should
>>> perform in constant time.  Even if there were collisions, I would be seeing
>>> much higher CPU usage on the NameNode.  According to the metrics I've
>>> already sent, towards the end of the test the capacity of the BlockMap
>>> was 512k and the load approaching 0.66.
>>>
>>> Best Regards,
>>> Zlatin Balevsky
>>>
>>> P.S. I could not find contact info for the HOD developers.  I'd like to
>>> ask them to document the "walltime" and "idleness-limit" parameters!
>>>
>>>  ------------------------------
>>> *From:* Dhruba Borthakur [mailto:dhruba@gmail.com]
>>> *Sent:* Thursday, January 14, 2010 9:04 AM
>>>
>>> *To:* hdfs-user@hadoop.apache.org
>>> *Subject:* Re: Exponential performance decay when inserting large number
>>> of blocks
>>>
>>> Here is another thing that came to my mind.
>>>
>>> The Namenode has a hash map in memory where it inserts all blocks. when a
>>> new block needs to be allocated, the namenode first generates a random
>>> number and checks to see if ti exists in the hashmap. If it does not exist
>>> in the hash map, then that number is the block id of the to-be-allocated
>>> block. The namenode then inserts this number into the hash map and sends it
>>> to te client. The client receives it as the blockid and uses it to write
>>> data to the datanode(s).
>>>
>>> One possibility is that that the time to do a hash-lookup varies
>>> depending on the number of blocks in the hash.
>>>
>>> dhruba
>>>
>>>
>>>
>>>
>>> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com>wrote:
>>>
>>>> >launched 8 instances of the bin/hadoop fs -put utility
>>>> Zlatin, may be a silly question, are you running dfs -put locally on
>>>> each datanode,  or from a single box
>>>> Also where are you copying the data from, do you have local copies on
>>>> each node before the insert or all your files reside on a single server, or
>>>> may be on NFS?
>>>> i would also chk the network stats on datanodes and namenode and see if
>>>> the nics are not saturated, i guess you have enough bandwidth but may be
>>>> there is some issue with NIC on the namenode or something, i saw strange
>>>> things happening. you can probably monitor the number of conections/sockets,
>>>> bandwidth, IO waits, # of threads
>>>> if you are writing to dfs from a single location may be there is a
>>>> problem on a single node to handle all this outbound traffic, if you are
>>>> distributing files in parallel from multiple nodes, than mat be there is an
>>>> inbound congestion on namenode or something like that
>>>>
>>>> if its not the case, i'd explore using distcp utility for copying data
>>>> in parallel  (it comes with the distro)
>>>> also if you really hit a wall, and have some time, i'd take look at
>>>> alternatives to Filesystem API, may be simething like Fuse-DFS and other
>>>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)
>>>>
>>>>
>>>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com>wrote:
>>>>
>>>>> Err, ignore that attachment - attached the wrong graph with the right
>>>>> labels!
>>>>>
>>>>> Here's the right graph.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com>wrote:
>>>>>
>>>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net>wrote:
>>>>>>
>>>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>>>> > Alex, Dhruba
>>>>>>> >
>>>>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>>>>> doing
>>>>>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I
>>>>>>> was
>>>>>>> > also running iostat on one of the datanodes.  Did not notice
>>>>>>> anything
>>>>>>> > that would explain an exponential slowdown.  There was more
>>>>>>> activity
>>>>>>> > while the inserts were active but far from the limits of the disk
>>>>>>> system.
>>>>>>>
>>>>>>> While creating many blocks, could it be that the replication pipe
>>>>>>> lining
>>>>>>> is eating up the available handler threads on the data nodes? By
>>>>>>> increasing the block size you would see better performance because
>>>>>>> the
>>>>>>> system spends more time writing data to local disk and less time
>>>>>>> dealing
>>>>>>> with things like replication "overhead." At a small block size, I
>>>>>>> could
>>>>>>> imagine you're artificially creating a situation where you saturate
>>>>>>> the
>>>>>>> default size configured thread pools or something weird like that.
>>>>>>>
>>>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>>>> is
>>>>>>> if testing with an artificially small block size like this is even a
>>>>>>> viable test. At some point the overhead of talking to the name node,
>>>>>>> selecting data nodes for a block, and setting up replication pipe
>>>>>>> lines
>>>>>>> could become some abnormally high percentage of the run time.
>>>>>>>
>>>>>>>
>>>>>> The concern isn't why the insertion is slow, but rather why the
>>>>>> scaling curve looks the way it does. Looking at the data, it looks like the
>>>>>> insertion rate (blocks per second) is actually related as 1/n where N is the
>>>>>> number of blocks. Attaching another graph of the same data which I think is
>>>>>> a little clearer to read.
>>>>>>
>>>>>>
>>>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward
>>>>>>> the
>>>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>>>> causing additional shuffling of data.
>>>>>>>
>>>>>>
>>>>>> That's certainly one possibility.
>>>>>>
>>>>>> Zlatin: here's a test to try: after the FS is full with 400,000
>>>>>> blocks, let the cluster sit for a few hours, then come back and start
>>>>>> another insertion. Is the rate slow, or does it return to the fast starting
>>>>>> speed?
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Connect to me at http://www.facebook.com/dhruba
>>>
>>
>>
>

Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <al...@gmail.com> wrote:

> >I'm doing the insert from a node on the same rack as the cluster but it
> is not part of it.
> So it looks like you are copying from a single node, I'd try to run the
> inserts from multiple nodes in parallel, to avoid IO and/or CPU and/or
> network bottleneck on the "source" node. Try to upload from multiple
> locations.
>
>
You're still missing the point here, Alex: it's not about the performance,
it's about the scaling curve.

-Todd


>
> On Thu, Jan 14, 2010 at 12:13 PM, <Zl...@barclayscapital.com>wrote:
>
>>   More general info:
>>
>> I'm doing the insert from a node on the same rack as the cluster but it is
>> not part of it.  Data is being read from a local disk and the datanodes
>> store to the local partitions as well.  The filesystem is ext3, but if it
>> were an inode issue the 4-node cluster would perform much worse than the
>> 11-node.  No MR jobs or any other activity is present - these are test
>> clusters that I create and remove with HOD.  I'm using revision 897952 (the
>> hdfs ui reports 897347 for some reason) , checked out the branch-0.20 a few
>> days ago.
>>
>> Todd:
>>
>> I will repeat the test, waiting several hours after the first round of
>> inserts.  Unless the balancer daemon starts by default, I have not started
>> it.  The datablocks seemed uniformly spread amongst the datanodes.  I've
>> added two additional metrics to be recorded by the datanode -
>> DataNode.xmitsInProgress and DataNode.getXCeiverCount().  These are polled
>> every 10 seconds.  If anyone wants me to add additional metrics at any
>> component let me know.
>>
>>  > The test case is making files with a ton of blocks. Appending a block
>> to an end of a file might be O(n) -
>> > usually this isn't a problem since even large files are almost always
>> <100 blocks, and the majority <10.
>> > In the test, there are files with 50,000+ blocks, so O(n) runtime
>> anywhere in the blocklist for a file is pretty bad.
>>
>>  The files in the first test are 16k blocks each.  I am inserting the
>> same file under different filenames in consecutive runs.  If that were the
>> reason, the first insert should take the same amount of time as the last.
>> Nevertheless, I will run the next test with 4k blocks per file and increase
>> the number of consecutive insertions.
>>
>> Dhruba:
>> Unless there is a very high number of collisions, the hashmap should
>> perform in constant time.  Even if there were collisions, I would be seeing
>> much higher CPU usage on the NameNode.  According to the metrics I've
>> already sent, towards the end of the test the capacity of the BlockMap
>> was 512k and the load approaching 0.66.
>>
>> Best Regards,
>> Zlatin Balevsky
>>
>> P.S. I could not find contact info for the HOD developers.  I'd like to
>> ask them to document the "walltime" and "idleness-limit" parameters!
>>
>>  ------------------------------
>> *From:* Dhruba Borthakur [mailto:dhruba@gmail.com]
>> *Sent:* Thursday, January 14, 2010 9:04 AM
>>
>> *To:* hdfs-user@hadoop.apache.org
>> *Subject:* Re: Exponential performance decay when inserting large number
>> of blocks
>>
>> Here is another thing that came to my mind.
>>
>> The Namenode has a hash map in memory where it inserts all blocks. when a
>> new block needs to be allocated, the namenode first generates a random
>> number and checks to see if ti exists in the hashmap. If it does not exist
>> in the hash map, then that number is the block id of the to-be-allocated
>> block. The namenode then inserts this number into the hash map and sends it
>> to te client. The client receives it as the blockid and uses it to write
>> data to the datanode(s).
>>
>> One possibility is that that the time to do a hash-lookup varies depending
>> on the number of blocks in the hash.
>>
>> dhruba
>>
>>
>>
>>
>> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com> wrote:
>>
>>> >launched 8 instances of the bin/hadoop fs -put utility
>>> Zlatin, may be a silly question, are you running dfs -put locally on each
>>> datanode,  or from a single box
>>> Also where are you copying the data from, do you have local copies on
>>> each node before the insert or all your files reside on a single server, or
>>> may be on NFS?
>>> i would also chk the network stats on datanodes and namenode and see if
>>> the nics are not saturated, i guess you have enough bandwidth but may be
>>> there is some issue with NIC on the namenode or something, i saw strange
>>> things happening. you can probably monitor the number of conections/sockets,
>>> bandwidth, IO waits, # of threads
>>> if you are writing to dfs from a single location may be there is a
>>> problem on a single node to handle all this outbound traffic, if you are
>>> distributing files in parallel from multiple nodes, than mat be there is an
>>> inbound congestion on namenode or something like that
>>>
>>> if its not the case, i'd explore using distcp utility for copying data in
>>> parallel  (it comes with the distro)
>>> also if you really hit a wall, and have some time, i'd take look at
>>> alternatives to Filesystem API, may be simething like Fuse-DFS and other
>>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)
>>>
>>>
>>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> Err, ignore that attachment - attached the wrong graph with the right
>>>> labels!
>>>>
>>>> Here's the right graph.
>>>>
>>>> -Todd
>>>>
>>>>
>>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>
>>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net>wrote:
>>>>>
>>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>>> > Alex, Dhruba
>>>>>> >
>>>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>>>> doing
>>>>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>>>>>> > also running iostat on one of the datanodes.  Did not notice
>>>>>> anything
>>>>>> > that would explain an exponential slowdown.  There was more activity
>>>>>> > while the inserts were active but far from the limits of the disk
>>>>>> system.
>>>>>>
>>>>>> While creating many blocks, could it be that the replication pipe
>>>>>> lining
>>>>>> is eating up the available handler threads on the data nodes? By
>>>>>> increasing the block size you would see better performance because the
>>>>>> system spends more time writing data to local disk and less time
>>>>>> dealing
>>>>>> with things like replication "overhead." At a small block size, I
>>>>>> could
>>>>>> imagine you're artificially creating a situation where you saturate
>>>>>> the
>>>>>> default size configured thread pools or something weird like that.
>>>>>>
>>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>>> is
>>>>>> if testing with an artificially small block size like this is even a
>>>>>> viable test. At some point the overhead of talking to the name node,
>>>>>> selecting data nodes for a block, and setting up replication pipe
>>>>>> lines
>>>>>> could become some abnormally high percentage of the run time.
>>>>>>
>>>>>>
>>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>>> curve looks the way it does. Looking at the data, it looks like the
>>>>> insertion rate (blocks per second) is actually related as 1/n where N is the
>>>>> number of blocks. Attaching another graph of the same data which I think is
>>>>> a little clearer to read.
>>>>>
>>>>>
>>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>>> causing additional shuffling of data.
>>>>>>
>>>>>
>>>>> That's certainly one possibility.
>>>>>
>>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>>>> let the cluster sit for a few hours, then come back and start another
>>>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>>
>>>>> -Todd
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>>
>
>

Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
>I'm doing the insert from a node on the same rack as the cluster but it is
not part of it.
So it looks like you are copying from a single node, I'd try to run the
inserts from multiple nodes in parallel, to avoid IO and/or CPU and/or
network bottleneck on the "source" node. Try to upload from multiple
locations.

On Thu, Jan 14, 2010 at 12:13 PM, <Zl...@barclayscapital.com>wrote:

>   More general info:
>
> I'm doing the insert from a node on the same rack as the cluster but it is
> not part of it.  Data is being read from a local disk and the datanodes
> store to the local partitions as well.  The filesystem is ext3, but if it
> were an inode issue the 4-node cluster would perform much worse than the
> 11-node.  No MR jobs or any other activity is present - these are test
> clusters that I create and remove with HOD.  I'm using revision 897952 (the
> hdfs ui reports 897347 for some reason) , checked out the branch-0.20 a few
> days ago.
>
> Todd:
>
> I will repeat the test, waiting several hours after the first round of
> inserts.  Unless the balancer daemon starts by default, I have not started
> it.  The datablocks seemed uniformly spread amongst the datanodes.  I've
> added two additional metrics to be recorded by the datanode -
> DataNode.xmitsInProgress and DataNode.getXCeiverCount().  These are polled
> every 10 seconds.  If anyone wants me to add additional metrics at any
> component let me know.
>
>  > The test case is making files with a ton of blocks. Appending a block
> to an end of a file might be O(n) -
> > usually this isn't a problem since even large files are almost always
> <100 blocks, and the majority <10.
> > In the test, there are files with 50,000+ blocks, so O(n) runtime
> anywhere in the blocklist for a file is pretty bad.
>
> The files in the first test are 16k blocks each.  I am inserting the same
> file under different filenames in consecutive runs.  If that were the
> reason, the first insert should take the same amount of time as the last.
> Nevertheless, I will run the next test with 4k blocks per file and increase
> the number of consecutive insertions.
>
> Dhruba:
> Unless there is a very high number of collisions, the hashmap should
> perform in constant time.  Even if there were collisions, I would be seeing
> much higher CPU usage on the NameNode.  According to the metrics I've
> already sent, towards the end of the test the capacity of the BlockMap
> was 512k and the load approaching 0.66.
>
> Best Regards,
> Zlatin Balevsky
>
> P.S. I could not find contact info for the HOD developers.  I'd like to
> ask them to document the "walltime" and "idleness-limit" parameters!
>
>  ------------------------------
> *From:* Dhruba Borthakur [mailto:dhruba@gmail.com]
> *Sent:* Thursday, January 14, 2010 9:04 AM
>
> *To:* hdfs-user@hadoop.apache.org
> *Subject:* Re: Exponential performance decay when inserting large number
> of blocks
>
> Here is another thing that came to my mind.
>
> The Namenode has a hash map in memory where it inserts all blocks. when a
> new block needs to be allocated, the namenode first generates a random
> number and checks to see if ti exists in the hashmap. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The namenode then inserts this number into the hash map and sends it
> to te client. The client receives it as the blockid and uses it to write
> data to the datanode(s).
>
> One possibility is that that the time to do a hash-lookup varies depending
> on the number of blocks in the hash.
>
> dhruba
>
>
>
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com> wrote:
>
>> >launched 8 instances of the bin/hadoop fs -put utility
>> Zlatin, may be a silly question, are you running dfs -put locally on each
>> datanode,  or from a single box
>> Also where are you copying the data from, do you have local copies on each
>> node before the insert or all your files reside on a single server, or may
>> be on NFS?
>> i would also chk the network stats on datanodes and namenode and see if
>> the nics are not saturated, i guess you have enough bandwidth but may be
>> there is some issue with NIC on the namenode or something, i saw strange
>> things happening. you can probably monitor the number of conections/sockets,
>> bandwidth, IO waits, # of threads
>> if you are writing to dfs from a single location may be there is a problem
>> on a single node to handle all this outbound traffic, if you are
>> distributing files in parallel from multiple nodes, than mat be there is an
>> inbound congestion on namenode or something like that
>>
>> if its not the case, i'd explore using distcp utility for copying data in
>> parallel  (it comes with the distro)
>> also if you really hit a wall, and have some time, i'd take look at
>> alternatives to Filesystem API, may be simething like Fuse-DFS and other
>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)
>>
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>> > Alex, Dhruba
>>>>> >
>>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>>> doing
>>>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>>>>> > also running iostat on one of the datanodes.  Did not notice anything
>>>>> > that would explain an exponential slowdown.  There was more activity
>>>>> > while the inserts were active but far from the limits of the disk
>>>>> system.
>>>>>
>>>>> While creating many blocks, could it be that the replication pipe
>>>>> lining
>>>>> is eating up the available handler threads on the data nodes? By
>>>>> increasing the block size you would see better performance because the
>>>>> system spends more time writing data to local disk and less time
>>>>> dealing
>>>>> with things like replication "overhead." At a small block size, I could
>>>>> imagine you're artificially creating a situation where you saturate the
>>>>> default size configured thread pools or something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is
>>>>> if testing with an artificially small block size like this is even a
>>>>> viable test. At some point the overhead of talking to the name node,
>>>>> selecting data nodes for a block, and setting up replication pipe lines
>>>>> could become some abnormally high percentage of the run time.
>>>>>
>>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) is actually related as 1/n where N is the
>>>> number of blocks. Attaching another graph of the same data which I think is
>>>> a little clearer to read.
>>>>
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>>
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>
>>>> -Todd
>>>>
>>>
>>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>
>

RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
More general info:
 
I'm doing the insert from a node on the same rack as the cluster but it
is not part of it.  Data is being read from a local disk and the
datanodes store to the local partitions as well.  The filesystem is
ext3, but if it were an inode issue the 4-node cluster would perform
much worse than the 11-node.  No MR jobs or any other activity is
present - these are test clusters that I create and remove with HOD.
I'm using revision 897952 (the hdfs ui reports 897347 for some reason),
checked out from branch-0.20 a few days ago.
 
Todd:
 
I will repeat the test, waiting several hours after the first round of
inserts.  Unless the balancer daemon starts by default, I have not
started it.  The datablocks seemed uniformly spread amongst the
datanodes.  I've added two additional metrics to be recorded by the
datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount().
These are polled every 10 seconds.  If anyone wants me to add additional
metrics at any component let me know.
 
> The test case is making files with a ton of blocks. Appending a block
to an end of a file might be O(n) - 
> usually this isn't a problem since even large files are almost always
<100 blocks, and the majority <10. 
> In the test, there are files with 50,000+ blocks, so O(n) runtime
anywhere in the blocklist for a file is pretty bad.


The files in the first test are 16k blocks each.  I am inserting the
same file under different filenames in consecutive runs.  If that were
the reason, the first insert should take the same amount of time as the
last.  Nevertheless, I will run the next test with 4k blocks per file
and increase the number of consecutive insertions.
 
Dhruba:
Unless there is a very high number of collisions, the hashmap should
perform in constant time.  Even if there were collisions, I would be
seeing much higher CPU usage on the NameNode.  According to the metrics
I've already sent, towards the end of the test the capacity of the
BlockMap was 512k and the load approaching 0.66.  
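
For reference, standard hashing analysis backs this up (rough numbers, my
own arithmetic, assuming separate chaining as in java.util.HashMap, with
load factor \alpha = entries / capacity):

    E[\text{probes, successful lookup}] \approx 1 + \frac{\alpha}{2},
    \qquad
    E[\text{probes, unsuccessful lookup}] \approx 1 + \alpha

At \alpha \approx 0.66 both are under two probes, so lookups in the block
map should stay effectively constant-time as the block count grows.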
 
Best Regards,
Zlatin Balevsky
 
P.S. I could not find contact info for the HOD developers.  I'd like to
ask them to document the "walltime" and "idleness-limit" parameters!
 
________________________________

From: Dhruba Borthakur [mailto:dhruba@gmail.com] 
Sent: Thursday, January 14, 2010 9:04 AM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number
of blocks


Here is another thing that came to my mind.

The Namenode has a hash map in memory where it inserts all blocks. when
a new block needs to be allocated, the namenode first generates a random
number and checks to see if ti exists in the hashmap. If it does not
exist in the hash map, then that number is the block id of the
to-be-allocated block. The namenode then inserts this number into the
hash map and sends it to te client. The client receives it as the
blockid and uses it to write data to the datanode(s).

One possibility is that that the time to do a hash-lookup varies
depending on the number of blocks in the hash.

dhruba





On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com>
wrote:


	>launched 8 instances of the bin/hadoop fs -put utility
	Zlatin, may be a silly question, are you running dfs -put
locally on each datanode,  or from a single box 
	Also where are you copying the data from, do you have local
copies on each node before the insert or all your files reside on a
single server, or may be on NFS?
	i would also chk the network stats on datanodes and namenode and
see if the nics are not saturated, i guess you have enough bandwidth but
may be there is some issue with NIC on the namenode or something, i saw
strange things happening. you can probably monitor the number of
conections/sockets, bandwidth, IO waits, # of threads 
	if you are writing to dfs from a single location may be there is
a problem on a single node to handle all this outbound traffic, if you
are distributing files in parallel from multiple nodes, than mat be
there is an inbound congestion on namenode or something like that
	
	if its not the case, i'd explore using distcp utility for
copying data in parallel  (it comes with the distro)
	also if you really hit a wall, and have some time, i'd take look
at alternatives to Filesystem API, may be simething like Fuse-DFS and
other packages supported by libhdfs
(http://wiki.apache.org/hadoop/LibHDFS)
	
	
	
	On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon
<to...@cloudera.com> wrote:
	

		Err, ignore that attachment - attached the wrong graph
with the right labels! 

		Here's the right graph.

		-Todd 


		On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon
<to...@cloudera.com> wrote:
		

			On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer
<er...@lifeless.net> wrote:
			

				On 1/13/10 8:12 PM,
Zlatin.Balevsky@barclayscapital.com wrote:
				> Alex, Dhruba
				>
				> I repeated the experiment increasing
the block size to 32k.  Still doing
				> 8 inserts in parallel, file size now
is 512 MB; 11 datanodes.  I was
				> also running iostat on one of the
datanodes.  Did not notice anything
				> that would explain an exponential
slowdown.  There was more activity
				> while the inserts were active but far
from the limits of the disk system.
				
				
				While creating many blocks, could it be
that the replication pipe lining
				is eating up the available handler
threads on the data nodes? By
				increasing the block size you would see
better performance because the
				system spends more time writing data to
local disk and less time dealing
				with things like replication "overhead."
At a small block size, I could
				imagine you're artificially creating a
situation where you saturate the
				default size configured thread pools or
something weird like that.
				
				If you're doing 8 inserts in parallel
from one machine with 11 nodes
				this seems unlikely, but it might be
worth looking into. The question is
				if testing with an artificially small
block size like this is even a
				viable test. At some point the overhead
of talking to the name node,
				selecting data nodes for a block, and
setting up replication pipe lines
				could become some abnormally high
percentage of the run time.
				
				


			The concern isn't why the insertion is slow, but
rather why the scaling curve looks the way it does. Looking at the data,
it looks like the insertion rate (blocks per second) is actually related
as 1/n where N is the number of blocks. Attaching another graph of the
same data which I think is a little clearer to read.
			 

				Also, I wonder if the cluster is trying
to rebalance blocks toward the
				end of your runtime (if the balancer
daemon is running) and this is
				causing additional shuffling of data.
				


			That's certainly one possibility.

			Zlatin: here's a test to try: after the FS is
full with 400,000 blocks, let the cluster sit for a few hours, then come
back and start another insertion. Is the rate slow, or does it return to
the fast starting speed?

			
			-Todd






-- 
Connect to me at http://www.facebook.com/dhruba



Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
I have another conjecture about this:

The test case is making files with a ton of blocks. Appending a block to the
end of a file might be O(n) - usually this isn't a problem since even large
files are almost always <100 blocks, and the majority <10. In the test,
there are files with 50,000+ blocks, so O(n) runtime anywhere in the
blocklist for a file is pretty bad.
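
For illustration only (a guess at one possible mechanism, not a claim about
what branch-0.20 actually does): if a file's block list were kept as a plain
array that is reallocated and copied on every append, the k-th append would
cost O(k), and building a 50,000-block file would cost O(n^2) in total:

    // Hypothetical array-backed block list with copy-on-append.
    class FileBlockList {
        private long[] blockIds = new long[0];

        void addBlock(long blockId) {
            long[] grown = new long[blockIds.length + 1];              // new array on every append
            System.arraycopy(blockIds, 0, grown, 0, blockIds.length);  // O(n) copy
            grown[blockIds.length] = blockId;
            blockIds = grown;
        }
    }

Whether this alone explains the cross-run slowdown is a separate question,
since each new file's list starts out empty.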

-Todd

On Thu, Jan 14, 2010 at 6:03 AM, Dhruba Borthakur <dh...@gmail.com> wrote:

> Here is another thing that came to my mind.
>
> The Namenode has a hash map in memory where it inserts all blocks. when a
> new block needs to be allocated, the namenode first generates a random
> number and checks to see if ti exists in the hashmap. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The namenode then inserts this number into the hash map and sends it
> to te client. The client receives it as the blockid and uses it to write
> data to the datanode(s).
>
> One possibility is that that the time to do a hash-lookup varies depending
> on the number of blocks in the hash.
>
> dhruba
>
>
>
>
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com> wrote:
>
>> >launched 8 instances of the bin/hadoop fs -put utility
>> Zlatin, may be a silly question, are you running dfs -put locally on each
>> datanode,  or from a single box
>> Also where are you copying the data from, do you have local copies on each
>> node before the insert or all your files reside on a single server, or may
>> be on NFS?
>> i would also chk the network stats on datanodes and namenode and see if
>> the nics are not saturated, i guess you have enough bandwidth but may be
>> there is some issue with NIC on the namenode or something, i saw strange
>> things happening. you can probably monitor the number of conections/sockets,
>> bandwidth, IO waits, # of threads
>> if you are writing to dfs from a single location may be there is a problem
>> on a single node to handle all this outbound traffic, if you are
>> distributing files in parallel from multiple nodes, than mat be there is an
>> inbound congestion on namenode or something like that
>>
>> if its not the case, i'd explore using distcp utility for copying data in
>> parallel  (it comes with the distro)
>> also if you really hit a wall, and have some time, i'd take look at
>> alternatives to Filesystem API, may be simething like Fuse-DFS and other
>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)
>>
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>> > Alex, Dhruba
>>>>> >
>>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>>> doing
>>>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>>>>> > also running iostat on one of the datanodes.  Did not notice anything
>>>>> > that would explain an exponential slowdown.  There was more activity
>>>>> > while the inserts were active but far from the limits of the disk
>>>>> system.
>>>>>
>>>>> While creating many blocks, could it be that the replication pipe
>>>>> lining
>>>>> is eating up the available handler threads on the data nodes? By
>>>>> increasing the block size you would see better performance because the
>>>>> system spends more time writing data to local disk and less time
>>>>> dealing
>>>>> with things like replication "overhead." At a small block size, I could
>>>>> imagine you're artificially creating a situation where you saturate the
>>>>> default size configured thread pools or something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is
>>>>> if testing with an artificially small block size like this is even a
>>>>> viable test. At some point the overhead of talking to the name node,
>>>>> selecting data nodes for a block, and setting up replication pipe lines
>>>>> could become some abnormally high percentage of the run time.
>>>>>
>>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) is actually related as 1/n where N is the
>>>> number of blocks. Attaching another graph of the same data which I think is
>>>> a little clearer to read.
>>>>
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>>
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>
>>>> -Todd
>>>>
>>>
>>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: Exponential performance decay when inserting large number of blocks

Posted by Dhruba Borthakur <dh...@gmail.com>.
Here is another thing that came to my mind.

The Namenode has a hash map in memory where it inserts all blocks. When a
new block needs to be allocated, the namenode first generates a random
number and checks to see if it exists in the hashmap. If it does not exist
in the hash map, then that number is the block id of the to-be-allocated
block. The namenode then inserts this number into the hash map and sends it
to the client. The client receives it as the blockid and uses it to write
data to the datanode(s).

One possibility is that the time to do a hash-lookup varies depending
on the number of blocks in the hash.
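
A minimal sketch of that allocation scheme, purely for illustration (this is
not the actual NameNode code; the class and method names below are made up):

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Hypothetical stand-in for the NameNode's block id bookkeeping.
    public class BlockIdAllocator {
        private final Set<Long> allocatedIds = new HashSet<Long>();
        private final Random random = new Random();

        // Draw random ids until one is unused, then record it.
        public synchronized long allocateBlockId() {
            long id = random.nextLong();
            while (allocatedIds.contains(id)) {   // collisions should be very rare
                id = random.nextLong();
            }
            allocatedIds.add(id);
            return id;
        }
    }

Both contains() and add() on a HashSet are expected O(1) regardless of how
many ids are already stored, so if this path were the bottleneck the hashing
itself would have to be degrading in some unusual way.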

dhruba




On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <al...@gmail.com> wrote:

> >launched 8 instances of the bin/hadoop fs -put utility
> Zlatin, may be a silly question, are you running dfs -put locally on each
> datanode,  or from a single box
> Also where are you copying the data from, do you have local copies on each
> node before the insert or all your files reside on a single server, or may
> be on NFS?
> i would also chk the network stats on datanodes and namenode and see if the
> nics are not saturated, i guess you have enough bandwidth but may be there
> is some issue with NIC on the namenode or something, i saw strange things
> happening. you can probably monitor the number of conections/sockets,
> bandwidth, IO waits, # of threads
> if you are writing to dfs from a single location may be there is a problem
> on a single node to handle all this outbound traffic, if you are
> distributing files in parallel from multiple nodes, than mat be there is an
> inbound congestion on namenode or something like that
>
> if its not the case, i'd explore using distcp utility for copying data in
> parallel  (it comes with the distro)
> also if you really hit a wall, and have some time, i'd take look at
> alternatives to Filesystem API, may be simething like Fuse-DFS and other
> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)
>
>
> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Err, ignore that attachment - attached the wrong graph with the right
>> labels!
>>
>> Here's the right graph.
>>
>> -Todd
>>
>>
>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:
>>>
>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>> > Alex, Dhruba
>>>> >
>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>> doing
>>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>>>> > also running iostat on one of the datanodes.  Did not notice anything
>>>> > that would explain an exponential slowdown.  There was more activity
>>>> > while the inserts were active but far from the limits of the disk
>>>> system.
>>>>
>>>> While creating many blocks, could it be that the replication pipe lining
>>>> is eating up the available handler threads on the data nodes? By
>>>> increasing the block size you would see better performance because the
>>>> system spends more time writing data to local disk and less time dealing
>>>> with things like replication "overhead." At a small block size, I could
>>>> imagine you're artificially creating a situation where you saturate the
>>>> default size configured thread pools or something weird like that.
>>>>
>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>> this seems unlikely, but it might be worth looking into. The question is
>>>> if testing with an artificially small block size like this is even a
>>>> viable test. At some point the overhead of talking to the name node,
>>>> selecting data nodes for a block, and setting up replication pipe lines
>>>> could become some abnormally high percentage of the run time.
>>>>
>>>>
>>> The concern isn't why the insertion is slow, but rather why the scaling
>>> curve looks the way it does. Looking at the data, it looks like the
>>> insertion rate (blocks per second) is actually related as 1/n where N is the
>>> number of blocks. Attaching another graph of the same data which I think is
>>> a little clearer to read.
>>>
>>>
>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>> end of your runtime (if the balancer daemon is running) and this is
>>>> causing additional shuffling of data.
>>>>
>>>
>>> That's certainly one possibility.
>>>
>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>> let the cluster sit for a few hours, then come back and start another
>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>
>>> -Todd
>>>
>>
>>
>


-- 
Connect to me at http://www.facebook.com/dhruba

Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
>launched 8 instances of the bin/hadoop fs -put utility
Zlatin, maybe a silly question: are you running dfs -put locally on each
datanode, or from a single box?
Also, where are you copying the data from? Do you have local copies on each
node before the insert, or do all your files reside on a single server, or
maybe on NFS?
I would also check the network stats on datanodes and namenode and see if the
NICs are not saturated. I guess you have enough bandwidth, but maybe there
is some issue with the NIC on the namenode or something; I have seen strange
things happen. You can probably monitor the number of connections/sockets,
bandwidth, IO waits, # of threads.
If you are writing to dfs from a single location, maybe there is a problem
with a single node handling all this outbound traffic; if you are
distributing files in parallel from multiple nodes, then maybe there is
inbound congestion on the namenode or something like that.

If that's not the case, I'd explore using the distcp utility for copying data
in parallel (it comes with the distro).
Also, if you really hit a wall and have some time, I'd take a look at
alternatives to the FileSystem API, maybe something like Fuse-DFS and other
packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS)


On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Err, ignore that attachment - attached the wrong graph with the right
> labels!
>
> Here's the right graph.
>
> -Todd
>
>
> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:
>>
>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>> > Alex, Dhruba
>>> >
>>> > I repeated the experiment increasing the block size to 32k.  Still
>>> doing
>>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>>> > also running iostat on one of the datanodes.  Did not notice anything
>>> > that would explain an exponential slowdown.  There was more activity
>>> > while the inserts were active but far from the limits of the disk
>>> system.
>>>
>>> While creating many blocks, could it be that the replication pipe lining
>>> is eating up the available handler threads on the data nodes? By
>>> increasing the block size you would see better performance because the
>>> system spends more time writing data to local disk and less time dealing
>>> with things like replication "overhead." At a small block size, I could
>>> imagine you're artificially creating a situation where you saturate the
>>> default size configured thread pools or something weird like that.
>>>
>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>> this seems unlikely, but it might be worth looking into. The question is
>>> if testing with an artificially small block size like this is even a
>>> viable test. At some point the overhead of talking to the name node,
>>> selecting data nodes for a block, and setting up replication pipe lines
>>> could become some abnormally high percentage of the run time.
>>>
>>>
>> The concern isn't why the insertion is slow, but rather why the scaling
>> curve looks the way it does. Looking at the data, it looks like the
>> insertion rate (blocks per second) is actually related as 1/n where N is the
>> number of blocks. Attaching another graph of the same data which I think is
>> a little clearer to read.
>>
>>
>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>> end of your runtime (if the balancer daemon is running) and this is
>>> causing additional shuffling of data.
>>>
>>
>> That's certainly one possibility.
>>
>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>> let the cluster sit for a few hours, then come back and start another
>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>
>> -Todd
>>
>
>

Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
Err, ignore that attachment - attached the wrong graph with the right
labels!

Here's the right graph.

-Todd

On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:
>
>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>> > Alex, Dhruba
>> >
>> > I repeated the experiment increasing the block size to 32k.  Still doing
>> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
>> > also running iostat on one of the datanodes.  Did not notice anything
>> > that would explain an exponential slowdown.  There was more activity
>> > while the inserts were active but far from the limits of the disk
>> system.
>>
>> While creating many blocks, could it be that the replication pipe lining
>> is eating up the available handler threads on the data nodes? By
>> increasing the block size you would see better performance because the
>> system spends more time writing data to local disk and less time dealing
>> with things like replication "overhead." At a small block size, I could
>> imagine you're artificially creating a situation where you saturate the
>> default size configured thread pools or something weird like that.
>>
>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>> this seems unlikely, but it might be worth looking into. The question is
>> if testing with an artificially small block size like this is even a
>> viable test. At some point the overhead of talking to the name node,
>> selecting data nodes for a block, and setting up replication pipe lines
>> could become some abnormally high percentage of the run time.
>>
>>
> The concern isn't why the insertion is slow, but rather why the scaling
> curve looks the way it does. Looking at the data, it looks like the
> insertion rate (blocks per second) is actually related as 1/n where N is the
> number of blocks. Attaching another graph of the same data which I think is
> a little clearer to read.
>
>
>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>> end of your runtime (if the balancer daemon is running) and this is
>> causing additional shuffling of data.
>>
>
> That's certainly one possibility.
>
> Zlatin: here's a test to try: after the FS is full with 400,000 blocks, let
> the cluster sit for a few hours, then come back and start another insertion.
> Is the rate slow, or does it return to the fast starting speed?
>
> -Todd
>

Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <er...@lifeless.net> wrote:

> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
> > Alex, Dhruba
> >
> > I repeated the experiment increasing the block size to 32k.  Still doing
> > 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
> > also running iostat on one of the datanodes.  Did not notice anything
> > that would explain an exponential slowdown.  There was more activity
> > while the inserts were active but far from the limits of the disk system.
>
> While creating many blocks, could it be that the replication pipe lining
> is eating up the available handler threads on the data nodes? By
> increasing the block size you would see better performance because the
> system spends more time writing data to local disk and less time dealing
> with things like replication "overhead." At a small block size, I could
> imagine you're artificially creating a situation where you saturate the
> default size configured thread pools or something weird like that.
>
> If you're doing 8 inserts in parallel from one machine with 11 nodes
> this seems unlikely, but it might be worth looking into. The question is
> if testing with an artificially small block size like this is even a
> viable test. At some point the overhead of talking to the name node,
> selecting data nodes for a block, and setting up replication pipe lines
> could become some abnormally high percentage of the run time.
>
>
The concern isn't why the insertion is slow, but rather why the scaling
curve looks the way it does. Looking at the data, it looks like the
insertion rate (blocks per second) actually goes as 1/n, where n is the
number of blocks already inserted. Attaching another graph of the same data
which I think is a little clearer to read.
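
A quick back-of-the-envelope check on that reading (my own arithmetic, not
taken from the measured data): a rate proportional to 1/n is exactly what
you get when each new block costs time proportional to the number of blocks
already present,

    \frac{dn}{dt} = \frac{c}{n}
    \;\Rightarrow\;
    n(t) = \sqrt{2ct}
    \;\Rightarrow\;
    T(N) = \frac{N^2}{2c},

so the total time for N blocks grows quadratically and each successive run
of the same size should take noticeably longer than the one before it.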


> Also, I wonder if the cluster is trying to rebalance blocks toward the
> end of your runtime (if the balancer daemon is running) and this is
> causing additional shuffling of data.
>

That's certainly one possibility.

Zlatin: here's a test to try: after the FS is full with 400,000 blocks, let
the cluster sit for a few hours, then come back and start another insertion.
Is the rate slow, or does it return to the fast starting speed?

-Todd

Re: Exponential performance decay when inserting large number of blocks

Posted by Eric Sammer <er...@lifeless.net>.
On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
> Alex, Dhruba
>  
> I repeated the experiment increasing the block size to 32k.  Still doing
> 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
> also running iostat on one of the datanodes.  Did not notice anything
> that would explain an exponential slowdown.  There was more activity
> while the inserts were active but far from the limits of the disk system.

While creating many blocks, could it be that the replication pipelining
is eating up the available handler threads on the data nodes? By
increasing the block size you would see better performance because the
system spends more time writing data to local disk and less time dealing
with things like replication "overhead." At a small block size, I could
imagine you're artificially creating a situation where you saturate the
default-sized thread pools or something weird like that.

If you're doing 8 inserts in parallel from one machine with 11 nodes
this seems unlikely, but it might be worth looking into. The question is
whether testing with an artificially small block size like this is even a
viable test. At some point the overhead of talking to the name node,
selecting data nodes for a block, and setting up replication pipelines
could become an abnormally high percentage of the run time.

Also, I wonder if the cluster is trying to rebalance blocks toward the
end of your runtime (if the balancer daemon is running) and this is
causing additional shuffling of data.

Just throwing ideas out there. I don't know if this is reasonable at
all. I've never tested with a small block size like that and I don't
know the exact amount of overhead in some of these bits of the code.

Regards.
-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com

Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
Hmm, I used to work with a single GB file rather than many 16MB files.
In Tom White's book "Hadoop: The Definitive Guide" he also recommends
merging many files into one (with a simple awk script, for example, or just
a cat command if the start and end of each file are easily identifiable). He
shows how to use the InputFormat and OutputFormat interfaces to parse such files.

Cloudera blog has an article about  "small files problem":
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
It says: "Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so
100KB files.
The 10,000 files use one map each, and the job time can be tens or hundreds
of times slower than the equivalent one with a single input file"

But in your case your input is 16MB and the block size 32k. It would create
a couple of thousand maps; instead I would probably try a 16MB input
file with a block size of 16MB, so that basically each file would be assigned
a dedicated mapper "thread". Since you have many of those files, let's say a
hundred, Hadoop will create 100 maps (you just need to update the max.map.tasks
parameter).
You can double the number of maps by cutting the block size in half (to 8
MB). Once again, that would only speed up processing; I don't think splitting
the input with different block sizes would affect insert time significantly,
and I think you proved that with your test.

I mean, no matter what the block size is, it still needs to update the
metadata on the namenode and update the datanodes, but now instead of, let's
say, a single write per file (with a 16MB block) it will do it a couple of
thousand times with the smaller 32k blocks.

It's just my 50 cents; I have worked with Hadoop for only a couple of nights
for a term project, so maybe there are some Cloudera guys who can better help.

Cheers
Alex



On Wed, Jan 13, 2010 at 8:12 PM, <Zl...@barclayscapital.com>wrote:

>  Alex, Dhruba
>
> I repeated the experiment increasing the block size to 32k.  Still doing 8
> inserts in parallel, file size now is 512 MB; 11 datanodes.  I was also
> running iostat on one of the datanodes.  Did not notice anything that would
> explain an exponential slowdown.  There was more activity while the inserts
> were active but far from the limits of the disk system.
>
> There is definitely some improvement - I was able to finish 3 rounds of
> inserts for total of 12GB.  However, the shape is still logarithmic.
> Unfortunately, I am constrained with disk space and cannot test a block size
> that is even close to the real world sizes.
>
> Attached are the collected metrics and a graph comparing the three
> inserts.  As before, you'll notice gaps in the metrics between runs of the
> insert script which have been edited out of the graph.
>
> Best Regards,
> Zlatin Balevsky
>
>
>
>  ------------------------------
> *From:* alex kamil [mailto:alex.kamil@gmail.com]
> *Sent:* Wednesday, January 13, 2010 7:03 PM
>
> *To:* hdfs-user@hadoop.apache.org
> *Subject:* Re: Exponential performance decay when inserting large number
> of blocks
>
> Zlatin, nevermind, just noticed that you're measuring the dfs insert time.
> 1k blocks are way too small anyway
>
> On Wed, Jan 13, 2010 at 6:08 PM, alex kamil <al...@gmail.com> wrote:
>
>> Zlatin,
>> i dont know what is the nature of the job that you are running but it
>> looks like you are hitting an io bottleneck, try to multiply the size of
>> input data and test with 3-5GB at least, by changing the block size in this
>> case you can increase the number of maps and scale your app. (Round the
>> block size to a multiple of 512)
>> i attached some results from my experiments with KNN scaling on 4 nodes
>> cluster, it didn't scale linearly but it wasn't too bad, it sometimes depends
>> on the algorithm, with other algos, like PCA/SVD the scaling was linear
>> also note that sometimes the bottleneck is in the Reduce step, i added a
>> Combiner between mapper and reducer in my code and it affected scalability
>> significantly.
>>
>> let me know if you have any questions
>> Alex Kamil
>> ak2834@columbia.edu
>>
>>
>> On Wed, Jan 13, 2010 at 5:35 PM, Dhruba Borthakur <dh...@gmail.com>wrote:
>>
>>> Another thing to observe is the rate of IO on the datanodes. Maybe u can
>>> do a sar/iostat on the datanodes and see if the datanode devices show an
>>> increase in activity while inserting the last lot of blocks. One possibility
>>> is that the OS cache on the datanodes cached most of the data from the first
>>> few runs, but when more and more data started arriving on the datanode it
>>> triggered more flushing of OS buffers. (on the datanode).
>>>
>>> thanks,
>>> dhruba
>>>
>>>
>>>
>>> On Wed, Jan 13, 2010 at 2:18 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> Hey Zlatin,
>>>>
>>>> Thanks for the explanation and the additional data. I'm a bit busy today
>>>> but will try to go through the data and reproduce the results later this
>>>> week.
>>>>
>>>> -Todd
>>>>
>>>>
>>>> On Wed, Jan 13, 2010 at 2:07 PM, <Zl...@barclayscapital.com>wrote:
>>>>
>>>>>  Todd,
>>>>>
>>>>> I used a shell script that launched 8 instances of the bin/hadoop fs
>>>>> -put utility.  After all 8 processes were done and I verified through the web
>>>>> ui that the files were inserted, I re-launched the script manually again.
>>>>> That is why you'll notice that in the metrics there are two short periods
>>>>> without any activity (I edited those out from the graph).  There were
>>>>> occasional NotReplicatedYet exceptions in the logs of those processes, but
>>>>> they were occurring at constant rate.
>>>>>
>>>>> I did not run a profiler, but that will eventually be the next step.
>>>>> I'm attaching the metrics from the namenode and one of the datanodes from
>>>>> the experiment with 4 datanodes.  They were recorded every 10 seconds.  Heap
>>>>> size for all processes is 2GB, and while there was occasional CPU usage on
>>>>> the Namenode it was never 100%.  (and there are plenty of cores).
>>>>>
>>>>> Ultimately the block size will be much larger than the default as the
>>>>> total data will be in the 2^(well over 50) range.  With this test I am
>>>>> trying to determine if there are any bottlenecks at the NameNode  component.
>>>>>
>>>>> Best Regards,
>>>>> Zlatin Balevsky
>>>>>
>>>>>  ------------------------------
>>>>> *From:* Todd Lipcon [mailto:todd@cloudera.com]
>>>>> *Sent:* Wednesday, January 13, 2010 4:34 PM
>>>>> *To:* hdfs-user@hadoop.apache.org
>>>>> *Subject:* Re: Exponential performance decay when inserting large
>>>>> number of blocks
>>>>>
>>>>>   Also, if you have the program you used to do the insertions, and
>>>>> could attach it, I'd be interested in trying to replicate this on a test
>>>>> cluster. If you can't redistribute it, I can start from scratch, but would
>>>>> be easier to run yours.
>>>>>
>>>>> Thanks
>>>>> -Todd
>>>>>
>>>>> On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com>wrote:
>>>>>
>>>>>> Hi Zlatin,
>>>>>>
>>>>>> This is a very interesting test you've run, and certainly not expected
>>>>>> results. I know of many clusters happily chugging along with millions of
>>>>>> blocks, so problems at 400K are very strange. By any chance were you able to
>>>>>> collect profiling information from the NameNode while running this test?
>>>>>>
>>>>>> That said, I hope you've set the block size to 1KB for the purpose of
>>>>>> this test and not because you expect to run that in production. Recommended
>>>>>> block sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 1:21 PM, <Zlatin.Balevsky@barclayscapital.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> I am testing how HDFS scales with very large number of blocks.  I did
>>>>>>> the following setup:
>>>>>>>
>>>>>>> Set the default blocks size to 1KB
>>>>>>> Started 8 insert processes, each inserting a 16MB file
>>>>>>> Repeated the insert 3 times, keeping the already inserted files in
>>>>>>> HDFS
>>>>>>> Repeated the entire experiment on one cluster with 4 and another with
>>>>>>> 11
>>>>>>> identical datanodes (allocated through HOD)
>>>>>>>
>>>>>>> Results:
>>>>>>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The
>>>>>>> second
>>>>>>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>>>>>>> cluster was marginally faster.
>>>>>>>
>>>>>>> Throughout this I was storing all available metrics.  There were no
>>>>>>> signs of insufficient memory on any of the nodes; CPU usage and
>>>>>>> garbage
>>>>>>> collections were constant throughout.  If anyone is interested I can
>>>>>>> provide the recorded metrics.  I've attached a chart that looks
>>>>>>> clearly
>>>>>>> logarithmic.
>>>>>>>
>>>>>>> Can anyone please point to what could be the bottleneck here?  I'm
>>>>>>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>>>>>>> blocks.
>>>>>>>
>>>>>>> Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
>>>>>>> Zlatin Balevsky
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>>
>>>>>>> This e-mail may contain information that is confidential, privileged
>>>>>>> or otherwise protected from disclosure. If you are not an intended recipient
>>>>>>> of this e-mail, do not duplicate or redistribute it by any means. Please
>>>>>>> delete it and any attachments and notify the sender that you have received
>>>>>>> it in error. Unless specifically indicated, this e-mail is not an offer to
>>>>>>> buy or sell or a solicitation to buy or sell any securities, investment
>>>>>>> products or other financial product or service, an official confirmation of
>>>>>>> any transaction, or an official statement of Barclays. Any views or opinions
>>>>>>> presented are solely those of the author and do not necessarily represent
>>>>>>> those of Barclays. This e-mail is subject to terms available at the
>>>>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>>>>> Barclays you consent to the foregoing.  Barclays Capital is the investment
>>>>>>> banking division of Barclays Bank PLC, a company registered in England
>>>>>>> (number 1026167) with its registered office at 1 Churchill Place, London,
>>>>>>> E14 5HP.  This email may relate to or be sent from other members of the
>>>>>>> Barclays Group.
>>>>>>> _______________________________________________
>>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>>
>>>>>
>>>>>
>>>>> This e-mail may contain information that is confidential, privileged or
>>>>> otherwise protected from disclosure. If you are not an intended recipient of
>>>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>>>> it and any attachments and notify the sender that you have received it in
>>>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>>>> sell or a solicitation to buy or sell any securities, investment products or
>>>>> other financial product or service, an official confirmation of any
>>>>> transaction, or an official statement of Barclays. Any views or opinions
>>>>> presented are solely those of the author and do not necessarily represent
>>>>> those of Barclays. This e-mail is subject to terms available at the
>>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>>> Barclays you consent to the foregoing.  Barclays Capital is the
>>>>> investment banking division of Barclays Bank PLC, a company registered in
>>>>> England (number 1026167) with its registered office at 1 Churchill Place,
>>>>> London, E14 5HP.  This email may relate to or be sent from other
>>>>> members of the Barclays Group.**
>>>>>
>>>>> _______________________________________________
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Connect to me at http://www.facebook.com/dhruba
>>>
>>
>>
>  _______________________________________________
>
>
>
> This e-mail may contain information that is confidential, privileged or
> otherwise protected from disclosure. If you are not an intended recipient of
> this e-mail, do not duplicate or redistribute it by any means. Please delete
> it and any attachments and notify the sender that you have received it in
> error. Unless specifically indicated, this e-mail is not an offer to buy or
> sell or a solicitation to buy or sell any securities, investment products or
> other financial product or service, an official confirmation of any
> transaction, or an official statement of Barclays. Any views or opinions
> presented are solely those of the author and do not necessarily represent
> those of Barclays. This e-mail is subject to terms available at the
> following link: www.barcap.com/emaildisclaimer. By messaging with Barclays
> you consent to the foregoing.  Barclays Capital is the investment banking
> division of Barclays Bank PLC, a company registered in England (number
> 1026167) with its registered office at 1 Churchill Place, London, E14 5HP.
> This email may relate to or be sent from other members of the Barclays
> Group.**
>
> _______________________________________________
>

RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
Alex, Dhruba
 
I repeated the experiment increasing the block size to 32k.  Still doing
8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
also running iostat on one of the datanodes.  Did not notice anything
that would explain an exponential slowdown.  There was more activity
while the inserts were active but far from the limits of the disk
system.
 
There is definitely some improvement - I was able to finish 3 rounds of
inserts for a total of 12GB.  However, the shape is still logarithmic.
Unfortunately, I am constrained by disk space and cannot test a block
size that is even close to the real-world sizes.
 
Attached are the collected metrics and a graph comparing the three
inserts.  As before, you'll notice gaps in the metrics between runs of
the insert script which have been edited out of the graph.
 
Best Regards,
Zlatin Balevsky
 
 


________________________________

From: alex kamil [mailto:alex.kamil@gmail.com] 
Sent: Wednesday, January 13, 2010 7:03 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number
of blocks


Zlatin, nevermind, just noticed that you're measuring the dfs insert
time.
1k blocks are way too small anyway


On Wed, Jan 13, 2010 at 6:08 PM, alex kamil <al...@gmail.com>
wrote:


	Zlatin, 
	i dont know what is the nature of the job that you are running
but it looks like you are hitting an io bottleneck, try to multiple the
size of input data and test with 3-5GB at least, by changing the block
size in this case you can increase the number of maps and scale your
app. (Round the block size to a multiple of 512)
	i attached some results from my experiments with KNN sclaing on
4 nodes cluster, it didnt scale linearly but it wasn't too bad, it
sometimes depends on the algorithm, with other algos, like PCA/SVD the
scaling was linear  
	also note that sometimes the bottleneck in in Reduce step, i
added a Combiner between mapper and reducer in my code and it affected
scalability significantly.
	
	let me know if you have any questions
	Alex Kamil
	ak2834@columbia.edu 


	On Wed, Jan 13, 2010 at 5:35 PM, Dhruba Borthakur <
dhruba@gmail.com> wrote:
	

		Another thing to observe is the rate of IO on the
datanodes. Maybe u can do a sar/iostat on the datanodes and see if the
ddatanode devices show an increase in activity while inserting the last
lot of blocks. One posiblity is that the OS cache on the datanodes
cached most of the data from the first few runs, but when more and more
data started arriving on the datanode it triggered more flushing of OS
buffers. (on the datanode).
		
		thanks,
		dhruba 



		On Wed, Jan 13, 2010 at 2:18 PM, Todd Lipcon <
todd@cloudera.com> wrote:
		

			Hey Zlatin, 

			Thanks for the explanation and the additional
data. I'm a bit busy today but will try to go through the data and
reproduce the results later this week.

			-Todd 


			On Wed, Jan 13, 2010 at 2:07 PM, <
Zlatin.Balevsky@barclayscapital.com> wrote:
			

				Todd,
				 
				I used a shell script that launched 8
instances of the bin/hadoop fs -put utility.  After all 8 processes were
done and I verified through the web ui that the files were inserted, I
re-launched the script manually again.  That is why you'll notice that
in the metrics there are two short periods without any activity (I
edited those out from the graph).  There were occasional
NotReplicatedYet exceptions in the logs of those processes, but they
were occurring at constant rate.
				 
				I did not run a profiler, but that will
eventually be the next step.  I'm attaching the metrics from the
namenode and one of the datanodes from the experiment with 4 datanodes.
They were recorded every 10 seconds.  Heap size for all processes is
2GB, and while there was occasional CPU usage on the Namenode it was
never 100%.  (and there are plenty of cores).
				 
				Ultimately the block size will be much
larger than the default as the total data will be in the 2^(well over
50) range.  With this test I am trying to determine if there are any
bottlenecks at the NameNode  component.
				 
				Best Regards,
				Zlatin Balevsky
				 
________________________________

				From: Todd Lipcon [mailto:
todd@cloudera.com] 
				Sent: Wednesday, January 13, 2010 4:34
PM
				To: hdfs-user@hadoop.apache.org
				Subject: Re: Exponential performance
decay when inserting large number of blocks
				
				
				Also, if you have the program you used
to do the insertions, and could attach it, I'd be interested in trying
to replicate this on a test cluster. If you can't redistribute it, I can
start from scratch, but would be easier to run yours. 

				Thanks
				-Todd
				
				
				On Wed, Jan 13, 2010 at 1:31 PM, Todd
Lipcon <to...@cloudera.com> wrote:
				

				Hi Zlatin, 

				This is a very interesting test you've
run, and certainly not expected results. I know of many clusters happily
chugging along with millions of blocks, so problems at 400K are very
strange. By any chance were you able to collect profiling information
from the NameNode while running this test?

				That said, I hope you've set the block
size to 1KB for the purpose of this test and not because you expect to
run that in production. Recommended block sizes are at least 64MB and
often 128MB or 256MB for larger clusters.

				Thanks
				-Todd

				On Wed, Jan 13, 2010 at 1:21 PM, <
Zlatin.Balevsky@barclayscapital.com> wrote:
				

				Greetings,
				
				I am testing how HDFS scales with very
large number of blocks.  I did
				the following setup:
				
				Set the default blocks size to 1KB
				Started 8 insert processes, each
inserting a 16MB file
				Repeated the insert 3 times, keeping the
already inserted files in HDFS
				Repeated the entire experiment on one
cluster with 4 and another with 11
				identical datanodes (allocated through
HOD)
				
				Results:
				The first 128MB (2^18 blocks) insert
finished in 5 minutes.  The second
				in 12 minutes.  The third didn't finish
within 1 hour.  The 11-node
				cluster was marginally faster.
				
				Throughout this I was storing all
available metrics.  There were no
				signs of insufficient memory on any of
the nodes; CPU usage and garbage
				collections were constant throughout.
If anyone is interested I can
				provide the recorded metrics.  I've
attached a chart that looks clearly
				logarithmic.
				
				Can anyone please point to what could be
the bottleneck here?  I'm
				evaluating HDFS for usage scenarios
requiring 2^(a lot more than 18)
				blocks.
				
				Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
				Zlatin Balevsky
				
	
_______________________________________________
				
				This e-mail may contain information that
is confidential, privileged or otherwise protected from disclosure. If
you are not an intended recipient of this e-mail, do not duplicate or
redistribute it by any means. Please delete it and any attachments and
notify the sender that you have received it in error. Unless
specifically indicated, this e-mail is not an offer to buy or sell or a
solicitation to buy or sell any securities, investment products or other
financial product or service, an official confirmation of any
transaction, or an official statement of Barclays. Any views or opinions
presented are solely those of the author and do not necessarily
represent those of Barclays. This e-mail is subject to terms available
at the following link: www.barcap.com/emaildisclaimer. By messaging with
Barclays you consent to the foregoing.  Barclays Capital is the
investment banking division of Barclays Bank PLC, a company registered
in England (number 1026167) with its registered office at 1 Churchill
Place, London, E14 5HP.  This email may relate to or be sent from other
members of the Barclays Group.
	
_______________________________________________
				



	
_______________________________________________

				 

				This e-mail may contain information that
is confidential, privileged or otherwise protected from disclosure. If
you are not an intended recipient of this e-mail, do not duplicate or
redistribute it by any means. Please delete it and any attachments and
notify the sender that you have received it in error. Unless
specifically indicated, this e-mail is not an offer to buy or sell or a
solicitation to buy or sell any securities, investment products or other
financial product or service, an official confirmation of any
transaction, or an official statement of Barclays. Any views or opinions
presented are solely those of the author and do not necessarily
represent those of Barclays. This e-mail is subject to terms available
at the following link: www.barcap.com/emaildisclaimer. By messaging with
Barclays you consent to the foregoing.  Barclays Capital is the
investment banking division of Barclays Bank PLC, a company registered
in England (number 1026167) with its registered office at 1 Churchill
Place, London, E14 5HP.  This email may relate to or be sent from other
members of the Barclays Group.

	
_______________________________________________





		-- 
		Connect to me at http://www.facebook.com/dhruba
		




_______________________________________________

This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure. If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing.  Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP.  This email may relate to or be sent from other members of the Barclays Group.
_______________________________________________

Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
Zlatin, nevermind, just noticed that you're measuring the dfs insert time.
1k blocks are way too small anyway

On Wed, Jan 13, 2010 at 6:08 PM, alex kamil <al...@gmail.com> wrote:

> Zlatin,
> i dont know what is the nature of the job that you are running but it looks
> like you are hitting an io bottleneck, try to multiply the size of input
> data and test with 3-5GB at least, by changing the block size in this case
> you can increase the number of maps and scale your app. (Round the block
> size to a multiple of 512)
> i attached some results from my experiments with KNN scaling on 4 nodes
> cluster, it didn't scale linearly but it wasn't too bad, it sometimes depends
> on the algorithm, with other algos, like PCA/SVD the scaling was linear
> also note that sometimes the bottleneck is in the Reduce step, i added a
> Combiner between mapper and reducer in my code and it affected scalability
> significantly.
>
> let me know if you have any questions
> Alex Kamil
> ak2834@columbia.edu
>
>
> On Wed, Jan 13, 2010 at 5:35 PM, Dhruba Borthakur <dh...@gmail.com>wrote:
>
>> Another thing to observe is the rate of IO on the datanodes. Maybe u can
>> do a sar/iostat on the datanodes and see if the datanode devices show an
>> increase in activity while inserting the last lot of blocks. One possibility
>> is that the OS cache on the datanodes cached most of the data from the first
>> few runs, but when more and more data started arriving on the datanode it
>> triggered more flushing of OS buffers. (on the datanode).
>>
>> thanks,
>> dhruba
>>
>>
>>
>> On Wed, Jan 13, 2010 at 2:18 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hey Zlatin,
>>>
>>> Thanks for the explanation and the additional data. I'm a bit busy today
>>> but will try to go through the data and reproduce the results later this
>>> week.
>>>
>>> -Todd
>>>
>>>
>>> On Wed, Jan 13, 2010 at 2:07 PM, <Zl...@barclayscapital.com>wrote:
>>>
>>>>  Todd,
>>>>
>>>> I used a shell script that launched 8 instances of the bin/hadoop fs
>>>> -put utility.  After all 8 processes were done and I verified through the web
>>>> ui that the files were inserted, I re-launched the script manually again.
>>>> That is why you'll notice that in the metrics there are two short periods
>>>> without any activity (I edited those out from the graph).  There were
>>>> occasional NotReplicatedYet exceptions in the logs of those processes, but
>>>> they were occurring at constant rate.
>>>>
>>>> I did not run a profiler, but that will eventually be the next step.
>>>> I'm attaching the metrics from the namenode and one of the datanodes from
>>>> the experiment with 4 datanodes.  They were recorded every 10 seconds.  Heap
>>>> size for all processes is 2GB, and while there was occasional CPU usage on
>>>> the Namenode it was never 100%.  (and there are plenty of cores).
>>>>
>>>> Ultimately the block size will be much larger than the default as the
>>>> total data will be in the 2^(well over 50) range.  With this test I am
>>>> trying to determine if there are any bottlenecks at the NameNode  component.
>>>>
>>>> Best Regards,
>>>> Zlatin Balevsky
>>>>
>>>>  ------------------------------
>>>> *From:* Todd Lipcon [mailto:todd@cloudera.com]
>>>> *Sent:* Wednesday, January 13, 2010 4:34 PM
>>>> *To:* hdfs-user@hadoop.apache.org
>>>> *Subject:* Re: Exponential performance decay when inserting large
>>>> number of blocks
>>>>
>>>> Also, if you have the program you used to do the insertions, and could
>>>> attach it, I'd be interested in trying to replicate this on a test cluster.
>>>> If you can't redistribute it, I can start from scratch, but would be easier
>>>> to run yours.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>
>>>>> Hi Zlatin,
>>>>>
>>>>> This is a very interesting test you've run, and certainly not expected
>>>>> results. I know of many clusters happily chugging along with millions of
>>>>> blocks, so problems at 400K are very strange. By any chance were you able to
>>>>> collect profiling information from the NameNode while running this test?
>>>>>
>>>>> That said, I hope you've set the block size to 1KB for the purpose of
>>>>> this test and not because you expect to run that in production. Recommended
>>>>> block sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>>>>>
>>>>> Thanks
>>>>> -Todd
>>>>>
>>>>> On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com>wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> I am testing how HDFS scales with very large number of blocks.  I did
>>>>>> the following setup:
>>>>>>
>>>>>> Set the default blocks size to 1KB
>>>>>> Started 8 insert processes, each inserting a 16MB file
>>>>>> Repeated the insert 3 times, keeping the already inserted files in
>>>>>> HDFS
>>>>>> Repeated the entire experiment on one cluster with 4 and another with
>>>>>> 11
>>>>>> identical datanodes (allocated through HOD)
>>>>>>
>>>>>> Results:
>>>>>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The
>>>>>> second
>>>>>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>>>>>> cluster was marginally faster.
>>>>>>
>>>>>> Throughout this I was storing all available metrics.  There were no
>>>>>> signs of insufficient memory on any of the nodes; CPU usage and
>>>>>> garbage
>>>>>> collections were constant throughout.  If anyone is interested I can
>>>>>> provide the recorded metrics.  I've attached a chart that looks
>>>>>> clearly
>>>>>> logarithmic.
>>>>>>
>>>>>> Can anyone please point to what could be the bottleneck here?  I'm
>>>>>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>>>>>> blocks.
>>>>>>
>>>>>> Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
>>>>>> Zlatin Balevsky
>>>>>>
>>>>>> _______________________________________________
>>>>>>
>>>>>> This e-mail may contain information that is confidential, privileged
>>>>>> or otherwise protected from disclosure. If you are not an intended recipient
>>>>>> of this e-mail, do not duplicate or redistribute it by any means. Please
>>>>>> delete it and any attachments and notify the sender that you have received
>>>>>> it in error. Unless specifically indicated, this e-mail is not an offer to
>>>>>> buy or sell or a solicitation to buy or sell any securities, investment
>>>>>> products or other financial product or service, an official confirmation of
>>>>>> any transaction, or an official statement of Barclays. Any views or opinions
>>>>>> presented are solely those of the author and do not necessarily represent
>>>>>> those of Barclays. This e-mail is subject to terms available at the
>>>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>>>> Barclays you consent to the foregoing.  Barclays Capital is the investment
>>>>>> banking division of Barclays Bank PLC, a company registered in England
>>>>>> (number 1026167) with its registered office at 1 Churchill Place, London,
>>>>>> E14 5HP.  This email may relate to or be sent from other members of the
>>>>>> Barclays Group.
>>>>>> _______________________________________________
>>>>>>
>>>>>
>>>>>
>>>>  _______________________________________________
>>>>
>>>>
>>>>
>>>> This e-mail may contain information that is confidential, privileged or
>>>> otherwise protected from disclosure. If you are not an intended recipient of
>>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>>> it and any attachments and notify the sender that you have received it in
>>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>>> sell or a solicitation to buy or sell any securities, investment products or
>>>> other financial product or service, an official confirmation of any
>>>> transaction, or an official statement of Barclays. Any views or opinions
>>>> presented are solely those of the author and do not necessarily represent
>>>> those of Barclays. This e-mail is subject to terms available at the
>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>> Barclays you consent to the foregoing.  Barclays Capital is the
>>>> investment banking division of Barclays Bank PLC, a company registered in
>>>> England (number 1026167) with its registered office at 1 Churchill Place,
>>>> London, E14 5HP.  This email may relate to or be sent from other
>>>> members of the Barclays Group.**
>>>>
>>>> _______________________________________________
>>>>
>>>
>>>
>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
>>
>
>

Re: Exponential performance decay when inserting large number of blocks

Posted by alex kamil <al...@gmail.com>.
Zlatin,
I don't know the nature of the job that you are running, but it looks
like you are hitting an IO bottleneck; try to multiply the size of the input
data and test with 3-5GB at least. By changing the block size in this case
you can increase the number of maps and scale your app. (Round the block
size to a multiple of 512.)
I attached some results from my experiments with KNN scaling on a 4-node
cluster; it didn't scale linearly, but it wasn't too bad. It sometimes depends
on the algorithm; with other algos, like PCA/SVD, the scaling was linear.
Also note that sometimes the bottleneck is in the Reduce step; I added a
Combiner between mapper and reducer in my code and it affected scalability
significantly.

let me know if you have any questions
Alex Kamil
ak2834@columbia.edu

On Wed, Jan 13, 2010 at 5:35 PM, Dhruba Borthakur <dh...@gmail.com> wrote:

> Another thing to observe is the rate of IO on the datanodes. Maybe u can do
> a sar/iostat on the datanodes and see if the datanode devices show an
> increase in activity while inserting the last lot of blocks. One possibility
> is that the OS cache on the datanodes cached most of the data from the first
> few runs, but when more and more data started arriving on the datanode it
> triggered more flushing of OS buffers. (on the datanode).
>
> thanks,
> dhruba
>
>
>
> On Wed, Jan 13, 2010 at 2:18 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hey Zlatin,
>>
>> Thanks for the explanation and the additional data. I'm a bit busy today
>> but will try to go through the data and reproduce the results later this
>> week.
>>
>> -Todd
>>
>>
>> On Wed, Jan 13, 2010 at 2:07 PM, <Zl...@barclayscapital.com>wrote:
>>
>>>  Todd,
>>>
>>> I used a shell script that launched 8 instances of the bin/hadoop fs -put
>>> utility.  After all 8 processes were done and I verified through the web ui
>>> that the files were inserted, I re-launched the script manually again.  That
>>> is why you'll notice that in the metrics there are two short periods without
>>> any activity (I edited those out from the graph).  There were occasional
>>> NotReplicatedYet exceptions in the logs of those processes, but they were
>>> occurring at constant rate.
>>>
>>> I did not run a profiler, but that will eventually be the next step.  I'm
>>> attaching the metrics from the namenode and one of the datanodes from the
>>> experiment with 4 datanodes.  They were recorded every 10 seconds.  Heap
>>> size for all processes is 2GB, and while there was occasional CPU usage on
>>> the Namenode it was never 100%.  (and there are plenty of cores).
>>>
>>> Ultimately the block size will be much larger than the default as the
>>> total data will be in the 2^(well over 50) range.  With this test I am
>>> trying to determine if there are any bottlenecks at the NameNode  component.
>>>
>>> Best Regards,
>>> Zlatin Balevsky
>>>
>>>  ------------------------------
>>> *From:* Todd Lipcon [mailto:todd@cloudera.com]
>>> *Sent:* Wednesday, January 13, 2010 4:34 PM
>>> *To:* hdfs-user@hadoop.apache.org
>>> *Subject:* Re: Exponential performance decay when inserting large number
>>> of blocks
>>>
>>> Also, if you have the program you used to do the insertions, and could
>>> attach it, I'd be interested in trying to replicate this on a test cluster.
>>> If you can't redistribute it, I can start from scratch, but would be easier
>>> to run yours.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> Hi Zlatin,
>>>>
>>>> This is a very interesting test you've run, and certainly not expected
>>>> results. I know of many clusters happily chugging along with millions of
>>>> blocks, so problems at 400K are very strange. By any chance were you able to
>>>> collect profiling information from the NameNode while running this test?
>>>>
>>>> That said, I hope you've set the block size to 1KB for the purpose of
>>>> this test and not because you expect to run that in production. Recommended
>>>> block sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com>wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> I am testing how HDFS scales with very large number of blocks.  I did
>>>>> the following setup:
>>>>>
>>>>> Set the default blocks size to 1KB
>>>>> Started 8 insert processes, each inserting a 16MB file
>>>>> Repeated the insert 3 times, keeping the already inserted files in HDFS
>>>>> Repeated the entire experiment on one cluster with 4 and another with
>>>>> 11
>>>>> identical datanodes (allocated through HOD)
>>>>>
>>>>> Results:
>>>>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The second
>>>>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>>>>> cluster was marginally faster.
>>>>>
>>>>> Throughout this I was storing all available metrics.  There were no
>>>>> signs of insufficient memory on any of the nodes; CPU usage and garbage
>>>>> collections were constant throughout.  If anyone is interested I can
>>>>> provide the recorded metrics.  I've attached a chart that looks clearly
>>>>> logarithmic.
>>>>>
>>>>> Can anyone please point to what could be the bottleneck here?  I'm
>>>>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>>>>> blocks.
>>>>>
>>>>> Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
>>>>> Zlatin Balevsky
>>>>>
>>>>> _______________________________________________
>>>>>
>>>>> This e-mail may contain information that is confidential, privileged or
>>>>> otherwise protected from disclosure. If you are not an intended recipient of
>>>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>>>> it and any attachments and notify the sender that you have received it in
>>>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>>>> sell or a solicitation to buy or sell any securities, investment products or
>>>>> other financial product or service, an official confirmation of any
>>>>> transaction, or an official statement of Barclays. Any views or opinions
>>>>> presented are solely those of the author and do not necessarily represent
>>>>> those of Barclays. This e-mail is subject to terms available at the
>>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>>> Barclays you consent to the foregoing.  Barclays Capital is the investment
>>>>> banking division of Barclays Bank PLC, a company registered in England
>>>>> (number 1026167) with its registered office at 1 Churchill Place, London,
>>>>> E14 5HP.  This email may relate to or be sent from other members of the
>>>>> Barclays Group.
>>>>> _______________________________________________
>>>>>
>>>>
>>>>
>>>  _______________________________________________
>>>
>>>
>>>
>>> This e-mail may contain information that is confidential, privileged or
>>> otherwise protected from disclosure. If you are not an intended recipient of
>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>> it and any attachments and notify the sender that you have received it in
>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>> sell or a solicitation to buy or sell any securities, investment products or
>>> other financial product or service, an official confirmation of any
>>> transaction, or an official statement of Barclays. Any views or opinions
>>> presented are solely those of the author and do not necessarily represent
>>> those of Barclays. This e-mail is subject to terms available at the
>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>> Barclays you consent to the foregoing.  Barclays Capital is the
>>> investment banking division of Barclays Bank PLC, a company registered in
>>> England (number 1026167) with its registered office at 1 Churchill Place,
>>> London, E14 5HP.  This email may relate to or be sent from other members
>>> of the Barclays Group.**
>>>
>>> _______________________________________________
>>>
>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>

Re: Exponential performance decay when inserting large number of blocks

Posted by Dhruba Borthakur <dh...@gmail.com>.
Another thing to observe is the rate of IO on the datanodes. Maybe you can run
sar/iostat on the datanodes and see if the datanode devices show an
increase in activity while inserting the last lot of blocks. One possibility
is that the OS cache on the datanodes cached most of the data from the first
few runs, but when more and more data started arriving on the datanode it
triggered more flushing of OS buffers (on the datanode).
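
For example, left running on one datanode for the duration of an insert round
(10-second samples, to line up with the recorded metrics):

    # extended per-device statistics; watch await and %util across the rounds
    iostat -x 10

    # or the sysstat equivalent for block devices
    sar -d 10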

thanks,
dhruba


On Wed, Jan 13, 2010 at 2:18 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Zlatin,
>
> Thanks for the explanation and the additional data. I'm a bit busy today
> but will try to go through the data and reproduce the results later this
> week.
>
> -Todd
>
>
> On Wed, Jan 13, 2010 at 2:07 PM, <Zl...@barclayscapital.com>wrote:
>
>>  Todd,
>>
>> I used a shell script that launched 8 instances of the bin/hadoop fs -put
>> utility.  After all 8 processes were done and I verified through the web ui
>> that the files were inserted, I re-launched the script manually again.  That
>> is why you'll notice that in the metrics there are two short periods without
>> any activity (I edited those out from the graph).  There were occasional
>> NotReplicatedYet exceptions in the logs of those processes, but they were
>> occurring at constant rate.
>>
>> I did not run a profiler, but that will eventually be the next step.  I'm
>> attaching the metrics from the namenode and one of the datanodes from the
>> experiment with 4 datanodes.  They were recorded every 10 seconds.  Heap
>> size for all processes is 2GB, and while there was occasional CPU usage on
>> the Namenode it was never 100%.  (and there are plenty of cores).
>>
>> Ultimately the block size will be much larger than the default as the
>> total data will be in the 2^(well over 50) range.  With this test I am
>> trying to determine if there are any bottlenecks at the NameNode  component.
>>
>> Best Regards,
>> Zlatin Balevsky
>>
>>  ------------------------------
>> *From:* Todd Lipcon [mailto:todd@cloudera.com]
>> *Sent:* Wednesday, January 13, 2010 4:34 PM
>> *To:* hdfs-user@hadoop.apache.org
>> *Subject:* Re: Exponential performance decay when inserting large number
>> of blocks
>>
>> Also, if you have the program you used to do the insertions, and could
>> attach it, I'd be interested in trying to replicate this on a test cluster.
>> If you can't redistribute it, I can start from scratch, but would be easier
>> to run yours.
>>
>> Thanks
>> -Todd
>>
>> On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hi Zlatin,
>>>
>>> This is a very interesting test you've run, and certainly not expected
>>> results. I know of many clusters happily chugging along with millions of
>>> blocks, so problems at 400K are very strange. By any chance were you able to
>>> collect profiling information from the NameNode while running this test?
>>>
>>> That said, I hope you've set the block size to 1KB for the purpose of
>>> this test and not because you expect to run that in production. Recommended
>>> block sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com>wrote:
>>>
>>>> Greetings,
>>>>
>>>> I am testing how HDFS scales with very large number of blocks.  I did
>>>> the following setup:
>>>>
>>>> Set the default blocks size to 1KB
>>>> Started 8 insert processes, each inserting a 16MB file
>>>> Repeated the insert 3 times, keeping the already inserted files in HDFS
>>>> Repeated the entire experiment on one cluster with 4 and another with 11
>>>> identical datanodes (allocated through HOD)
>>>>
>>>> Results:
>>>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The second
>>>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>>>> cluster was marginally faster.
>>>>
>>>> Throughout this I was storing all available metrics.  There were no
>>>> signs of insufficient memory on any of the nodes; CPU usage and garbage
>>>> collections were constant throughout.  If anyone is interested I can
>>>> provide the recorded metrics.  I've attached a chart that looks clearly
>>>> logarithmic.
>>>>
>>>> Can anyone please point to what could be the bottleneck here?  I'm
>>>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>>>> blocks.
>>>>
>>>> Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
>>>> Zlatin Balevsky
>>>>
>>>> _______________________________________________
>>>>
>>>> This e-mail may contain information that is confidential, privileged or
>>>> otherwise protected from disclosure. If you are not an intended recipient of
>>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>>> it and any attachments and notify the sender that you have received it in
>>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>>> sell or a solicitation to buy or sell any securities, investment products or
>>>> other financial product or service, an official confirmation of any
>>>> transaction, or an official statement of Barclays. Any views or opinions
>>>> presented are solely those of the author and do not necessarily represent
>>>> those of Barclays. This e-mail is subject to terms available at the
>>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>>> Barclays you consent to the foregoing.  Barclays Capital is the investment
>>>> banking division of Barclays Bank PLC, a company registered in England
>>>> (number 1026167) with its registered office at 1 Churchill Place, London,
>>>> E14 5HP.  This email may relate to or be sent from other members of the
>>>> Barclays Group.
>>>> _______________________________________________
>>>>
>>>
>>>
>>  _______________________________________________
>>
>>
>>
>> This e-mail may contain information that is confidential, privileged or
>> otherwise protected from disclosure. If you are not an intended recipient of
>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>> it and any attachments and notify the sender that you have received it in
>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>> sell or a solicitation to buy or sell any securities, investment products or
>> other financial product or service, an official confirmation of any
>> transaction, or an official statement of Barclays. Any views or opinions
>> presented are solely those of the author and do not necessarily represent
>> those of Barclays. This e-mail is subject to terms available at the
>> following link: www.barcap.com/emaildisclaimer. By messaging with
>> Barclays you consent to the foregoing.  Barclays Capital is the
>> investment banking division of Barclays Bank PLC, a company registered in
>> England (number 1026167) with its registered office at 1 Churchill Place,
>> London, E14 5HP.  This email may relate to or be sent from other members
>> of the Barclays Group.**
>>
>> _______________________________________________
>>
>
>


-- 
Connect to me at http://www.facebook.com/dhruba

Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Zlatin,

Thanks for the explanation and the additional data. I'm a bit busy today but
will try to go through the data and reproduce the results later this week.

-Todd

On Wed, Jan 13, 2010 at 2:07 PM, <Zl...@barclayscapital.com>wrote:

>  Todd,
>
> I used a shell script that launched 8 instances of the bin/hadoop fs -put
> utility.  After all 8 processes were done and I verified through the web ui
> that the files were inserted, I re-launched the script manually again.  That
> is why you'll notice that in the metrics there are two short periods without
> any activity (I edited those out from the graph).  There were occasional
> NotReplicatedYet exceptions in the logs of those processes, but they were
> occurring at constant rate.
>
> I did not run a profiler, but that will eventually be the next step.  I'm
> attaching the metrics from the namenode and one of the datanodes from the
> experiment with 4 datanodes.  They were recorded every 10 seconds.  Heap
> size for all processes is 2GB, and while there was occasional CPU usage on
> the Namenode it was never 100%.  (and there are plenty of cores).
>
> Ultimately the block size will be much larger than the default as the total
> data will be in the 2^(well over 50) range.  With this test I am trying to
> determine if there are any bottlenecks at the NameNode  component.
>
> Best Regards,
> Zlatin Balevsky
>
>  ------------------------------
> *From:* Todd Lipcon [mailto:todd@cloudera.com]
> *Sent:* Wednesday, January 13, 2010 4:34 PM
> *To:* hdfs-user@hadoop.apache.org
> *Subject:* Re: Exponential performance decay when inserting large number
> of blocks
>
> Also, if you have the program you used to do the insertions, and could
> attach it, I'd be interested in trying to replicate this on a test cluster.
> If you can't redistribute it, I can start from scratch, but would be easier
> to run yours.
>
> Thanks
> -Todd
>
> On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hi Zlatin,
>>
>> This is a very interesting test you've run, and certainly not expected
>> results. I know of many clusters happily chugging along with millions of
>> blocks, so problems at 400K are very strange. By any chance were you able to
>> collect profiling information from the NameNode while running this test?
>>
>> That said, I hope you've set the block size to 1KB for the purpose of this
>> test and not because you expect to run that in production. Recommended block
>> sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>>
>> Thanks
>> -Todd
>>
>> On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com>wrote:
>>
>>> Greetings,
>>>
>>> I am testing how HDFS scales with very large number of blocks.  I did
>>> the following setup:
>>>
>>> Set the default blocks size to 1KB
>>> Started 8 insert processes, each inserting a 16MB file
>>> Repeated the insert 3 times, keeping the already inserted files in HDFS
>>> Repeated the entire experiment on one cluster with 4 and another with 11
>>> identical datanodes (allocated through HOD)
>>>
>>> Results:
>>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The second
>>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>>> cluster was marginally faster.
>>>
>>> Throughout this I was storing all available metrics.  There were no
>>> signs of insufficient memory on any of the nodes; CPU usage and garbage
>>> collections were constant throughout.  If anyone is interested I can
>>> provide the recorded metrics.  I've attached a chart that looks clearly
>>> logarithmic.
>>>
>>> Can anyone please point to what could be the bottleneck here?  I'm
>>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>>> blocks.
>>>
>>> Best Regards,  <<insertion_rate_4_and_11_datanodes.JPG>>
>>> Zlatin Balevsky
>>>
>>> _______________________________________________
>>>
>>> This e-mail may contain information that is confidential, privileged or
>>> otherwise protected from disclosure. If you are not an intended recipient of
>>> this e-mail, do not duplicate or redistribute it by any means. Please delete
>>> it and any attachments and notify the sender that you have received it in
>>> error. Unless specifically indicated, this e-mail is not an offer to buy or
>>> sell or a solicitation to buy or sell any securities, investment products or
>>> other financial product or service, an official confirmation of any
>>> transaction, or an official statement of Barclays. Any views or opinions
>>> presented are solely those of the author and do not necessarily represent
>>> those of Barclays. This e-mail is subject to terms available at the
>>> following link: www.barcap.com/emaildisclaimer. By messaging with
>>> Barclays you consent to the foregoing.  Barclays Capital is the investment
>>> banking division of Barclays Bank PLC, a company registered in England
>>> (number 1026167) with its registered office at 1 Churchill Place, London,
>>> E14 5HP.  This email may relate to or be sent from other members of the
>>> Barclays Group.
>>> _______________________________________________
>>>
>>
>>
>  _______________________________________________
>
>
>
> This e-mail may contain information that is confidential, privileged or
> otherwise protected from disclosure. If you are not an intended recipient of
> this e-mail, do not duplicate or redistribute it by any means. Please delete
> it and any attachments and notify the sender that you have received it in
> error. Unless specifically indicated, this e-mail is not an offer to buy or
> sell or a solicitation to buy or sell any securities, investment products or
> other financial product or service, an official confirmation of any
> transaction, or an official statement of Barclays. Any views or opinions
> presented are solely those of the author and do not necessarily represent
> those of Barclays. This e-mail is subject to terms available at the
> following link: www.barcap.com/emaildisclaimer. By messaging with Barclays
> you consent to the foregoing.  Barclays Capital is the investment banking
> division of Barclays Bank PLC, a company registered in England (number
> 1026167) with its registered office at 1 Churchill Place, London, E14 5HP.
> This email may relate to or be sent from other members of the Barclays
> Group.**
>
> _______________________________________________
>

RE: Exponential performance decay when inserting large number of blocks

Posted by Zl...@barclayscapital.com.
Todd,
 
I used a shell script that launched 8 instances of the bin/hadoop fs
-put utility.  After all 8 processes were done and I verified through the
web UI that the files were inserted, I re-launched the script manually
again.  That is why you'll notice that in the metrics there are two
short periods without any activity (I edited those out from the graph).
There were occasional NotReplicatedYet exceptions in the logs of those
processes, but they were occurring at a constant rate.
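
Roughly, each round amounted to something like the sketch below; the local
file names and HDFS paths are placeholders rather than the original script,
and dfs.block.size was already set to 1KB in the test's configuration:

    # one round: 8 concurrent puts of 16MB local files
    ROUND=$1
    for i in $(seq 1 8); do
      bin/hadoop fs -put /local/bench/file16MB_$i /bench/round${ROUND}/file_$i &
    done
    wait   # the next round was launched by hand only after all 8 puts finished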
 
I did not run a profiler, but that will eventually be the next step.
I'm attaching the metrics from the namenode and one of the datanodes
from the experiment with 4 datanodes.  They were recorded every 10
seconds.  Heap size for all processes is 2GB, and while there was
occasional CPU usage on the Namenode it was never 100%.  (and there are
plenty of cores).
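
For anyone who wants to capture the same data, a 10-second dump like that is
normally wired up through conf/hadoop-metrics.properties. A sketch; the
context and class names are the metrics-v1 ones from the 0.20 line, and the
output path is a placeholder:

    cat >> conf/hadoop-metrics.properties <<'EOF'
    # write DFS metrics to a local file every 10 seconds
    dfs.class=org.apache.hadoop.metrics.file.FileContext
    dfs.period=10
    dfs.fileName=/tmp/dfs_metrics.log
    EOF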
 
Ultimately the block size will be much larger than the default as the
total data will be in the 2^(well over 50) range.  With this test I am
trying to determine if there are any bottlenecks at the NameNode
component.
 
Best Regards,
Zlatin Balevsky
 
________________________________

From: Todd Lipcon [mailto:todd@cloudera.com] 
Sent: Wednesday, January 13, 2010 4:34 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number
of blocks


Also, if you have the program you used to do the insertions, and could
attach it, I'd be interested in trying to replicate this on a test
cluster. If you can't redistribute it, I can start from scratch, but
would be easier to run yours. 

Thanks
-Todd



Re: Exponential performance decay when inserting large number of blocks

Posted by Todd Lipcon <to...@cloudera.com>.
Also, if you have the program you used to do the insertions, and could
attach it, I'd be interested in trying to replicate this on a test cluster.
If you can't redistribute it, I can start from scratch, but would be easier
to run yours.

Thanks
-Todd

On Wed, Jan 13, 2010 at 1:31 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Zlatin,
>
> This is a very interesting test you've run, and certainly not expected
> results. I know of many clusters happily chugging along with millions of
> blocks, so problems at 400K are very strange. By any chance were you able to
> collect profiling information from the NameNode while running this test?
>
> That said, I hope you've set the block size to 1KB for the purpose of this
> test and not because you expect to run that in production. Recommended block
> sizes are at least 64MB and often 128MB or 256MB for larger clusters.
>
> Thanks
> -Todd
>
> On Wed, Jan 13, 2010 at 1:21 PM, <Zl...@barclayscapital.com>wrote:
>
>> Greetings,
>>
>> I am testing how HDFS scales with very large number of blocks.  I did
>> the following setup:
>>
>> Set the default blocks size to 1KB
>> Started 8 insert processes, each inserting a 16MB file
>> Repeated the insert 3 times, keeping the already inserted files in HDFS
>> Repeated the entire experiment on one cluster with 4 and another with 11
>> identical datanodes (allocated through HOD)
>>
>> Results:
>> The first 128MB (2^18 blocks) insert finished in 5 minutes.  The second
>> in 12 minutes.  The third didn't finish within 1 hour.  The 11-node
>> cluster was marginally faster.
>>
>> Throughout this I was storing all available metrics.  There were no
>> signs of insufficient memory on any of the nodes; CPU usage and garbage
>> collections were constant throughout.  If anyone is interested I can
>> provide the recorded metrics.  I've attached a chart that looks clearly
>> logarithmic.
>>
>> Can anyone please point to what could be the bottleneck here?  I'm
>> evaluating HDFS for usage scenarios requiring 2^(a lot more than 18)
>> blocks.
>>
>> Best Regards, <<insertion_rate_4_and_11_datanodes.JPG>>
>> Zlatin Balevsky
>>
>
>