Posted to mapreduce-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/07/02 00:46:32 UTC

intermediate results files

If my reducers are going to create results that are temporary in nature (consumed by the next processing stage) is it recommended to use a replication factor <3 to improve performance?
Thanks
john

Re: intermediate results files

Posted by Ravi Prakash <ra...@ymail.com>.

Hi John!
If your block is going to be replicated to three nodes, then under the default block placement policy two of the replicas will be on the same rack and the third will be on a different rack. Depending on the intra-rack and inter-rack network bandwidth available, writing with replication factor=3 may be almost as fast as writing a single replica or (more likely) slower. With replication factor=2, the default block placement puts the two replicas on different racks, so you wouldn't gain much. So you can either:
1. Choose replication factor = 1, or
2. Change the block placement policy so that even with replication factor=2 it chooses two nodes in the same rack.

HTH
Ravi
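
Editor's sketch: the placement behaviour described above can be tallied per block with a small model. This is not Hadoop code. The class and method names are hypothetical, and it assumes the writer runs on a DataNode so that the first replica is written locally without a network transfer. It only covers replication factors 1 to 3, the cases discussed in this thread.

```java
// Simplified model of the default placement described above; this is not Hadoop code.
// Assumption: the writer runs on a DataNode, so the first replica is written locally.
class ReplicaTransfers {
    // Copies of one block that must cross a rack boundary.
    static int crossRack(int replicationFactor) {
        check(replicationFactor);
        // RF=2 and RF=3 each place exactly one rack away from the writer's rack.
        return replicationFactor >= 2 ? 1 : 0;
    }

    // Copies of one block sent between two nodes in the same rack.
    static int intraRack(int replicationFactor) {
        check(replicationFactor);
        // Only with RF=3 do two replicas share a rack.
        return replicationFactor == 3 ? 1 : 0;
    }

    private static void check(int rf) {
        if (rf < 1 || rf > 3) {
            throw new IllegalArgumentException("model covers replication factors 1 to 3 only");
        }
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 3; rf++) {
            System.out.printf("RF=%d: cross-rack=%d, intra-rack=%d%n",
                    rf, crossRack(rf), intraRack(rf));
        }
    }
}
```

Under these assumptions, going from 3 to 2 removes only the intra-rack copy and keeps the cross-rack one, which is why the gain is expected to be small unless the placement policy is changed.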




________________________________
From: Devaraj k <de...@huawei.com>
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Sent: Tuesday, July 2, 2013 1:00 AM
Subject: RE: intermediate results files

If you are 100% sure that all the data nodes are available and healthy for that period of time, you can set the replication factor to 1, or to any value below 3.

Thanks
Devaraj k

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: 02 July 2013 04:40
To: user@hadoop.apache.org
Subject: RE: intermediate results files

I’ve seen some benchmarks where replication=1 runs at about 50MB/sec and replication=3 runs at about 33MB/sec, but I can’t seem to find that now.
John

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Monday, July 01, 2013 5:03 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

Hello John,

      IMHO, it doesn't matter. Your job will write the result just once. Replica creation is handled at the HDFS layer so it has nothing to do with your job. Your job will still be writing at the same speed.


Warm Regards,
Tariq
cloudfront.blogspot.com
 
On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net> wrote:
If my reducers are going to create results that are temporary in nature (consumed by the next processing stage) is it recommended to use a replication factor <3 to improve performance?  
Thanks
john

RE: intermediate results files

Posted by Devaraj k <de...@huawei.com>.

If you are 100% sure that all the data nodes are available and healthy for that period of time, you can set the replication factor to 1, or to any value below 3.

Thanks
Devaraj k

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: 02 July 2013 04:40
To: user@hadoop.apache.org
Subject: RE: intermediate results files

I've seen some benchmarks where replication=1 runs at about 50MB/sec and replication=3 runs at about 33MB/sec, but I can't seem to find that now.
John

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Monday, July 01, 2013 5:03 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

Hello John,

      IMHO, it doesn't matter. Your job will write the result just once. Replica creation is handled at the HDFS layer so it has nothing to do with your job. Your job will still be writing at the same speed.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net> wrote:
If my reducers are going to create results that are temporary in nature (consumed by the next processing stage) is it recommended to use a replication factor <3 to improve performance?
Thanks
john

RE: intermediate results files

Posted by John Lilley <jo...@redpoint.net>.

Replication also has downstream effects: it puts pressure on the available network bandwidth and disk I/O bandwidth when the cluster is loaded.
john
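
Editor's sketch: the extra load mentioned above can be put into rough numbers. The class and method names are hypothetical, and the network figure assumes the writer runs on a DataNode so that the first replica is a local disk write. Every further replica is then counted as one network transfer plus one disk write.

```java
// Rough estimate of the cluster-wide I/O caused by writing replicated output.
// Assumption: the first replica is written on the writer's own node, so only
// the remaining (replicationFactor - 1) copies travel over the network.
class ReplicationLoad {
    // Total bytes written to disk across the cluster.
    static long diskBytes(long outputBytes, int replicationFactor) {
        return outputBytes * replicationFactor;
    }

    // Total bytes sent over the network to create the additional replicas.
    static long networkBytes(long outputBytes, int replicationFactor) {
        return outputBytes * (replicationFactor - 1);
    }

    public static void main(String[] args) {
        long output = 100L * 1024 * 1024 * 1024; // hypothetical 100 GiB of reducer output
        for (int rf = 1; rf <= 3; rf++) {
            System.out.printf("RF=%d: disk=%d bytes, network=%d bytes%n",
                    rf, diskBytes(output, rf), networkBytes(output, rf));
        }
    }
}
```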

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Monday, July 01, 2013 6:35 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

I see. This difference is because of the fact that the next block of data will not be written to HDFS until the previous block was successfully written to 'all' the DNs selected for replication. This implies that higher RF means more time for the completion of a block write.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Tue, Jul 2, 2013 at 4:39 AM, John Lilley <jo...@redpoint.net> wrote:
I've seen some benchmarks where replication=1 runs at about 50MB/sec and replication=3 runs at about 33MB/sec, but I can't seem to find that now.
John

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Monday, July 01, 2013 5:03 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

Hello John,

      IMHO, it doesn't matter. Your job will write the result just once. Replica creation is handled at the HDFS layer so it has nothing to do with your job. Your job will still be writing at the same speed.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net> wrote:
If my reducers are going to create results that are temporary in nature (consumed by the next processing stage) is it recommended to use a replication factor <3 to improve performance?
Thanks
john

Re: intermediate results files

Posted by Mohammad Tariq <do...@gmail.com>.

I see. This difference is because of the fact that the next block of data
will not be written to HDFS until the previous block was successfully
written to 'all' the DNs selected for replication. This implies that higher
RF means more time for the completion of a block write.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Tue, Jul 2, 2013 at 4:39 AM, John Lilley <jo...@redpoint.net> wrote:

> I’ve seen some benchmarks where replication=1 runs at about 50MB/sec and
> replication=3 runs at about 33MB/sec, but I can’t seem to find that now.
>
> John
>
> From: Mohammad Tariq [mailto:dontariq@gmail.com]
> Sent: Monday, July 01, 2013 5:03 PM
> To: user@hadoop.apache.org
> Subject: Re: intermediate results files
>
> Hello John,
>
>       IMHO, it doesn't matter. Your job will write the result just once.
> Replica creation is handled at the HDFS layer so it has nothing to do with
> your job. Your job will still be writing at the same speed.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
> On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net>
> wrote:
>
> If my reducers are going to create results that are temporary in nature
> (consumed by the next processing stage) is it recommended to use a
> replication factor <3 to improve performance?
>
> Thanks
> john

RE: intermediate results files

Posted by John Lilley <jo...@redpoint.net>.

I've seen some benchmarks where replication=1 runs at about 50MB/sec and replication=3 runs at about 33MB/sec, but I can't seem to find that now.
John
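
Editor's sketch: to see what the recalled figures would mean for a job, the arithmetic below converts a write rate into elapsed time. The 50 and 33 MB/sec rates are the unverified numbers quoted above; the output size and the class and method names are hypothetical.

```java
// Converts a sustained write rate into elapsed time for a given output size.
// The rates used in main are the recalled benchmark figures from this thread,
// not measurements; the 10,000 MB output size is made up for illustration.
class WriteTime {
    static double seconds(double megabytes, double megabytesPerSecond) {
        if (megabytesPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return megabytes / megabytesPerSecond;
    }

    public static void main(String[] args) {
        double outputMb = 10_000;
        double rf1 = seconds(outputMb, 50); // replication=1
        double rf3 = seconds(outputMb, 33); // replication=3
        System.out.printf("replication=1: %.0f s%n", rf1);
        System.out.printf("replication=3: %.0f s%n", rf3);
        System.out.printf("slowdown: %.2fx%n", rf3 / rf1);
    }
}
```

If those rates held, the replication=3 write would take roughly one and a half times as long as the replication=1 write for the same output.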

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Monday, July 01, 2013 5:03 PM
To: user@hadoop.apache.org
Subject: Re: intermediate results files

Hello John,

      IMHO, it doesn't matter. Your job will write the result just once. Replica creation is handled at the HDFS layer so it has nothing to do with your job. Your job will still be writing at the same speed.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net> wrote:
If my reducers are going to create results that are temporary in nature (consumed by the next processing stage) is it recommended to use a replication factor <3 to improve performance?
Thanks
john

Re: intermediate results files

Posted by Mohammad Tariq <do...@gmail.com>.

Hello John,

      IMHO, it doesn't matter. Your job will write the result just once.
Replica creation is handled at the HDFS layer so it has nothing to do with
your job. Your job will still be writing at the same speed.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Tue, Jul 2, 2013 at 4:16 AM, John Lilley <jo...@redpoint.net> wrote:

> If my reducers are going to create results that are temporary in nature
> (consumed by the next processing stage) is it recommended to use a
> replication factor <3 to improve performance?
>
> Thanks
> john
