
Posted to hdfs-user@hadoop.apache.org by "Zhenhua (Gerald) Guo" <je...@gmail.com> on 2012/01/26 19:57:28 UTC

Replication is done synchronously or asynchronously?

I have two questions regarding creation of replicas.
- When a user uploads a file to HDFS, does the call return as soon as
the first replica is created, or does the client need to wait until
all replicas are created?
- When the output of MapReduce jobs is written to HDFS (by reduce
tasks), does the write return when the first replica is created, or
does it wait until all replicas are created?

Thanks

Gerald

Re: Replication is done synchronously or asynchronously?

Posted by "Zhenhua (Gerald) Guo" <je...@gmail.com>.

Thanks a lot!  Your reply thoroughly cleared up my confusion.

Gerald

On Fri, Jan 27, 2012 at 1:02 AM, Harsh J <ha...@cloudera.com> wrote:
> Yes you're correct.
>
> Also note that sometimes the request may be for 3 replicas but
> NameNode may only be able to grant lesser cause remaining DNs are
> full/unreachable/loaded-with-threads, in which case write will work
> with just the lesser amount of pipeline size, so long as its >=
> dfs.replication.min.
>
> If it gets 0 assignments when requesting for a write, it runs into
> this: wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F
>
> On Fri, Jan 27, 2012 at 4:53 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
>> Thanks, Harsh J.  Your answer is quite helpful!
>> If I understand right, writes wait until all replicas are created if
>> there is no error during the replication process.  If there is any
>> error in the replication pipeline, dfs.replication.min comes into play
>> .  Is my understanding correct?
>>
>> Gerald
>>
>> On Thu, Jan 26, 2012 at 4:07 PM, Harsh J <ha...@cloudera.com> wrote:
>>> Hi,
>>>
>>> On Fri, Jan 27, 2012 at 12:27 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
>>>> I have two questions regarding creation of replicas.
>>>> - When a user uploads a file to HDFS, it returns whenever the first
>>>> replica is created? or the client needs wait until all replicas are
>>>> created?
>>>> - When the output of MapReduce jobs is written to HDFS (by reduce
>>>> tasks), the writing of output returns when the first replica is
>>>> created? or wait until all replicas are created?
>>>
>>> Both questions are the same as both do the same form of DFS write.
>>>
>>> Writes are synchronous and replication is pipelined, presently in Apache Hadoop.
>>>
>>> But a write will succeed if at least 1 replica was written (controlled
>>> via dfs.replication.min -- pipeline can lose DNs out of errors, or can
>>> get fewer than requested DNs cause of load/space issues, but write
>>> will succeed if it at least gets one DN)
>>>
>>> Also see the whole conversation at
>>> http://search-hadoop.com/m/bF99W1ZmNqz1 for some more tidbits you
>>> might find interesting.
>>>
>>> --
>>> Harsh J
>>> Customer Ops. Engineer, Cloudera
>
>
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera

Re: Replication is done synchronously or asynchronously?

Posted by Harsh J <ha...@cloudera.com>.

Yes you're correct.

Also note that sometimes the request may be for 3 replicas but the
NameNode may only be able to grant fewer, because the remaining DNs are
full, unreachable, or loaded with threads. In that case the write will
work with the smaller pipeline, so long as its size is >=
dfs.replication.min.

If it gets 0 assignments when requesting a write, it runs into
this: wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F
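
To make the allocation rule above concrete, here is a minimal Java sketch. It is a toy model, not Hadoop code: the class name BlockAllocationSketch, the method choosePipelineSize, and its parameters are made up for illustration, and the exception message is only modeled on the FAQ title linked above. It assumes the NameNode grants as many DNs as are usable, up to the requested count, and rejects the write when that number falls below dfs.replication.min.

```java
import java.io.IOException;

// Toy model of the allocation rule described above. None of these
// names come from the Hadoop codebase; they are illustrative only.
final class BlockAllocationSketch {

    /**
     * Returns the pipeline size a write would proceed with.
     *
     * requested       replicas asked for by the client (e.g. 3)
     * usableDataNodes DNs that are not full, unreachable, or overloaded
     * minReplication  stand-in for dfs.replication.min
     */
    static int choosePipelineSize(int requested, int usableDataNodes,
                                  int minReplication) throws IOException {
        // Grant as many DNs as possible, but never more than requested.
        int granted = Math.min(requested, usableDataNodes);
        if (granted < minReplication) {
            // Message wording modeled on the FAQ entry, not copied from Hadoop.
            throw new IOException("could only be replicated to " + granted
                    + " nodes, instead of " + minReplication);
        }
        return granted;
    }
}
```

With a minimum of 1, a request for 3 replicas when only 2 DNs are usable proceeds with a pipeline of 2, while a request that finds 0 usable DNs fails with the exception.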

On Fri, Jan 27, 2012 at 4:53 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
> Thanks, Harsh J.  Your answer is quite helpful!
> If I understand right, writes wait until all replicas are created if
> there is no error during the replication process.  If there is any
> error in the replication pipeline, dfs.replication.min comes into play
> .  Is my understanding correct?
>
> Gerald
>
> On Thu, Jan 26, 2012 at 4:07 PM, Harsh J <ha...@cloudera.com> wrote:
>> Hi,
>>
>> On Fri, Jan 27, 2012 at 12:27 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
>>> I have two questions regarding creation of replicas.
>>> - When a user uploads a file to HDFS, it returns whenever the first
>>> replica is created? or the client needs wait until all replicas are
>>> created?
>>> - When the output of MapReduce jobs is written to HDFS (by reduce
>>> tasks), the writing of output returns when the first replica is
>>> created? or wait until all replicas are created?
>>
>> Both questions are the same as both do the same form of DFS write.
>>
>> Writes are synchronous and replication is pipelined, presently in Apache Hadoop.
>>
>> But a write will succeed if at least 1 replica was written (controlled
>> via dfs.replication.min -- pipeline can lose DNs out of errors, or can
>> get fewer than requested DNs cause of load/space issues, but write
>> will succeed if it at least gets one DN)
>>
>> Also see the whole conversation at
>> http://search-hadoop.com/m/bF99W1ZmNqz1 for some more tidbits you
>> might find interesting.
>>
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera



-- 
Harsh J
Customer Ops. Engineer, Cloudera

Re: Replication is done synchronously or asynchronously?

Posted by "Zhenhua (Gerald) Guo" <je...@gmail.com>.

Thanks, Harsh J.  Your answer is quite helpful!
If I understand right, writes wait until all replicas are created if
there is no error during the replication process.  If there is any
error in the replication pipeline, dfs.replication.min comes into play.
Is my understanding correct?

Gerald

On Thu, Jan 26, 2012 at 4:07 PM, Harsh J <ha...@cloudera.com> wrote:
> Hi,
>
> On Fri, Jan 27, 2012 at 12:27 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
>> I have two questions regarding creation of replicas.
>> - When a user uploads a file to HDFS, it returns whenever the first
>> replica is created? or the client needs wait until all replicas are
>> created?
>> - When the output of MapReduce jobs is written to HDFS (by reduce
>> tasks), the writing of output returns when the first replica is
>> created? or wait until all replicas are created?
>
> Both questions are the same as both do the same form of DFS write.
>
> Writes are synchronous and replication is pipelined, presently in Apache Hadoop.
>
> But a write will succeed if at least 1 replica was written (controlled
> via dfs.replication.min -- pipeline can lose DNs out of errors, or can
> get fewer than requested DNs cause of load/space issues, but write
> will succeed if it at least gets one DN)
>
> Also see the whole conversation at
> http://search-hadoop.com/m/bF99W1ZmNqz1 for some more tidbits you
> might find interesting.
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera

Re: Replication is done synchronously or asynchronously?

Posted by Harsh J <ha...@cloudera.com>.

Hi,

On Fri, Jan 27, 2012 at 12:27 AM, Zhenhua (Gerald) Guo <je...@gmail.com> wrote:
> I have two questions regarding creation of replicas.
> - When a user uploads a file to HDFS, it returns whenever the first
> replica is created? or the client needs wait until all replicas are
> created?
> - When the output of MapReduce jobs is written to HDFS (by reduce
> tasks), the writing of output returns when the first replica is
> created? or wait until all replicas are created?

Both questions are the same, as both do the same form of DFS write.

Writes are synchronous and replication is pipelined, presently in Apache Hadoop.

But a write will succeed if at least 1 replica was written (controlled
via dfs.replication.min). The pipeline can lose DNs due to errors, or
can get fewer DNs than requested because of load/space issues, but the
write will succeed as long as it gets at least one DN.
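
The success rule can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the HDFS client: the class PipelineWriteSketch and the method writeSucceeds are hypothetical names, each boolean stands for whether one DN in the pipeline acknowledged the data, and minReplication stands in for dfs.replication.min.

```java
// Toy model of the rule described above. The names are illustrative
// and do not correspond to classes in the Hadoop codebase.
final class PipelineWriteSketch {

    /**
     * Returns true if the write counts as successful.
     *
     * minReplication stand-in for dfs.replication.min
     * acked          one flag per DN in the pipeline; false means the DN
     *                dropped out of the pipeline because of an error
     */
    static boolean writeSucceeds(int minReplication, boolean... acked) {
        int survivors = 0;
        for (boolean a : acked) {
            if (a) {
                survivors++;
            }
        }
        // The write succeeds when enough DNs in the pipeline survived.
        return survivors >= minReplication;
    }
}
```

For example, writeSucceeds(1, true, false, false) is true because one DN in a pipeline of three survived, while writeSucceeds(1, false, false, false) is false.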

Also see the whole conversation at
http://search-hadoop.com/m/bF99W1ZmNqz1 for some more tidbits you
might find interesting.

-- 
Harsh J
Customer Ops. Engineer, Cloudera