Posted to user@hadoop.apache.org by EdwardKing <zh...@neusoft.com> on 2014/02/19 10:02:55 UTC

How to keep data consistency?

Hadoop 2.2.0, two computers: one is the master, the other is node1. I want to understand the following scenario:

If node1 goes down for some reason, but I don't know that node1 isn't working, and I then use a hadoop command to put a file, such as:
$ hadoop fs -put graph.txt graphin/graph.txt

I understand the graph.txt file will be recorded on the master machine, but node1 will not contain this file. After some time node1 is repaired, and then there will be an inconsistency because of the graph.txt file. How can I achieve consistency between the master machine and node1? Thanks.


---------------------------------------------------------------------------------------------------
Confidentiality Notice: The information contained in this e-mail and any accompanying attachment(s) 
is intended only for the use of the intended recipient and may be confidential and/or privileged of 
Neusoft Corporation, its subsidiaries and/or its affiliates. If any reader of this communication is 
not the intended recipient, unauthorized use, forwarding, printing, storing, disclosure or copying 
is strictly prohibited, and may be unlawful. If you have received this communication in error, please 
immediately notify the sender by return e-mail, and delete the original message and all copies from 
your system. Thank you. 
---------------------------------------------------------------------------------------------------

Re: How to keep data consistency?

Posted by Devin Suiter RDX <ds...@rdx.com>.
Edward,

It doesn't seem like your "hadoop fs -put ..." command will even complete -
the master never receives the file itself at any point. What the master does
is tell the client which DataNode to connect to, after first checking that
node1 is in a state where it can receive data to be written, which depends on
several other daemons being available and several successful internal RPC
calls. A file never gets written to the master for storage - the only thing
the master keeps is metadata: a transaction log recording that client A sent
a request to -put the file graph.txt into the HDFS location
hdfs://user/$submitterusername/graphin/, and whether that request succeeded.
A separate process (the periodic block report) has node1 tell the master what
files, or pieces of files, it holds, and that record is also kept as
metadata - but no file data ever goes to the master for storage.

So if node1 is down, you'll know the moment you try to -put something, since
it's your only DataNode: the master will see there is no live node to place
the data on and will tell you, in effect, "I can't put that in HDFS because
no storage nodes are alive" - expressed as Java exceptions rather than plain
language.
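The metadata-versus-data split described here can be sketched with a toy model (plain Python, not Hadoop code; every class, method, and name below is illustrative, not part of any real Hadoop API):

```python
# Toy model of the HDFS split: the master (NameNode) keeps only metadata,
# while DataNodes hold the actual bytes. Purely illustrative - not real HDFS.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blocks = {}  # path -> bytes: the ONLY place file data lives

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.edit_log = []  # metadata only: what was requested, and where it went

    def put(self, path, data):
        live = [d for d in self.datanodes if d.alive]
        if not live:
            # Nowhere to write the bytes; real HDFS surfaces this
            # as a Java exception rather than a plain message.
            self.edit_log.append((path, "FAILED"))
            raise IOError("could not place block: 0 datanodes available")
        target = live[0]
        target.blocks[path] = data                  # bytes go to the DataNode...
        self.edit_log.append((path, target.name))   # ...master records only metadata

node1 = DataNode("node1")
master = NameNode([node1])

master.put("graphin/graph.txt", b"graph data")
assert node1.blocks["graphin/graph.txt"] == b"graph data"

node1.alive = False  # node1 goes down
try:
    master.put("graphin/other.txt", b"more data")
except IOError as e:
    print("put failed:", e)
```

Note that after the failed put, the master's edit log records the failed request, but no data exists anywhere - which is why there is nothing for node1 to be "inconsistent" with when it comes back.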

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Feb 19, 2014 at 4:10 AM, Sergey Murylev <se...@gmail.com> wrote:

>  Hi Edward,
>
> You can't achieve data consistency on your cluster configuration. To do
> this you need at least 3 data nodes and enabled replication with level 3 (
> dfs.replication property in hdfs-site.xml).
>


Re: How to keep data consistency?

Posted by Sergey Murylev <se...@gmail.com>.
Hi Edward,

You can't achieve data consistency with your cluster configuration. To do
this you need at least 3 DataNodes and replication enabled with a factor of
3 (the dfs.replication property in hdfs-site.xml).
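For concreteness, that property would look something like this in hdfs-site.xml (3 also happens to be stock Hadoop's default replication factor):

```xml
<!-- hdfs-site.xml: replication factor applied to new files; 3 is the stock default -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```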


