Posted to common-user@hadoop.apache.org by Sasha Dolgy <sa...@gmail.com> on 2009/05/17 16:55:19 UTC

proper method for writing files to hdfs

The following graphic outlines the architecture for HDFS:
http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif

If one is to write a client that adds data into HDFS, it needs to add it
through the Data Node.  Now, from the graphic I am to understand that the
client doesn't communicate with the NameNode, and only the Data Node.

In the examples I've seen and the playing I am doing, I am connecting to the
hdfs url as a configuration parameter before I create a file.  Is this the
incorrect way to create files in HDFS?

    Configuration config = new Configuration();
    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
    String path = "/tmp/i/am/a/path/to/a/file.name";
    Path hdfsPath = new Path(path);
    FileSystem fileSystem = FileSystem.get(config);
    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
    os.write("something".getBytes());
    os.close();

Should the client be connecting to a data node to create the file as
indicated in the graphic above?

If connecting to a data node is possible and suggested, where can I find
more details about this process?

Thanks in advance,
-sasha

-- 
Sasha Dolgy
sasha.dolgy@gmail.com

RE: proper method for writing files to hdfs

Posted by Bill Habermaas <bi...@habermaas.us>.
Hadoop writes data to the local filesystem; when the block size is reached it
is written into HDFS. Think of HDFS as a block management system rather than
a file system, even though the end result is a series of blocks that
constitute a file. You will not see the data in HDFS until the file is
closed - that is the reality of the implementation.  The only way to 'flush'
is what you have already discovered - closing the file. Flushing the data
frequently in HDFS can be a very expensive operation when you consider that it
will affect multiple nodes distributed over a network. I suspect that is why
it isn't there. I believe there is a JIRA somewhere to have 'sync' force the
data out to disk, but I do not know the number or what its status is. 
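
If you want to experiment anyway: the 0.19+ client API exposes a sync() call on
FSDataOutputStream. I can't vouch for its semantics - that is exactly what the
JIRA discussion is about - so treat this as a speculative sketch only ("os" is
the FSDataOutputStream from your snippet):

    // speculative sketch: "os" is the stream returned by your create() call
    os.write("2009-05-17T06:16:50Z host42 cpu-temp 58.0\n".getBytes());
    os.sync();   // asks HDFS to flush buffered packets; no hard durability guarantee here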

Assuming you are collecting data as an unending process, you might consider
closing the hdfs output at periodic intervals and/or collecting data locally
(with your intervening flushes) and then moving it into hdfs so it can get
processed by map/reduce. It is a prudent approach to minimize potential data
loss if the hdfs connection gets broken. 
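
Something along these lines, perhaps - just a sketch; the interval, path scheme
and readNextSensorLine() are made up:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: roll the HDFS output file at a fixed interval so that each
    // closed file becomes visible (and safe) in HDFS while the stream keeps going.
    public class RollingHdfsWriter {
        private static final long ROLL_INTERVAL_MS = 5 * 60 * 1000L;  // arbitrary

        public static void main(String[] args) throws IOException {
            Configuration config = new Configuration();
            config.set("fs.default.name", "hdfs://foo.bar.com:9000/");
            FileSystem fs = FileSystem.get(config);

            FSDataOutputStream out = null;
            long openedAt = 0;
            while (true) {
                long now = System.currentTimeMillis();
                if (out == null || now - openedAt > ROLL_INTERVAL_MS) {
                    if (out != null) {
                        out.close();                         // closing is what commits the data
                    }
                    out = fs.create(new Path("/tmp/sensors/readings." + now), false);
                    openedAt = now;
                }
                out.write(readNextSensorLine().getBytes());  // append the next reading
            }
        }

        private static String readNextSensorLine() {
            // placeholder for whatever actually feeds the stream
            return "2009-05-17T06:16:50Z host42 cpu-temp 58.0\n";
        }
    }

Each closed file can then be picked up by your map/reduce jobs (or concatenated
into bigger files later).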

Every implementation is different so you gotta be creative. :o)

Bill   

-----Original Message-----
From: Sasha Dolgy [mailto:sdolgy@gmail.com] 
Sent: Monday, May 18, 2009 9:50 AM
To: core-user@hadoop.apache.org
Subject: Re: proper method for writing files to hdfs

Ok, on the same page with that.

Going back to the original question.  In our scenario we are trying to
stream data into HDFS and despite the posts and hints I've been
reading, it's still tough to crack this nut and this is why I thought
(and thankfully I wasn't right) that we were going about this the
wrong way:

We open up a new file and get the FSDataOutputStream and start to
write data and flush as concurrent information comes in:

2009-05-17 06:16:50,921 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2451 to 2048 meta file offset to 23
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 3 to ack queue.
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 3
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 735 seqno 4 offsetInBlock 2048 lastPacketInBlock false
2009-05-17 06:16:51,111 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2518 to 2048 meta file offset to 23
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 4 to ack queue.
2009-05-17 06:16:51,112 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 4
2009-05-17 06:16:51,297 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 509 seqno 5 offsetInBlock 2560 lastPacketInBlock false
2009-05-17 06:16:51,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2771 to 2560 meta file offset to 27

The file gets bigger and bigger, but it is not committed to hdfs until
we close() the stream.  We've waited for the block size to go above
64k and even higher, and it never writes itself out to hdfs.  I've
seen the JIRA bug reports, etc.

Has no one done this?  Is it bad to stream data into it?  How do I
force it to flush the data to disk...

The POC is with environmental data every moment from multiple sources
for monitoring temperature in computers / facilities...

Suppose I'm just a little frustrated.  I see that hadoop is brilliant
for large sets of data that you already have or are happy to move onto
HDFS ...

-sd

> On Mon, May 18, 2009 at 2:09 PM, Bill Habermaas <bi...@habermaas.us> wrote:
>> Sasha,
>>
>> Connecting to the namenode is the proper way to establish the hdfs
>> connection.  Afterwards the Hadoop client handler that is called by your
>> code will go directly to the datanodes. There is no reason for you to
>> communicate directly with a datanode nor is there a way for you to even know
>> where the data nodes are located. That is all done by the Hadoop client code
>> and done silently under the covers by Hadoop itself.
>>
>> Bill
>>
>> -----Original Message-----
>> From: sdolgy@gmail.com [mailto:sdolgy@gmail.com] On Behalf Of Sasha Dolgy
>> Sent: Sunday, May 17, 2009 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: proper method for writing files to hdfs
>>
>> The following graphic outlines the architecture for HDFS:
>> http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif
>>
>> If one is to write a client that adds data into HDFS, it needs to add it
>> through the Data Node.  Now, from the graphic I am to understand that the
>> client doesn't communicate with the NameNode, and only the Data Node.
>>
>> In the examples I've seen and the playing I am doing, I am connecting to the
>> hdfs url as a configuration parameter before I create a file.  Is this the
>> incorrect way to create files in HDFS?
>>
>>    Configuration config = new Configuration();
>>    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
>>    String path = "/tmp/i/am/a/path/to/a/file.name";
>>    Path hdfsPath = new Path(path);
>>    FileSystem fileSystem = FileSystem.get(config);
>>    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
>>    os.write("something".getBytes());
>>    os.close();
>>
>> Should the client be connecting to a data node to create the file as
>> indicated in the graphic above?
>>
>> If connecting to a data node is possible and suggested, where can I find
>> more details about this process?
>>
>> Thanks in advance,
>> -sasha
>>
>> --
>> Sasha Dolgy
>> sasha.dolgy@gmail.com



Re: proper method for writing files to hdfs

Posted by Sasha Dolgy <sd...@gmail.com>.
Ok, on the same page with that.

Going back to the original question.  In our scenario we are trying to
stream data into HDFS and despite the posts and hints I've been
reading, it's still tough to crack this nut and this is why I thought
(and thankfully I wasn't right) that we were going about this the
wrong way:

We open up a new file and get the FSDataOutputStream and start to
write data and flush as concurrent information comes in:
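
Roughly, the writer looks like this (a trimmed sketch; fileSystem is the handle
from the snippet in my first mail, and nextReading() stands in for the real
sensor feed):

    // trimmed sketch of our writer loop; names are changed / made up
    FSDataOutputStream os = fileSystem.create(new Path("/tmp/sensors/current.log"), false);
    while (true) {
        String reading = nextReading();   // blocks until the next event arrives (made up)
        os.write(reading.getBytes());     // one line per reading
        os.flush();                       // we flush after every write
    }

While that runs, the datanode log shows the packets arriving and being acked: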

2009-05-17 06:16:50,921 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2451 to 2048 meta file offset to 23
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 3 to ack queue.
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 3
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 735 seqno 4 offsetInBlock 2048 lastPacketInBlock false
2009-05-17 06:16:51,111 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2518 to 2048 meta file offset to 23
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 4 to ack queue.
2009-05-17 06:16:51,112 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 4
2009-05-17 06:16:51,297 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 509 seqno 5 offsetInBlock 2560 lastPacketInBlock false
2009-05-17 06:16:51,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2771 to 2560 meta file offset to 27

The file gets bigger and bigger, but it is not committed to hdfs until
we close() the stream.  We've waited for the block size to go above
64k and even higher, and it never writes itself out to hdfs.  I've
seen the JIRA bug reports, etc.

Has no one done this?  Is it bad to stream data into it?  How do I
force it to flush the data to disk...

The POC is with environmental data every moment from multiple sources
for monitoring temperature in computers / facilities...

Suppose I'm just a little frustrated.  I see that hadoop is brilliant
for large sets of data that you already have or are happy to move onto
HDFS ...

-sd

> On Mon, May 18, 2009 at 2:09 PM, Bill Habermaas <bi...@habermaas.us> wrote:
>> Sasha,
>>
>> Connecting to the namenode is the proper way to establish the hdfs
>> connection.  Afterwards the Hadoop client handler that is called by your
>> code will go directly to the datanodes. There is no reason for you to
>> communicate directly with a datanode nor is there a way for you to even know
>> where the data nodes are located. That is all done by the Hadoop client code
>> and done silently under the covers by Hadoop itself.
>>
>> Bill
>>
>> -----Original Message-----
>> From: sdolgy@gmail.com [mailto:sdolgy@gmail.com] On Behalf Of Sasha Dolgy
>> Sent: Sunday, May 17, 2009 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: proper method for writing files to hdfs
>>
>> The following graphic outlines the architecture for HDFS:
>> http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif
>>
>> If one is to write a client that adds data into HDFS, it needs to add it
>> through the Data Node.  Now, from the graphic I am to understand that the
>> client doesn't communicate with the NameNode, and only the Data Node.
>>
>> In the examples I've seen and the playing I am doing, I am connecting to the
>> hdfs url as a configuration parameter before I create a file.  Is this the
>> incorrect way to create files in HDFS?
>>
>>    Configuration config = new Configuration();
>>    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
>>    String path = "/tmp/i/am/a/path/to/a/file.name";
>>    Path hdfsPath = new Path(path);
>>    FileSystem fileSystem = FileSystem.get(config);
>>    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
>>    os.write("something".getBytes());
>>    os.close();
>>
>> Should the client be connecting to a data node to create the file as
>> indicated in the graphic above?
>>
>> If connecting to a data node is possible and suggested, where can I find
>> more details about this process?
>>
>> Thanks in advance,
>> -sasha
>>
>> --
>> Sasha Dolgy
>> sasha.dolgy@gmail.com

RE: proper method for writing files to hdfs

Posted by Bill Habermaas <bi...@habermaas.us>.
Sasha, 

If the namenode is unavailable then you cannot communicate with Hadoop.  It
is the single point of failure, and once it is down the system is
unusable.  The secondary namenode is not a failover substitute for the
namenode. The name is misleading. Its purpose is simply to checkpoint the
namenode's data so you can recover from a namenode failure that has
corrupted data. 
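
As for handling it in code: about the most a client can do is notice that the
namenode is gone, retry for a while, and queue data locally in the meantime.
A rough, made-up sketch (the retry count and sleep are arbitrary; this is not
failover, just graceful degradation):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Rough sketch: there is no client-side failover to another namenode,
    // so a writer can only retry and fall back to buffering data locally.
    public class HdfsConnector {
        public static FileSystem connectWithRetry(Configuration config) throws IOException {
            IOException last = null;
            for (int attempt = 1; attempt <= 5; attempt++) {
                try {
                    return FileSystem.get(config);   // resolves the namenode from fs.default.name
                } catch (IOException e) {
                    last = e;                        // namenode unreachable, or other I/O trouble
                    try {
                        Thread.sleep(10000L * attempt);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("interrupted while waiting for the namenode");
                    }
                }
            }
            throw last;  // give up; the caller should keep buffering locally until HDFS is back
        }
    }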

Bill 


-----Original Message-----
From: Sasha Dolgy [mailto:sdolgy@gmail.com] 
Sent: Monday, May 18, 2009 9:34 AM
To: core-user@hadoop.apache.org
Subject: Re: proper method for writing files to hdfs

Hi Bill,

Thanks for that.  If the NameNode is unavailable, how do we find the
secondary name node?  Is there a way to deal with this in the code or
should a load balancer of some type sit above each and only direct
traffic to the name node if it's listening?

-sd

On Mon, May 18, 2009 at 2:09 PM, Bill Habermaas <bi...@habermaas.us> wrote:
> Sasha,
>
> Connecting to the namenode is the proper way to establish the hdfs
> connection.  Afterwards the Hadoop client handler that is called by your
> code will go directly to the datanodes. There is no reason for you to
> communicate directly with a datanode nor is there a way for you to even know
> where the data nodes are located. That is all done by the Hadoop client code
> and done silently under the covers by Hadoop itself.
>
> Bill
>
> -----Original Message-----
> From: sdolgy@gmail.com [mailto:sdolgy@gmail.com] On Behalf Of Sasha Dolgy
> Sent: Sunday, May 17, 2009 10:55 AM
> To: core-user@hadoop.apache.org
> Subject: proper method for writing files to hdfs
>
> The following graphic outlines the architecture for HDFS:
> http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif
>
> If one is to write a client that adds data into HDFS, it needs to add it
> through the Data Node.  Now, from the graphic I am to understand that the
> client doesn't communicate with the NameNode, and only the Data Node.
>
> In the examples I've seen and the playing I am doing, I am connecting to the
> hdfs url as a configuration parameter before I create a file.  Is this the
> incorrect way to create files in HDFS?
>
>    Configuration config = new Configuration();
>    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
>    String path = "/tmp/i/am/a/path/to/a/file.name";
>    Path hdfsPath = new Path(path);
>    FileSystem fileSystem = FileSystem.get(config);
>    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
>    os.write("something".getBytes());
>    os.close();
>
> Should the client be connecting to a data node to create the file as
> indicated in the graphic above?
>
> If connecting to a data node is possible and suggested, where can I find
> more details about this process?
>
> Thanks in advance,
> -sasha
>
> --
> Sasha Dolgy
> sasha.dolgy@gmail.com
>
>
>



-- 
Sasha Dolgy
sasha.dolgy@gmail.com



Re: proper method for writing files to hdfs

Posted by Sasha Dolgy <sd...@gmail.com>.
Hi Bill,

Thanks for that.  If the NameNode is unavailable, how do we find the
secondary name node?  Is there a way to deal with this in the code or
should a load balancer of some type sit above each and only direct
traffic to the name node if it's listening?

-sd

On Mon, May 18, 2009 at 2:09 PM, Bill Habermaas <bi...@habermaas.us> wrote:
> Sasha,
>
> Connecting to the namenode is the proper way to establish the hdfs
> connection.  Afterwards the Hadoop client handler that is called by your
> code will go directly to the datanodes. There is no reason for you to
> communicate directly with a datanode nor is there a way for you to even know
> where the data nodes are located. That is all done by the Hadoop client code
> and done silently under the covers by Hadoop itself.
>
> Bill
>
> -----Original Message-----
> From: sdolgy@gmail.com [mailto:sdolgy@gmail.com] On Behalf Of Sasha Dolgy
> Sent: Sunday, May 17, 2009 10:55 AM
> To: core-user@hadoop.apache.org
> Subject: proper method for writing files to hdfs
>
> The following graphic outlines the architecture for HDFS:
> http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif
>
> If one is to write a client that adds data into HDFS, it needs to add it
> through the Data Node.  Now, from the graphic I am to understand that the
> client doesn't communicate with the NameNode, and only the Data Node.
>
> In the examples I've seen and the playing I am doing, I am connecting to the
> hdfs url as a configuration parameter before I create a file.  Is this the
> incorrect way to create files in HDFS?
>
>    Configuration config = new Configuration();
>    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
>    String path = "/tmp/i/am/a/path/to/a/file.name";
>    Path hdfsPath = new Path(path);
>    FileSystem fileSystem = FileSystem.get(config);
>    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
>    os.write("something".getBytes());
>    os.close();
>
> Should the client be connecting to a data node to create the file as
> indicated in the graphic above?
>
> If connecting to a data node is possible and suggested, where can I find
> more details about this process?
>
> Thanks in advance,
> -sasha
>
> --
> Sasha Dolgy
> sasha.dolgy@gmail.com
>
>
>



-- 
Sasha Dolgy
sasha.dolgy@gmail.com

RE: proper method for writing files to hdfs

Posted by Bill Habermaas <bi...@habermaas.us>.
Sasha,

Connecting to the namenode is the proper way to establish the hdfs
connection.  Afterwards the Hadoop client handler that is called by your
code will go directly to the datanodes. There is no reason for you to
communicate directly with a datanode nor is there a way for you to even know
where the data nodes are located. That is all done by the Hadoop client code
and done silently under the covers by Hadoop itself. 
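
The read path works the same way. For example (a sketch reusing the fileSystem
handle and the example path from your snippet):

    // Sketch: the client only names the file; the FileSystem/DFSClient layer
    // asks the namenode where the blocks live and then streams the bytes back
    // from the datanodes for you.
    FSDataInputStream in = fileSystem.open(new Path("/tmp/i/am/a/path/to/a/file.name"));
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);   // dump the file contents to stdout
    }
    in.close();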

Bill

-----Original Message-----
From: sdolgy@gmail.com [mailto:sdolgy@gmail.com] On Behalf Of Sasha Dolgy
Sent: Sunday, May 17, 2009 10:55 AM
To: core-user@hadoop.apache.org
Subject: proper method for writing files to hdfs

The following graphic outlines the architecture for HDFS:
http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif

If one is to write a client that adds data into HDFS, it needs to add it
through the Data Node.  Now, from the graphic I am to understand that the
client doesn't communicate with the NameNode, and only the Data Node.

In the examples I've seen and the playing I am doing, I am connecting to the
hdfs url as a configuration parameter before I create a file.  Is this the
incorrect way to create files in HDFS?

    Configuration config = new Configuration();
    config.set("fs.default.name","hdfs://foo.bar.com:9000/");
    String path = "/tmp/i/am/a/path/to/a/file.name";
    Path hdfsPath = new Path(path);
    FileSystem fileSystem = FileSystem.get(config);
    FSDataOutputStream os = fileSystem.create(hdfsPath, false);
    os.write("something".getBytes());
    os.close();

Should the client be connecting to a data node to create the file as
indicated in the graphic above?

If connecting to a data node is possible and suggested, where can I find
more details about this process?

Thanks in advance,
-sasha

-- 
Sasha Dolgy
sasha.dolgy@gmail.com