Posted to hdfs-user@hadoop.apache.org by Vijaya Narayana Reddy Bhoomi Reddy <vi...@gmail.com> on 2014/06/17 13:17:43 UTC

HDFS File Writes & Reads

Hi,

I have a basic question regarding file writes and reads in HDFS. Are the
file write and read processes sequential activities, or are they executed
in parallel?

For example, let's assume that there is a file File1 which consists of
three blocks B1, B2 and B3.

1. Will the write process write B2 only after B1 is complete, and B3 only
after B2 is complete, or, for a large file with many blocks, can this happen
in parallel? In all the Hadoop documentation, I read this to be a
sequential operation. Does that mean that for a file of 1 TB, the write
takes three times longer than a traditional file write (due to the default
replication factor of 3)?
2. Is it similar in the case of reads as well?
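
To make the write side concrete, by "file write" I mean nothing more than a
plain client write through the FileSystem API, roughly like the sketch below
(the path and buffer size are just placeholders I made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class SimpleHdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/data/File1");         // placeholder destination path
        try (InputStream in = new BufferedInputStream(new FileInputStream("/local/File1"));
             FSDataOutputStream out = fs.create(dst)) {  // replication comes from dfs.replication
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);               // one sequential stream; HDFS cuts it into blocks
            }
        }
    }
}

There is only a single output stream here, which is why I assumed the blocks
must go out one after another.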

Could someone please provide some clarity on this?

Regards
Vijay

Re: HDFS File Writes & Reads

Posted by Vijaya Narayana Reddy Bhoomi Reddy <vi...@gmail.com>.
Yong,

Thanks for the clarification. It was more of an academic query. We do not
have any performance requirements at this stage.

Regards
Vijay


On 19 June 2014 19:05, java8964 <ja...@hotmail.com> wrote:

> What your understanding is almost correct, but not with the part your
> highlighted.
>
> The HDFS is not designed for write performance, but the client doesn't
> have to wait for the acknowledgment of previous packets before sending the
> next packets.
>
> This webpage describes it clearly, and hope it is helpful for you.
>
> http://aosabook.org/en/hdfs.html
>
> Quoted
>
> The next packet can be pushed to the pipeline before receiving the
> acknowledgment for the previous packets. The number of outstanding packets
> is limited by the outstanding packets window size of the client.
>
> Do you have any requirements of performance of ingesting data into HDFS?
>
> Yong
>
> ------------------------------
> Date: Thu, 19 Jun 2014 11:51:43 +0530
> Subject: Re: HDFS File Writes & Reads
> From: vijay.bhoomireddy@gmail.com
> To: user@hadoop.apache.org
>
>
> @Zeshen Wu,Thanks for the response.
>
> I still don't understand how HDFS reduces the time to write and read a
> file, compared to a traditional file read / write mechanism.
>
> For example, if I am writing a file, using the default configurations,
> Hadoop internally has to write each block to 3 data nodes. My understanding
> is that for each block, first the client writes the block to the first data
> node in the pipeline which will then inform the second and so on. Once the
> third data node successfully receives the block, it provides an
> acknowledgement back to data node 2 and finally to the client through Data
> node 1. *Only after receiving the acknowledgement for the block, the
> write is considered successful and the client proceeds to write the next
> block.*
>
> If this is the case, then the time taken to write each block is 3 times
> than the normal write due to the replication factor and the write process
> is happening sequentially block after block.
>
> Please correct me if I am wrong in my understanding. Also, the following
> questions below:
>
> 1. My understanding is that File read / write in Hadoop doesn't have any
> parallelism and the best it can perform is same to a traditional file read
> or write + some overhead involved in the distributed communication
> mechanism.
> 2. Parallelism is provided only during the data processing phase via Map
> Reduce, but not during file read / write by a client.
>
> Regards
> Vijay
>
>
>
> On 17 June 2014 19:37, Zesheng Wu <wu...@gmail.com> wrote:
>
> 1. HDFS doesn't allow parallel write
> 2. HDFS use pipeline to write multiple replicas, so it doesn't take three
> times more time than a traditional file write
> 3. HDFS allow parallel read
>
>
> 2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <
> vijay.bhoomireddy@gmail.com>:
>
> Hi,
>
> I have a basic question regarding file writes and reads in HDFS. Is the
> file write and read process a sequential activity or executed in parallel?
>
> For example, lets assume that there is a File File1 which constitutes of
> three blocks B1, B2 and B3.
>
> 1. Will the write process write B2 only after B1 is complete and B3 only
> after B2 is complete or for a large file with many blocks, can this happen
> in parallel? In all the hadoop documentation, I read this to be a
> sequential operation. Does that mean for a file of 1TB, it takes three
> times more time than a traditional file write? (due to default replication
> factor of 3)
> 2. Is it similar in the case of read as well?
>
> Kindly someone please provide some clarity on this...
>
> Regards
> Vijay
>
>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>
>
>

RE: HDFS File Writes & Reads

Posted by java8964 <ja...@hotmail.com>.
Your understanding is almost correct, except for the part you highlighted.
HDFS is not designed for write performance, but the client doesn't have to wait for the acknowledgment of previous packets before sending the next ones.
This webpage describes it clearly; I hope it is helpful for you.
http://aosabook.org/en/hdfs.html
Quoted
The next packet can be pushed to the pipeline before receiving the acknowledgment for the previous packets. The number of outstanding packets is limited by the outstanding packets window size of the client.
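
As a rough mental model only (this is not the real DFSClient code), you can think of the client keeping a bounded window of unacknowledged packets, something like this toy Java sketch:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;

// Toy model of a pipelined writer: the sender may push up to WINDOW packets
// before any acknowledgment comes back, so sending and acking overlap in time.
public class PacketWindowDemo {
    static final int WINDOW = 8;      // made-up "outstanding packets window size"
    static final int PACKETS = 100;   // made-up number of packets in a block

    public static void main(String[] args) throws InterruptedException {
        Semaphore window = new Semaphore(WINDOW);
        BlockingQueue<Integer> inFlight = new ArrayBlockingQueue<>(WINDOW);

        Thread acker = new Thread(() -> {
            try {
                for (int i = 0; i < PACKETS; i++) {
                    inFlight.take();      // "ack" arrives for the oldest outstanding packet
                    Thread.sleep(5);      // pretend round-trip latency through the pipeline
                    window.release();     // frees a slot so the sender can keep going
                }
            } catch (InterruptedException ignored) { }
        });
        acker.start();

        for (int seq = 0; seq < PACKETS; seq++) {
            window.acquire();             // blocks only when WINDOW packets are unacknowledged
            inFlight.put(seq);            // "send" the next packet down the pipeline
        }
        acker.join();
        System.out.println("all packets sent and acknowledged");
    }
}

With a window like this, the cost of replicating a packet mostly overlaps with sending the following packets, which is why the pipeline does not cost three times the elapsed time.
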
Do you have any performance requirements for ingesting data into HDFS?
Yong

Date: Thu, 19 Jun 2014 11:51:43 +0530
Subject: Re: HDFS File Writes & Reads
From: vijay.bhoomireddy@gmail.com
To: user@hadoop.apache.org

@Zeshen Wu,Thanks for the response.
I still don't understand how HDFS reduces the time to write and read a file, compared to a traditional file read / write mechanism. 

For example, if I am writing a file, using the default configurations, Hadoop internally has to write each block to 3 data nodes. My understanding is that for each block, first the client writes the block to the first data node in the pipeline which will then inform the second and so on. Once the third data node successfully receives the block, it provides an acknowledgement back to data node 2 and finally to the client through Data node 1. Only after receiving the acknowledgement for the block, the write is considered successful and the client proceeds to write the next block.

If this is the case, then the time taken to write each block is 3 times than the normal write due to the replication factor and the write process is happening sequentially block after block.

Please correct me if I am wrong in my understanding. Also, the following questions below:
1. My understanding is that File read / write in Hadoop doesn't have any parallelism and the best it can perform is same to a traditional file read or write + some overhead involved in the distributed communication mechanism.
2. Parallelism is provided only during the data processing phase via Map Reduce, but not during file read / write by a client.
Regards
Vijay



On 17 June 2014 19:37, Zesheng Wu <wu...@gmail.com> wrote:

1. HDFS doesn't allow parallel write
2. HDFS use pipeline to write multiple replicas, so it doesn't take three times more time than a traditional file write
3. HDFS allow parallel read




2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <vi...@gmail.com>:



Hi,



I have a basic question regarding file writes and reads in HDFS. Is the file write and read process a sequential activity or executed in parallel?

For example, lets assume that there is a File File1 which constitutes of three blocks B1, B2 and B3. 




1. Will the write process write B2 only after B1 is complete and B3 only after B2 is complete or for a large file with many blocks, can this happen in parallel? In all the hadoop documentation, I read this to be a sequential operation. Does that mean for a file of 1TB, it takes three times more time than a traditional file write? (due to default replication factor of 3)



2. Is it similar in the case of read as well?




Kindly someone please provide some clarity on this...
Regards



Vijay


-- 
Best Wishes!

Yours, Zesheng



Re: HDFS File Writes & Reads

Posted by Vijaya Narayana Reddy Bhoomi Reddy <vi...@gmail.com>.
@Zesheng Wu, thanks for the response.

I still don't understand how HDFS reduces the time to write and read a
file, compared to a traditional file read / write mechanism.

For example, if I am writing a file using the default configuration,
Hadoop internally has to write each block to 3 data nodes. My understanding
is that for each block, the client first writes the block to the first data
node in the pipeline, which then forwards it to the second, and so on. Once
the third data node successfully receives the block, it sends an
acknowledgement back to data node 2 and finally to the client through data
node 1. *Only after receiving the acknowledgement for the block is the
write considered successful, and the client proceeds to write the next
block.*

If this is the case, then the time taken to write each block is three times
that of a normal write due to the replication factor, and the write process
happens sequentially, block after block.

Please correct me if my understanding is wrong. Also, two follow-up points:

1. My understanding is that file reads/writes in Hadoop don't have any
parallelism, and the best they can do is match a traditional file read or
write, plus some overhead from the distributed communication mechanism.
2. Parallelism is provided only during the data processing phase via
MapReduce, not during file reads/writes by a client.

Regards
Vijay



On 17 June 2014 19:37, Zesheng Wu <wu...@gmail.com> wrote:

> 1. HDFS doesn't allow parallel write
> 2. HDFS use pipeline to write multiple replicas, so it doesn't take three
> times more time than a traditional file write
> 3. HDFS allow parallel read
>
>
> 2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <
> vijay.bhoomireddy@gmail.com>:
>
> Hi,
>>
>> I have a basic question regarding file writes and reads in HDFS. Is the
>> file write and read process a sequential activity or executed in parallel?
>>
>> For example, lets assume that there is a File File1 which constitutes of
>> three blocks B1, B2 and B3.
>>
>> 1. Will the write process write B2 only after B1 is complete and B3 only
>> after B2 is complete or for a large file with many blocks, can this happen
>> in parallel? In all the hadoop documentation, I read this to be a
>> sequential operation. Does that mean for a file of 1TB, it takes three
>> times more time than a traditional file write? (due to default replication
>> factor of 3)
>> 2. Is it similar in the case of read as well?
>>
>> Kindly someone please provide some clarity on this...
>>
>> Regards
>> Vijay
>>
>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>

Re: HDFS File Writes & Reads

Posted by Zesheng Wu <wu...@gmail.com>.
1. HDFS doesn't allow parallel writes.
2. HDFS uses a pipeline to write the multiple replicas, so it doesn't take
three times longer than a traditional file write.
3. HDFS allows parallel reads (see the sketch below).
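
On point 3, a sketch of what a parallel read can look like from the client side. Positional reads on FSDataInputStream do not move the stream position, so different readers can pull different block ranges at the same time; the path and the 128 MB block size below are just assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/File1");            // hypothetical file
        long blockSize = 128L * 1024 * 1024;            // assume 128 MB blocks
        long len = fs.getFileStatus(path).getLen();

        try (FSDataInputStream in = fs.open(path)) {
            Thread[] readers = new Thread[(int) ((len + blockSize - 1) / blockSize)];
            for (int i = 0; i < readers.length; i++) {
                final long offset = (long) i * blockSize;
                final int toRead = (int) Math.min(blockSize, len - offset);
                readers[i] = new Thread(() -> {
                    try {
                        byte[] buf = new byte[toRead];
                        in.readFully(offset, buf);      // positional read; does not seek the stream
                        System.out.println("read " + buf.length + " bytes at offset " + offset);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                readers[i].start();
            }
            for (Thread t : readers) t.join();
        }
    }
}

In a real job each map task would normally open its own stream for its own split rather than share one, but the idea is the same: different readers fetch different blocks, often from different DataNodes, at the same time.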


2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <
vijay.bhoomireddy@gmail.com>:

> Hi,
>
> I have a basic question regarding file writes and reads in HDFS. Is the
> file write and read process a sequential activity or executed in parallel?
>
> For example, lets assume that there is a File File1 which constitutes of
> three blocks B1, B2 and B3.
>
> 1. Will the write process write B2 only after B1 is complete and B3 only
> after B2 is complete or for a large file with many blocks, can this happen
> in parallel? In all the hadoop documentation, I read this to be a
> sequential operation. Does that mean for a file of 1TB, it takes three
> times more time than a traditional file write? (due to default replication
> factor of 3)
> 2. Is it similar in the case of read as well?
>
> Kindly someone please provide some clarity on this...
>
> Regards
> Vijay
>



-- 
Best Wishes!

Yours, Zesheng
