Posted to hdfs-user@hadoop.apache.org by lei liu <li...@gmail.com> on 2013/03/28 09:15:58 UTC

DFSOutputStream.sync() method latency time

When a client writes data with three replicas, I would expect the latency
of the sync method to follow this formula:
sync latency = time for the first datanode to receive the data + time for
the second datanode to receive the data + time for the third datanode to
receive the data.

If each of the three datanodes takes 2 milliseconds to receive the data,
the sync latency should be 6 milliseconds, but according to our monitoring
the sync latency is 2 milliseconds.


How should the sync method latency be calculated?


Thanks,

LiuLei

Re: DFSOutputStream.sync() method latency time

Posted by lei liu <li...@gmail.com>.
The sync method includes the following code:
          // Flush only if we haven't already flushed till this offset.
          if (lastFlushOffset != bytesCurBlock) {
            assert bytesCurBlock > lastFlushOffset;
            // record the valid offset of this flush
            lastFlushOffset = bytesCurBlock;
            enqueueCurrentPacket();
          }


When 64 KB of data has accumulated in client memory, the write method calls
enqueueCurrentPacket to send one packet to the pipeline. When less than
64 KB is buffered, the write method does not call enqueueCurrentPacket, so
it sends nothing to the pipeline. When the client then calls the sync
method, sync calls enqueueCurrentPacket to send the buffered data to the
pipeline and waits for the ack.
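
The buffering behaviour described above can be sketched as a small
standalone model. This is only an illustration, not the real
DFSOutputStream: the class name PacketBufferSketch, the buffered counter
and the queued list are made up for the example, and the field names
lastFlushOffset and bytesCurBlock are borrowed from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the client-side buffering discussed in this thread.
// Assumption: nothing here talks to HDFS; "queued" just records packet sizes.
public class PacketBufferSketch {
    static final int PACKET_SIZE = 64 * 1024;  // packet size quoted in the thread

    private long bytesCurBlock = 0;    // bytes written so far
    private long lastFlushOffset = 0;  // offset covered by the last sync
    private int buffered = 0;          // bytes in the current, not yet queued packet
    final List<Integer> queued = new ArrayList<>();  // packets handed to the pipeline

    void write(int len) {
        while (len > 0) {
            int n = Math.min(len, PACKET_SIZE - buffered);
            buffered += n;
            bytesCurBlock += n;
            len -= n;
            if (buffered == PACKET_SIZE) {
                enqueueCurrentPacket();  // a full packet is queued by write() itself
            }
        }
    }

    void sync() {
        // Flush only if we haven't already flushed till this offset.
        if (lastFlushOffset != bytesCurBlock) {
            lastFlushOffset = bytesCurBlock;
            enqueueCurrentPacket();      // a partial packet is queued by sync()
        }
        // A real client would now block until the pipeline acks the packet.
    }

    private void enqueueCurrentPacket() {
        if (buffered > 0) {
            queued.add(buffered);
            buffered = 0;
        }
    }

    public static void main(String[] args) {
        PacketBufferSketch out = new PacketBufferSketch();
        out.write(1024);
        System.out.println(out.queued);  // [] : less than 64 KB, nothing sent yet
        out.sync();
        System.out.println(out.queued);  // [1024] : sync sent the partial packet
        out.sync();
        System.out.println(out.queued);  // [1024] : nothing new to flush
        out.write(PACKET_SIZE);
        System.out.println(out.queued);  // [1024, 65536] : write sent a full packet
    }
}
```

With a test loop that calls sync after every write of about 1 KB, every
packet in this model is a small partial packet that is queued by sync, not
by write.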






Re: DFSOutputStream.sync() method latency time

Posted by Yanbo Liang <ya...@gmail.com>.
"The write method write data to memory of client, the sync method send
package to pipeline": I think you have misunderstood the HDFS write
procedure.

It is true that the write method writes data to client memory. However,
the data in client memory is sent to the DataNodes as that memory fills
up. The sending is done by another thread, so it is a concurrent
operation.

The sync method performs the same operation, except that it applies to the
last packet in the stream. It waits until it has received the acks from
the DataNodes.

The write method and the sync method are not concurrent with each other.
Each of them is concurrent with the background thread that transfers data
to the DataNodes.
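
The division of work described above (the caller only queues packets, a
background thread sends them, and sync waits for the acks) can be sketched
roughly as follows. This is a simplified model under stated assumptions,
not HDFS code: the class StreamerSketch, its methods, and the 2 ms sleep
that stands in for a pipeline round trip are all hypothetical, and each
write() call here stands for one complete packet.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamerSketch {
    private final BlockingQueue<Long> dataQueue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private long lastQueuedSeqno = 0;  // guarded by lock
    private long lastAckedSeqno = 0;   // guarded by lock
    private final Thread streamer = new Thread(this::streamLoop);

    StreamerSketch() {
        streamer.setDaemon(true);
        streamer.start();
    }

    // Background thread: takes packets off the queue and "sends" them.
    private void streamLoop() {
        try {
            while (true) {
                long seqno = dataQueue.take();
                if (seqno < 0) {
                    return;            // stop marker queued by close()
                }
                Thread.sleep(2);       // hypothetical pipeline round trip
                synchronized (lock) {
                    lastAckedSeqno = seqno;
                    lock.notifyAll();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns at once: the packet is only handed to the background thread.
    void write() {
        synchronized (lock) {
            dataQueue.add(++lastQueuedSeqno);
        }
    }

    // Blocks until every packet queued so far has been acknowledged.
    void sync() {
        synchronized (lock) {
            long target = lastQueuedSeqno;
            try {
                while (lastAckedSeqno < target) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for acks", e);
            }
        }
    }

    long acked() {
        synchronized (lock) {
            return lastAckedSeqno;
        }
    }

    void close() {
        dataQueue.add(-1L);            // ask the background thread to stop
    }

    public static void main(String[] args) {
        StreamerSketch out = new StreamerSketch();
        long t0 = System.nanoTime();
        for (int i = 0; i < 5; i++) {
            out.write();
        }
        long t1 = System.nanoTime();
        out.sync();
        long t2 = System.nanoTime();
        System.out.printf("5 writes: %d us, sync: %d us, acked: %d%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, out.acked());
        out.close();
    }
}
```

In this model the time spent in write() does not include any sending,
while the time spent in sync() covers only the packets that have not been
acknowledged yet when it is called.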

I guess you can read Chinese, so I recommend one of my blog posts
(http://yanbohappy.sinaapp.com/?p=143), which explains the write workflow
in detail.


Re: DFSOutputStream.sync() method latency time

Posted by lei liu <li...@gmail.com>.
Thanks Yanbo for your reply.

My test code is:
        FSDataOutputStream outputStream = fs.create(path);
        Random r = new Random();
        long totalBytes = 0;
        // 1 KB of padding, so each record is slightly larger than 1 KB
        String str = new String(new byte[1024]);
        while (totalBytes < 1024 * 1024 * 500) {
          byte[] bytes = ("start_" + r.nextLong() + "_" + str
              + r.nextLong() + "_end" + "\n").getBytes();
          outputStream.write(bytes);
          // sync after every record
          outputStream.sync();
          totalBytes = totalBytes + bytes.length;
        }
        outputStream.close();


The write method and the sync method are both synchronized, so the two
methods do not run concurrently.

The write method writes data to client memory, and the sync method sends
the packet to the pipeline. The client cannot execute the write method
again until the sync method has returned successfully, so I think the sync
latency should be equal to the sum of the time spent on each datanode.
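
The two ways of reasoning about latency in this thread can be put side by
side in a short calculation. This is only a sketch: the serial model is
the formula from the original question, while the overlapped model is an
assumption made for illustration (each datanode starts a fixed forwarding
delay after the previous one, and sync returns when the slowest datanode
has finished). The class SyncLatencySketch and the hopDelayMs parameter
are hypothetical, and ack travel time is ignored.

```java
public class SyncLatencySketch {
    // Serial model from the question: each datanode starts only after
    // the previous one has finished receiving.
    static double serial(double[] receiveMs) {
        double sum = 0;
        for (double t : receiveMs) {
            sum += t;
        }
        return sum;
    }

    // Overlapped model (an assumption, not measured HDFS behaviour):
    // datanode i starts i * hopDelayMs after the first one, and the
    // latency is the time at which the last datanode finishes.
    static double overlapped(double[] receiveMs, double hopDelayMs) {
        double finish = 0;
        for (int i = 0; i < receiveMs.length; i++) {
            finish = Math.max(finish, i * hopDelayMs + receiveMs[i]);
        }
        return finish;
    }

    public static void main(String[] args) {
        double[] receiveMs = {2.0, 2.0, 2.0};  // the 2 ms figure from the question
        System.out.println(serial(receiveMs));           // 6.0
        System.out.println(overlapped(receiveMs, 0.0));  // 2.0
        System.out.println(overlapped(receiveMs, 0.1));  // slightly above 2 ms
    }
}
```

Under the overlapped model the result stays close to the 2 milliseconds
seen in monitoring as long as the forwarding delay between datanodes is
small compared with the receive time.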




2013/3/28 Yanbo Liang <ya...@gmail.com>

> First, when a client wants to write data to HDFS, it creates a
> DFSOutputStream.
> The client then writes data to this output stream, and the stream
> transfers the data to all DataNodes through the constructed pipeline in
> packets of 64 KB.
> These two operations are concurrent, so the write latency is not a simple
> sum.
>
> Second, the sync method only flushes the last packet (at most 64 KB) of
> data to the pipeline.
>
> Because all these operations are processed concurrently, the latency is
> smaller than the sum of the individual operations.
> In a sense, it is parallel rather than serial computing.
>
>
> 2013/3/28 lei liu <li...@gmail.com>
>
>> When a client writes data with three replicas, the sync method latency
>> formula should be:
>> sync latency = first datanode receive time + second datanode receive
>> time + third datanode receive time.
>>
>> If each of the three datanodes takes 2 milliseconds to receive the data,
>> the sync latency should be 6 milliseconds, but according to our monitor
>> the sync latency is 2 milliseconds.
>>
>>
>> How should the sync method latency be calculated?
>>
>>
>> Thanks,
>>
>> LiuLei
>>
>>
>

Re: DFSOutputStream.sync() method latency time

Posted by Yanbo Liang <ya...@gmail.com>.
First, when a client wants to write data to HDFS, it creates a
DFSOutputStream.
The client then writes data to this output stream, and the stream transfers
the data to all DataNodes through the constructed pipeline in packets of
64 KB.
These two operations are concurrent, so the write latency is not a simple
sum.

Second, the sync method only flushes the last packet (at most 64 KB) of data
to the pipeline.

Because all these operations are processed concurrently, the latency is
smaller than the sum of the individual operations.
In a sense, it is parallel rather than serial computing.
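
The difference between the two views can be sketched with a toy model. This
is not HDFS code: SyncLatencyModel and its methods are hypothetical, hopMs is
an assumed one-way network delay between neighbours in the pipeline, and the
overlapped model assumes each datanode forwards a packet downstream as soon
as it arrives, before finishing its own work.

```java
public class SyncLatencyModel {
    /** Serial view: each datanode starts only after the previous one has finished. */
    public static long serialLatency(long[] receiveMs, long hopMs) {
        long total = 0;
        for (long t : receiveMs) {
            total += hopMs + t;                  // travel to the node, then process there
        }
        return total + hopMs * receiveMs.length; // the ack travels back over every hop
    }

    /** Overlapped view: every datanode forwards on arrival and works in parallel. */
    public static long overlappedLatency(long[] receiveMs, long hopMs) {
        long ackReady = 0; // when the downstream ack reaches the current node
        for (int i = receiveMs.length - 1; i >= 0; i--) {
            long arrival = (i + 1) * hopMs;          // packet reaches node i
            long localDone = arrival + receiveMs[i]; // node i has finished its own work
            // Node i acks upstream once its own work and the downstream ack are both done.
            long ackSent = Math.max(localDone, ackReady);
            ackReady = ackSent + hopMs;              // ack reaches the upstream node or client
        }
        return ackReady;
    }

    public static void main(String[] args) {
        long[] receiveMs = {2, 2, 2}; // the 2 ms per datanode from the question
        System.out.println("serial:     " + serialLatency(receiveMs, 0) + " ms");
        System.out.println("overlapped: " + overlappedLatency(receiveMs, 0) + " ms");
    }
}
```

With a negligible hop delay, the serial view gives 6 ms and the overlapped
view gives 2 ms for three datanodes at 2 ms each, which is in line with the
monitored value. A real pipeline has further costs that this sketch ignores.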


2013/3/28 lei liu <li...@gmail.com>

> When a client writes data with three replicas, the sync method latency
> formula should be:
> sync latency = first datanode receive time + second datanode receive
> time + third datanode receive time.
>
> If each of the three datanodes takes 2 milliseconds to receive the data,
> the sync latency should be 6 milliseconds, but according to our monitor
> the sync latency is 2 milliseconds.
>
>
> How should the sync method latency be calculated?
>
>
> Thanks,
>
> LiuLei
>
>
