Posted to user@hadoop.apache.org by 徐传印 <xu...@hust.edu.cn> on 2018/01/29 06:57:31 UTC

performance about writing data to HDFS

Hi community:
  I have a question about the performance of writing to HDFS.

  I've learned that when we write data to HDFS using an interface such as 'FileSystem.create', the client blocks until all blocks and their replicas are written. This causes an efficiency problem if we use HDFS as our final data storage. Many of my colleagues therefore write the data to local disk in the main thread and copy it to HDFS in another thread; obviously, this increases disk I/O.

  So, is there a way to optimize this usage? I don't want to increase disk I/O, nor do I want to block while the extra replicas are written.



  How about writing to HDFS with a replication factor of one in the main thread and setting the actual replication factor in another thread? Or is there a better way to do this?

Re: Re: performance about writing data to HDFS

Posted by 徐传印 <xu...@hust.edu.cn>.
Thanks, Miklos.

To achieve this goal, we currently need to combine two or more of the interfaces HDFS provides.

So, what do you think about providing another interface that writes the extra replicas in the background, to make this simpler?

Before:

FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    long blockSize) throws IOException




After:

FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    short acceptableReplication, // block only until this minimum replication is reached
    long blockSize) throws IOException
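
For illustration, a caller could then write like this. Note that 'acceptableReplication' is only the proposal above, not part of the current HDFS API, and the path and sizes here are made up for the example:

```java
// Hypothetical: 'acceptableReplication' is the proposed parameter, not an existing HDFS API.
Path p = new Path("/tmp/example.dat");
try (FSDataOutputStream out = fs.create(p,
    true,        // overwrite
    4096,        // bufferSize
    (short) 3,   // replication: final target, completed in the background
    (short) 1,   // acceptableReplication: close() waits only for one replica
    fs.getDefaultBlockSize(p))) {
  out.write(data);
}
```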


-----Original Message-----
From: "Miklos Szegedi" <sz...@apache.org>
Sent: 2018-01-30 01:50:23 (Tuesday)
To: "徐传印" <xu...@hust.edu.cn>
Cc: Hdfs-dev <hd...@hadoop.apache.org>, "Hadoop Common" <co...@hadoop.apache.org>, "common-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: performance about writing data to HDFS


Hello,

Here is an example.

You can set an initial low replication like this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create and write to a stream instead of dealing with a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195

Once you are done, you can set a final replication count and HDFS will replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256

Thanks,
Miklos


On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <xu...@hust.edu.cn> wrote:



Re: performance about writing data to HDFS

Posted by Miklos Szegedi <sz...@apache.org>.
Hello,

Here is an example.

You can set an initial low replication like this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create and write to a stream instead of dealing with a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195

Once you are done, you can set a final replication count and HDFS will
replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is
reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256
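
The four steps above can be sketched as follows. This is a minimal illustration of the pattern, not the FrameworkUploader code itself; the path, buffer size, and replication counts are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowThenHighReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/tmp/example.dat"); // illustrative path

    // 1. + 2. Create with replication 1 and write directly to the stream,
    // so close() only waits for a single copy of each block.
    try (FSDataOutputStream out = fs.create(target, true, 4096, (short) 1,
        fs.getDefaultBlockSize(target))) {
      out.write("hello hdfs".getBytes("UTF-8"));
    }

    // 3. Raise the replication factor; the NameNode schedules the extra
    // copies in the background and this call returns immediately.
    fs.setReplication(target, (short) 3);

    // 4. Optionally poll until every block has at least 2 live replicas.
    FileStatus status = fs.getFileStatus(target);
    boolean done = false;
    while (!done) {
      done = true;
      for (BlockLocation loc :
          fs.getFileBlockLocations(status, 0, status.getLen())) {
        if (loc.getHosts().length < 2) {
          done = false;
        }
      }
      if (!done) {
        Thread.sleep(1000);
      }
    }
  }
}
```

The polling in step 4 uses getFileBlockLocations because it reports the replicas actually in place, whereas FileStatus.getReplication() only reports the target replication factor.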

Thanks,
Miklos

On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <xu...@hust.edu.cn> wrote:

>
> Hi community:
>   I have a question about the performance of writing to HDFS.
>
>   I've learned that when we write data to HDFS using an interface such as
> 'FileSystem.create', the client blocks until all blocks and their replicas
> are written. This causes an efficiency problem if we use HDFS as our final
> data storage. Many of my colleagues therefore write the data to local disk
> in the main thread and copy it to HDFS in another thread; obviously, this
> increases disk I/O.
>
>   So, is there a way to optimize this usage? I don't want to increase disk
> I/O, nor do I want to block while the extra replicas are written.
>
>   How about writing to HDFS with a replication factor of one in the main
> thread and setting the actual replication factor in another thread? Or is
> there a better way to do this?
>
