Posted to common-user@hadoop.apache.org by Da Zheng <zh...@gmail.com> on 2011/07/18 09:41:08 UTC

replicate data in HDFS with smarter encoding

Hello,

It seems that data replication in HDFS is simply copying data among nodes. Has
anyone considered using a better encoding to reduce the data size? Say, a block
of data is split into N pieces, and as long as M of the pieces survive in the
network, we can regenerate the original data.
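
To make the idea concrete, here is a minimal toy sketch (my own illustration,
not anything that exists in HDFS) that uses a single XOR parity piece, the
simplest case of such an encoding; a real implementation would likely use an
erasure code such as Reed-Solomon so that more than one lost piece can be
tolerated:

import java.util.Arrays;

// Toy example, not HDFS code: split a block into N pieces plus one XOR parity
// piece, so that any N of the N+1 pieces are enough to rebuild the block.
public class XorParityExample {
    public static void main(String[] args) {
        byte[] block = "a block of data that we want to keep reliable".getBytes();
        int n = 4;                                    // number of data pieces
        int pieceLen = (block.length + n - 1) / n;    // round up; last piece is zero-padded
        byte[][] pieces = new byte[n + 1][pieceLen];  // pieces[n] holds the parity

        for (int i = 0; i < n; i++) {
            int from = i * pieceLen;
            int len = Math.min(pieceLen, block.length - from);
            System.arraycopy(block, from, pieces[i], 0, len);
            for (int j = 0; j < pieceLen; j++) {
                pieces[n][j] ^= pieces[i][j];         // accumulate the XOR parity
            }
        }

        // Pretend piece 2 was lost, then rebuild it from the surviving pieces.
        byte[] lost = pieces[2].clone();
        Arrays.fill(pieces[2], (byte) 0);
        byte[] rebuilt = new byte[pieceLen];
        for (int i = 0; i <= n; i++) {
            if (i == 2) continue;
            for (int j = 0; j < pieceLen; j++) {
                rebuilt[j] ^= pieces[i][j];
            }
        }
        System.out.println("piece 2 recovered: " + Arrays.equals(rebuilt, lost));
    }
}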

There are many benefits to reducing the data size. It can save network and disk
bandwidth, and thus reduce energy consumption. Computation power might be a
concern, but we could use a GPU to encode and decode.

But maybe the idea is stupid or it's hard to reduce the data size. I would like
to hear your comments.

Thanks,
Da

Re: replicate data in HDFS with smarter encoding

Posted by Da Zheng <zh...@gmail.com>.
Hello,

On 07/18/11 21:43, Uma Maheswara Rao G 72686 wrote:
> Hi,
>
> We have already thought about it.
No, I think we are talking about different problems. What I'm talking
about is how to reduce the number of replicas while still achieving the
same data reliability. The replicated data can already be compressed.

To illustrate the problem, here is a more concrete example:
The size of block A is X. After it is compressed, its size is Y. When it
is written to HDFS, it needs to be replicated if we want the data to be
reliable. If the replication factor is R, then R*Y bytes will be written
to disk, and (R-1)*Y bytes will be transmitted over the network.

Now, if we use a better encoding to achieve data reliability, then for
every B blocks of data we can add P parity blocks. For each block, we need
to have (1 + P/B)*Y bytes written to disk and (P/B)*Y bytes transmitted
over the network, and thus it's possible to further reduce the network
and disk bandwidth.
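
To put illustrative numbers on it (hypothetical figures of my own, not taken
from any existing HDFS feature): with the default replication factor R = 3,
each compressed block of Y bytes costs 3*Y bytes on disk and 2*Y bytes over
the network. With, say, B = 10 data blocks protected by P = 4 parity blocks,
the per-block cost drops to (1 + 4/10)*Y = 1.4*Y bytes on disk and
(4/10)*Y = 0.4*Y bytes over the network, while still tolerating the loss of
any 4 of the 14 blocks (assuming an MDS code such as Reed-Solomon).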

So what Joey showed me is more relevant, even though it doesn't reduce
the data size before the data is written to the network or the disk.

To implement that, I think we would probably not use the write pipeline any more.

> Looks like you are talking about these features, right?
> https://issues.apache.org/jira/browse/HDFS-1640
> https://issues.apache.org/jira/browse/HDFS-2115
About your patches, I don't know how useful they would be when we can ask
applications to compress data themselves. For example, we can enable
mapred.output.compress in MapReduce to ask reducers to compress their
output; I assume MapReduce is the major user of HDFS.
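
For instance, something along these lines should work with the old mapred API
(a sketch under my own assumptions; the exact property names and helper
methods can vary with the Hadoop version):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: ask reducers to compress the job output before it hits HDFS.
public class CompressedOutputExample {
    public static JobConf configureCompression(JobConf conf) {
        // Equivalent to setting mapred.output.compress=true in the job config.
        FileOutputFormat.setCompressOutput(conf, true);
        // Pick a codec for the output, e.g. gzip.
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        return conf;
    }
}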

Thanks,
Da

Re: replicate data in HDFS with smarter encoding

Posted by Uma Maheswara Rao G 72686 <ma...@huawei.com>.
Hi,

We have already thought about it.

Looks like you are talking about these features, right?
https://issues.apache.org/jira/browse/HDFS-1640
https://issues.apache.org/jira/browse/HDFS-2115

but the implementation is not yet ready in trunk.


Regards,
Uma

----- Original Message -----
From: Da Zheng <zh...@gmail.com>
Date: Tuesday, July 19, 2011 9:23 am
Subject: Re: replicate data in HDFS with smarter encoding
To: common-user@hadoop.apache.org
Cc: Joey Echeverria <jo...@cloudera.com>, "hdfs-user@hadoop.apache.org" <hd...@hadoop.apache.org>

> So is this kind of feature desired by the community?
> 
> It seems this implementation can only reduce the data size on the disk
> through the background RaidNode daemon, but it cannot reduce the disk
> bandwidth and network bandwidth when the client writes data to HDFS. It
> might be more interesting to reduce the disk and network bandwidth,
> although that might require modifying the implementation of the write
> pipeline in HDFS.
> 
> Thanks,
> Da
> 
> 
> On 07/18/11 04:10, Joey Echeverria wrote:
> > Facebook contributed some code to do something similar called HDFS RAID:
> >
> > http://wiki.apache.org/hadoop/HDFS-RAID
> >
> > -Joey
> >
> >
> > On Jul 18, 2011, at 3:41, Da Zheng <zh...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> It seems that data replication in HDFS is simply copying data among nodes.
> >> Has anyone considered using a better encoding to reduce the data size?
> >> Say, a block of data is split into N pieces, and as long as M of the
> >> pieces survive in the network, we can regenerate the original data.
> >>
> >> There are many benefits to reducing the data size. It can save network
> >> and disk bandwidth, and thus reduce energy consumption. Computation power
> >> might be a concern, but we could use a GPU to encode and decode.
> >>
> >> But maybe the idea is stupid or it's hard to reduce the data size. I
> >> would like to hear your comments.
> >>
> >> Thanks,
> >> Da

Re: replicate data in HDFS with smarter encoding

Posted by Da Zheng <zh...@gmail.com>.
So is this kind of feature desired by the community?

It seems this implementation can only reduce the data size on the disk
through the background RaidNode daemon, but it cannot reduce the disk
bandwidth and network bandwidth when the client writes data to HDFS. It
might be more interesting to reduce the disk and network bandwidth,
although that might require modifying the implementation of the write
pipeline in HDFS.

Thanks,
Da


On 07/18/11 04:10, Joey Echeverria wrote:
> Facebook contributed some code to do something similar called HDFS RAID:
>
> http://wiki.apache.org/hadoop/HDFS-RAID
>
> -Joey
>
>
> On Jul 18, 2011, at 3:41, Da Zheng<zh...@gmail.com>  wrote:
>
>> Hello,
>>
>> It seems that data replication in HDFS is simply copying data among nodes. Has
>> anyone considered using a better encoding to reduce the data size? Say, a block
>> of data is split into N pieces, and as long as M of the pieces survive in the
>> network, we can regenerate the original data.
>>
>> There are many benefits to reducing the data size. It can save network and disk
>> bandwidth, and thus reduce energy consumption. Computation power might be a
>> concern, but we could use a GPU to encode and decode.
>>
>> But maybe the idea is stupid or it's hard to reduce the data size. I would like
>> to hear your comments.
>>
>> Thanks,
>> Da


Re: replicate data in HDFS with smarter encoding

Posted by Joey Echeverria <jo...@cloudera.com>.
Facebook contributed some code to do something similar called HDFS RAID:

http://wiki.apache.org/hadoop/HDFS-RAID

-Joey


On Jul 18, 2011, at 3:41, Da Zheng <zh...@gmail.com> wrote:

> Hello,
> 
> It seems that data replication in HDFS is simply copying data among nodes. Has
> anyone considered using a better encoding to reduce the data size? Say, a block
> of data is split into N pieces, and as long as M of the pieces survive in the
> network, we can regenerate the original data.
> 
> There are many benefits to reducing the data size. It can save network and disk
> bandwidth, and thus reduce energy consumption. Computation power might be a
> concern, but we could use a GPU to encode and decode.
> 
> But maybe the idea is stupid or it's hard to reduce the data size. I would like
> to hear your comments.
> 
> Thanks,
> Da
