Posted to hdfs-dev@hadoop.apache.org by lohit <lo...@gmail.com> on 2013/11/11 19:59:52 UTC

HDFS read/write data throttling

Hello Devs,

Wanted to reach out and see if anyone has thought about the ability to
throttle data transfer within HDFS. One option we have been considering is to
throttle on a per-FileSystem basis, similar to Statistics in FileSystem. This
would mean anyone with a handle to HDFS/Hftp would be throttled globally
within the JVM. The right value for this would depend on the type of hardware
we use and how many tasks/clients we allow.

On the other hand, doing something like this at the FileSystem layer would
mean many other tasks, such as job jar copies, DistributedCache copies, and
any hidden data movement, would also be throttled. We wanted to know if anyone
has had such a requirement on their clusters in the past and what the thinking
around it was. We appreciate your inputs/comments.

-- 
Have a Nice Day!
Lohit
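
(A minimal sketch of the JVM-global, per-FileSystem throttle described above:
a shared byte budget, keyed by scheme the way FileSystem.Statistics is keyed,
that stream read/write paths would call into. All names here are hypothetical;
this is not an existing Hadoop API.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical JVM-global throttle, keyed per FileSystem scheme the way
    // FileSystem.Statistics is. Stream read/write paths would call acquire().
    public final class FileSystemThrottle {
      private static final Map<String, FileSystemThrottle> THROTTLES =
          new ConcurrentHashMap<>();

      private final long bytesPerSec;
      private long windowStart = System.nanoTime();
      private long bytesInWindow = 0;

      private FileSystemThrottle(long bytesPerSec) {
        this.bytesPerSec = bytesPerSec;
      }

      public static FileSystemThrottle forScheme(String scheme, long bytesPerSec) {
        return THROTTLES.computeIfAbsent(scheme,
            s -> new FileSystemThrottle(bytesPerSec));
      }

      // Blocks once this second's byte budget is spent, so every user of the
      // scheme within the JVM is throttled together.
      public synchronized void acquire(int bytes) throws InterruptedException {
        long now = System.nanoTime();
        if (now - windowStart >= 1_000_000_000L) { // roll to a fresh window
          windowStart = now;
          bytesInWindow = 0;
        }
        bytesInWindow += bytes;
        while (bytesInWindow > bytesPerSec) {
          Thread.sleep(5); // back off until the one-second window rolls
          now = System.nanoTime();
          if (now - windowStart >= 1_000_000_000L) {
            windowStart = now;
            bytesInWindow = bytes;
          }
        }
      }
    }

(A stream wrapper would call FileSystemThrottle.forScheme("hdfs", limit)
.acquire(n) around each read/write. The caveat in the second paragraph, that
framework copies sharing the JVM would throttle themselves too, applies to
this sketch as well.)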

Re: HDFS read/write data throttling

Posted by Andrew Wang <an...@cloudera.com>.
https://issues.apache.org/jira/browse/HDFS-5499


On Mon, Nov 18, 2013 at 10:46 AM, Jay Vyas <ja...@gmail.com> wrote:

> Where is the jira for this?
>
> Sent from my iPhone
>
> > On Nov 18, 2013, at 1:25 PM, Andrew Wang <an...@cloudera.com> wrote:
> >
> > Thanks for asking, here's a link:
> >
> > http://www.umbrant.com/papers/socc12-cake.pdf
> >
> > I don't think there's a recording of my talk unfortunately.
> >
> > I'll also copy my comments over to the JIRA, though I'd like to not
> > distract too much from what Lohit's trying to do.
> >
> >
> > On Wed, Nov 13, 2013 at 2:54 AM, Steve Loughran <stevel@hortonworks.com> wrote:
> >
> >> this is interesting -I've moved my comments over to the JIRA and it
> would
> >> be good for yours to go there too.
> >>
> >> is there a URL for your paper?
> >>
> >>
> >>> On 13 November 2013 06:27, Andrew Wang <an...@cloudera.com> wrote:
> >>>
> >>> Hey Steve,
> >>>
> >>> My research project (Cake, published at SoCC '12) was trying to provide
> >>> SLAs for mixed workloads of latency-sensitive and throughput-bound
> >>> applications, e.g. HBase running alongside MR. This was challenging
> >> because
> >>> seeks are a real killer. Basically, we had to strongly limit MR I/O to
> >> keep
> >>> worst-case seek latency down, and did so by putting schedulers on the
> RPC
> >>> queues in HBase and HDFS to restrict queuing in the OS and disk where
> we
> >>> lacked preemption.
> >>>
> >>> Regarding citations of note, most academics consider throughput-sharing
> >> to
> >>> be a solved problem. It's not dissimilar from normal time slicing: you
> >> try
> >>> to ensure fairness over some coarse timescale. I think cgroups [1] and
> >>> ioprio_set [2] essentially provide this.
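
(As a concrete illustration of the blkio controller cited as [1] below: cgroup
v1 is driven by writing plain files under its mount point. A rough Java
sketch, assuming blkio is mounted at /sys/fs/cgroup/blkio, the JVM has
privileges to write there, and the disk is block device 8:0; the group name
and pid are made up.)

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Cap a process's read bandwidth with the cgroup-v1 blkio controller.
    public class BlkioThrottle {
      public static void throttleReads(String group, long pid, long bytesPerSec)
          throws IOException {
        Path cg = Paths.get("/sys/fs/cgroup/blkio", group);
        Files.createDirectories(cg);
        // Rule format is "major:minor bytes_per_second" (see [1])
        Files.write(cg.resolve("blkio.throttle.read_bps_device"),
            ("8:0 " + bytesPerSec + "\n").getBytes());
        // Move the process into the group so the limit applies to it
        Files.write(cg.resolve("cgroup.procs"), (pid + "\n").getBytes());
      }

      public static void main(String[] args) throws IOException {
        throttleReads("hdfs-client", 12345L, 20L * 1024 * 1024); // ~20MB/s reads
      }
    }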
> >>>
> >>> Mixing throughput and latency though is difficult, and my conclusion is
> >>> that there isn't a really great solution for spinning disks besides
> >>> physical isolation. As we all know, you can get either IOPS or
> bandwidth,
> >>> but not both, and it's not a linear tradeoff between the two. If you're
> >>> interested in this though, I can dig up some related work from my Cake
> >>> paper.
> >>>
> >>> However, since it seems that we're more concerned with throughput-bound
> >>> apps, we might be okay just using cgroups and ioprio_set to do
> >>> time-slicing. I actually hacked up some code a while ago which passed a
> >>> client-provided priority byte to the DN, which used it to set the I/O
> >>> priority of the handling DataXceiver accordingly. This isn't the most
> >>> outlandish idea, since we've put QoS fields in our RPC protocol for
> >>> instance; this would just be another byte. Short-circuit reads are
> >> outside
> >>> this paradigm, but then you can use cgroup controls instead.
> >>>
> >>> My casual conversations with Googlers indicate that there isn't any
> >> special
> >>> Borg/Omega sauce either, just that they heavily prioritize DFS I/O over
> >>> non-DFS. Maybe that's another approach: if we can separate block
> >> management
> >>> in HDFS, MR tasks could just write their output to a raw HDFS block,
> thus
> >>> bringing a lot of I/O back into the fold of "datanode as I/O manager"
> >> for a
> >>> machine.
> >>>
> >>> Overall, I strongly agree with you that it's important to first define
> >> what
> >>> our goals are regarding I/O QoS. The general case is a tarpit, so it'd
> be
> >>> good to carve off useful things that can be done now (like Lohit's
> >>> direction of per-stream/FS throughput throttling with trusted clients)
> >> and
> >>> then carefully grow the scope as we find more usecases we can
> confidently
> >>> solve.
> >>>
> >>> Best,
> >>> Andrew
> >>>
> >>> [1] cgroups blkio controller
> >>> https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
> >>> [2] ioprio_set http://man7.org/linux/man-pages/man2/ioprio_set.2.html
> >>>
> >>>
> >>> On Tue, Nov 12, 2013 at 1:38 AM, Steve Loughran <stevel@hortonworks.com> wrote:
> >>>
> >>>> I've looked at it a bit within the context of YARN.
> >>>>
> >>>> YARN containers are where this would be ideal, as then you'd be able
> to
> >>>> request IO capacity as well as CPU and RAM. For that to work, the
> >>>> throttling would have to be outside the App, as you are trying to
> limit
> >>>> code whether or not it wants to be limited, and because you probably (*) want
> >> to
> >>>> give it more bandwidth if the system is otherwise idle.
> Self-throttling
> >>>> doesn't pick up spare IO.
> >>>>
> >>>>
> >>>>   1. you can use cgroups in YARN to throttle local disk IO through the
> >>>>   file:// URLs or the java filesystem APIs -such as for MR temp data
> >>>>   2. you can't c-group throttle HDFS per YARN container, which would
> >> be
> >>>>   the ideal use case for it. The IO is taking place in the DN, and
> >>> cgroups
> >>>>   only limits IO in the throttled process group.
> >>>>   3. implementing it in the DN would require a lot more complex code
> >>> there
> >>>>   to prioritise work based on block ID (sole identifier that goes
> >> around
> >>>>   everywhere) or input source (local sockets for HBase IO vs TCP
> >> stack)
> >>>>   4. Once you go to a heterogeneous filesystem you need to think about
> >> IO
> >>>>   load per storage layer as well as/alongside per-volume
> >>>>   5. There's also a generic RPC request throttle to prevent DoS against
> >>> the
> >>>>   NN and other HDFS services. That would need to be server side, but
> >>> once
> >>>>   implemented in the RPC code it would be universal.
> >>>>
> >>>> You also need to define what load you are trying to throttle:
> >> pure
> >>>> RPCs/second, read bandwidth, write bandwidth, seeks or IOPs. Once a
> >> file
> >>> is
> >>>> lined up for sequential reading, you'd almost want it to stream
> through
> >>> the
> >>>> next blocks until a high priority request came through, but operations
> >>> like
> >>>> a seek which would involve a disk head movement backwards would be
> >>>> something to throttle (hence you need to be storage type aware as SSD
> >>> seeks
> >>>> cost less).
> writes
> >>> is
> >>>> high, it's usually being done with the goal of preserving data -and
> you
> >>>> don't want to impact durability.
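
(To make "define the load" concrete, a hypothetical sketch, not a proposed
API, of the dimensions such a policy would have to budget separately:)

    // Hypothetical shape of an I/O throttling policy covering the load types
    // named above. Storage-type awareness matters because a seek costs far
    // less on SSD than on a spinning disk.
    enum StorageKind { HDD, SSD }

    interface IoThrottlePolicy {
      boolean admitRpc();                            // RPCs per second
      boolean admitRead(long bytes, StorageKind k);  // read bandwidth
      boolean admitWrite(long bytes, StorageKind k); // write bandwidth; prefer
                                                     // delaying to dropping so
                                                     // durability is unaffected
      boolean admitSeek(StorageKind k);              // seeks / IOPS
    }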
> >>>>
> >>>> (*) probably, because that's one of the issues that causes debates in
> >>> other
> >>>> datacentre platforms, such as Google Omega: do you want max cluster
> >>>> utilisation vs max determinism of workload.
> >>>>
> >>>> If someone were to do IOP throttling in the 3.x+ timeline,
> >>>>
> >>>>   1. It needs clear use cases, YARN containers being #1 for me
> >>>>   2. We'd have to look at all the research done on this in the past to
> >>> see
> >>>>   what works and what doesn't
> >>>>
> >>>> Andrew, what citations of relevance do you have?
> >>>>
> >>>> -steve
> >>>>
> >>>>
> >>>>> On 12 November 2013 04:24, lohit <lo...@gmail.com> wrote:
> >>>>>
> >>>>> 2013/11/11 Andrew Wang <an...@cloudera.com>
> >>>>>
> >>>>>> Hey Lohit,
> >>>>>>
> >>>>>> This is an interesting topic, and something I actually worked on in
> >>>> grad
> >>>>>> school before coming to Cloudera. It'd help if you could outline
> >> some
> >>>> of
> >>>>>> your usecases and how per-FileSystem throttling would help. For
> >> what
> >>> I
> >>>>> was
> >>>>>> doing, it made more sense to throttle on the DN side since you
> >> have a
> >>>>>> better view over all the I/O happening on the system, and you have
> >>>>>> knowledge of different volumes so you can set limits per-disk. This
> >>>> still
> >>>>>> isn't 100% reliable though since normally a portion of each disk is
> >>>> used
> >>>>>> for MR scratch space, which the DN doesn't have control over. I
> >> tried
> >>>>>> playing with thread I/O priorities here, but didn't see much
> >>>> improvement.
> >>>>>> Maybe the newer cgroups stuff can help out.
> >>>>>
> >>>>> Thanks. Yes, we also thought about having something on DataNode. This
> >>>> would
> >>>>> also mean one could easily throttle clients who access from outside
> >> the
> >>>>> cluster, for example distcp or hftp copies. Clients need not worry
> >>> about
> >>>>> throttle configs and each cluster can control how much
> >> throughput
> >>>> can
> >>>>> be achieved. We do want to have something like this.
> >>>>>
> >>>>>>
> >>>>>> I'm sure per-FileSystem throttling will have some benefits (and
> >>>> probably
> >>>>> be
> >>>>>> easier than some DN-side implementation) but again, it'd help to
> >>> better
> >>>>>> understand the problem you are trying to solve.
> >>>>>
> >>>>> One idea was flexibility for clients to override and have a value they
> >> can
> >>>>> set. For a trusted cluster we could allow clients to go beyond the
> >> default
> >>>>> value for some usecases. Alternatively we also thought about having
> >>>> default
> >>>>> value and max value, where clients could change the default but not go
> >>> beyond
> >>>>> the max. Another problem with DN-side config is having different
> >> values
> >>>> for
> >>>>> different clients and easily changing those for selective clients.
> >>>>>
> >>>>> As Haosong also suggested, we could wrap
> >>>>> FSDataOutputStream/FSDataInputStream streams with ThrottledInputStream.
> >>>>> But we might have to be careful of any code which uses FileSystem APIs
> >>>>> and accidentally throttles itself (like reducer copies, distributed
> >>>>> cache and such...)
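
(A sketch of that wrapping, using the hadoop-distcp ThrottledInputStream that
Haosong mentions below and assuming its (stream, bytes-per-second)
constructor; the 20MB/s figure and path are illustrative only.)

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.util.ThrottledInputStream;

    public class ThrottledRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Wrap the raw HDFS stream; reads block as needed to stay under the cap
        InputStream in = new ThrottledInputStream(
            fs.open(new Path("/path/to/file.txt")),
            20L * 1024 * 1024); // ~20MB/s
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
          // consume data
        }
        in.close();
      }
    }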
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Best,
> >>>>>> Andrew
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Nov 11, 2013 at 6:16 PM, Haosong Huang <haosdent@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi, lohit. There is a class named ThrottledInputStream
> >>>>>>> <http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java>
> >>>>>>> in hadoop-distcp; you could check it out and find more details.
> >>>>>>>
> >>>>>>> In addition to this, I am working on trying to achieve resource
> >>>>>>> control (including CPU, network, and disk IO) in the JVM. But my
> >>>>>>> implementation depends on cgroups, which only run on Linux. I will
> >>>>>>> push my library (java-cgroup) to GitHub in the next several months.
> >>>>>>> If you are interested in it, please give me any advice and help me
> >>>>>>> improve it. :-)
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Nov 12, 2013 at 3:47 AM, lohit <lohit.vijayarenu@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Adam,
> >>>>>>>>
> >>>>>>>> Thanks for the reply. The changes I was referring to were in the
> >>>>>>>> FileSystem.java layer, which should not affect HDFS
> >>>>>>>> Replication/NameNode operations. To give a better idea, this would
> >>>>>>>> affect clients something like this:
> >>>>>>>>
> >>>>>>>> Configuration conf = new Configuration();
> >>>>>>>> conf.setInt("read.bandwidth.mbpersec", 20); // 20MB/s
> >>>>>>>> FileSystem fs = FileSystem.get(conf);
> >>>>>>>>
> >>>>>>>> FSDataInputStream fis = fs.open(new Path("/path/to/file.txt"));
> >>>>>>>> fis.read(); // <-- This would be max of 20MB/s
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2013/11/11 Adam Muise <am...@hortonworks.com>
> >>>>>>>>
> >>>>>>>>> See https://issues.apache.org/jira/browse/HDFS-3475
> >>>>>>>>>
> >>>>>>>>> Please note that this has met with many unexpected impacts on
> >>>>>> workload.
> >>>>>>>> Be
> >>>>>>>>> careful and be mindful of your Datanode memory and network
> >>>>> capacity.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Have a Nice Day!
> >>>>>>>> Lohit
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best Regards,
> >>>>>>> Haosdent Huang
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Have a Nice Day!
> >>>>> Lohit
> >>>>

> > > > > > > would
> > > > > > > > mean anyone with handle to HDFS/Hftp will be throttled
> globally
> > > > > within
> > > > > > > JVM.
> > > > > > > > Right value to come up for this would be based on type of
> > > hardware
> > > > we
> > > > > > use
> > > > > > > > and how many tasks/clients we allow.
> > > > > > > >
> > > > > > > > On the other hand doing something like this at FileSystem
> layer
> > > > would
> > > > > > > mean
> > > > > > > > many other tasks such as Job jar copy, DistributedCache copy
> > and
> > > > any
> > > > > > > hidden
> > > > > > > > data movement would also be throttled. We wanted to know if
> > > anyone
> > > > > has
> > > > > > > had
> > > > > > > > such requirement on their clusters in the past and what was
> the
> > > > > > thinking
> > > > > > > > around it. Appreciate your inputs/comments
> > > > > > > >
> > > > > > > > --
> > > > > > > > Have a Nice Day!
> > > > > > > > Lohit
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >    * Adam Muise *       Solutions Engineer
> > > > > > > ------------------------------
> > > > > > >
> > > > > > >     Phone:        416-417-4037
> > > > > > >   Email:      amuise@hortonworks.com
> > > > > > >   Website:   http://www.hortonworks.com/
> > > > > > >
> > > > > > >       * Follow Us: *
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://facebook.com/hortonworks/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature
> > > > > > > >
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://twitter.com/hortonworks?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature
> > > > > > > >
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.linkedin.com/company/hortonworks?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature
> > > > > > > >
> > > > > > >
> > > > > > >  [image: photo]
> > > > > > >
> > > > > > >   Latest From Our Blog:  How to use R and other non-Java
> > languages
> > > in
> > > > > > > MapReduce and Hive
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/?utm_source=WiseStamp&utm_medium=email&utm_term=&utm_content=&utm_campaign=signature
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > CONFIDENTIALITY NOTICE
> > > > > > > NOTICE: This message is intended for the use of the individual
> or
> > > > > entity
> > > > > > to
> > > > > > > which it is addressed and may contain information that is
> > > > confidential,
> > > > > > > privileged and exempt from disclosure under applicable law. If
> > the
> > > > > reader
> > > > > > > of this message is not the intended recipient, you are hereby
> > > > notified
> > > > > > that
> > > > > > > any printing, copying, dissemination, distribution, disclosure
> or
> > > > > > > forwarding of this communication is strictly prohibited. If you
> > > have
> > > > > > > received this communication in error, please contact the sender
> > > > > > immediately
> > > > > > > and delete it from your system. Thank You.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Have a Nice Day!
> > > > > > Lohit
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > Haosdent Huang
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Have a Nice Day!
> > > Lohit
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: HDFS read/write data throttling

Posted by Andrew Wang <an...@cloudera.com>.
Hey Steve,

My research project (Cake, published at SoCC '12) was trying to provide
SLAs for mixed workloads of latency-sensitive and throughput-bound
applications, e.g. HBase running alongside MR. This was challenging because
seeks are a real killer. Basically, we had to strongly limit MR I/O to keep
worst-case seek latency down, and did so by putting schedulers on the RPC
queues in HBase and HDFS to restrict queuing in the OS and disk where we
lacked preemption.

Regarding citations of note, most academics consider throughput-sharing to
be a solved problem. It's not dissimilar from normal time slicing: you try
to ensure fairness over some coarse timescale. I think cgroups [1] and
ioprio_set [2] essentially provide this.

Mixing throughput and latency though is difficult, and my conclusion is
that there isn't a really great solution for spinning disks besides
physical isolation. As we all know, you can get either IOPS or bandwidth,
but not both, and it's not a linear tradeoff between the two. If you're
interested in this though, I can dig up some related work from my Cake
paper.

However, since it seems that we're more concerned with throughput-bound
apps, we might be okay just using cgroups and ioprio_set to do
time-slicing. I actually hacked up some code a while ago which passed a
client-provided priority byte to the DN, which used it to set the I/O
priority of the handling DataXceiver accordingly. This isn't the most
outlandish idea, since we've put QoS fields in our RPC protocol for
instance; this would just be another byte. Short-circuit reads are outside
this paradigm, but then you can use cgroup controls instead.
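
As a concrete (and heavily hedged) illustration of that hack, here is a
minimal sketch of setting the handling thread's best-effort I/O priority
from a client-supplied byte via the Linux ioprio_set syscall. This is not
the actual patch: the syscall number is the x86_64 one, JNA is assumed on
the classpath, and the request parsing is omitted.

import com.sun.jna.Library;
import com.sun.jna.Native;

public class HandlerIoPriority {
    // Minimal libc binding via JNA; ioprio_set has no glibc wrapper,
    // so we go through syscall(2).
    private interface LibC extends Library {
        LibC INSTANCE = Native.load("c", LibC.class);
        int syscall(int number, Object... args);
    }

    private static final int SYS_IOPRIO_SET = 251;    // x86_64 syscall number
    private static final int IOPRIO_WHO_PROCESS = 1;  // target one thread/process
    private static final int IOPRIO_CLASS_BE = 2;     // best-effort scheduling class
    private static final int IOPRIO_CLASS_SHIFT = 13;

    // Map a client-provided priority byte (0 = highest ... 7 = lowest) onto
    // the calling thread, e.g. the DataXceiver serving that client.
    public static void apply(byte priorityByte) {
        int level = Math.min(Math.max(priorityByte, 0), 7);
        int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
        // who-id 0 targets the calling thread
        int rc = LibC.INSTANCE.syscall(SYS_IOPRIO_SET, IOPRIO_WHO_PROCESS, 0, ioprio);
        if (rc != 0) {
            throw new RuntimeException("ioprio_set failed, errno=" + Native.getLastError());
        }
    }
}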

My casual conversations with Googlers indicate that there isn't any special
Borg/Omega sauce either, just that they heavily prioritize DFS I/O over
non-DFS. Maybe that's another approach: if we can separate block management
in HDFS, MR tasks could just write their output to a raw HDFS block, thus
bringing a lot of I/O back into the fold of "datanode as I/O manager" for a
machine.

Overall, I strongly agree with you that it's important to first define what
our goals are regarding I/O QoS. The general case is a tarpit, so it'd be
good to carve off useful things that can be done now (like Lohit's
direction of per-stream/FS throughput throttling with trusted clients) and
then carefully grow the scope as we find more usecases we can confidently
solve.

Best,
Andrew

[1] cgroups blkio controller
https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
[2] ioprio_set http://man7.org/linux/man-pages/man2/ioprio_set.2.html



Re: HDFS read/write data throttling

Posted by Steve Loughran <st...@hortonworks.com>.
I've looked at it a bit within the context of YARN.

YARN containers are where this would be ideal, as then you'd be able to
request IO capacity as well as CPU and RAM. For that to work, the
throttling would have to be outside the app, as you are trying to limit
code whether or not it wants to be limited, and because you probably (*)
want to give it more bandwidth if the system is otherwise idle.
Self-throttling doesn't pick up spare IO.


   1. you can use cgroups in YARN to throttle local disk IO through the
   file:// URLs or the Java filesystem APIs -such as for MR temp data (see
   the sketch after this list)
   2. you can't cgroup-throttle HDFS per YARN container, which would be
   the ideal use case for it. The IO is taking place in the DN, and cgroups
   only limit IO in the throttled process group.
   3. implementing it in the DN would require a lot more complex code there
   to prioritise work based on block ID (the sole identifier that goes
   around everywhere) or input source (local sockets for HBase IO vs the
   TCP stack)
   4. Once you go to a heterogeneous filesystem you need to think about IO
   load per storage tier as well as per-volume
   5. There's also a generic RPC request throttle to prevent DoS against
   the NN and other HDFS services. That would need to be server side, but
   once implemented in the RPC code it would be universal.
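
To make point 1 concrete, here is a minimal sketch of what such a throttle
looks like at the cgroup v1 blkio level. The mount point, the "yarn" group
name and the device numbers are illustrative assumptions, not YARN's actual
layout:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlkioThrottle {
    // cgroup v1 expects "major:minor bytesPerSec" lines in
    // blkio.throttle.read_bps_device.
    public static void capReadBps(String containerId, int major, int minor,
                                  long bytesPerSec) throws IOException {
        Path group = Paths.get("/sys/fs/cgroup/blkio/yarn", containerId);
        Files.createDirectories(group);
        String rule = major + ":" + minor + " " + bytesPerSec + "\n";
        Files.write(group.resolve("blkio.throttle.read_bps_device"),
                    rule.getBytes(StandardCharsets.UTF_8));
        // The cap applies to the PIDs later written into the group's
        // cgroup.procs; per point 2, it covers only their direct file:// IO,
        // since HDFS reads still happen inside the DN's own process group.
    }
}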

You also need to define what load you are trying to throttle: pure
RPCs/second, read bandwidth, write bandwidth, seeks or IOPS. Once a file is
lined up for sequential reading, you'd almost want it to stream through the
next blocks until a high-priority request came through, but operations like
a seek which would involve a backwards disk head movement would be
something to throttle (hence you need to be storage-type aware, as SSD
seeks cost less). You also need to consider that although the cost of
writes is high, it's usually being done with the goal of preserving data
-and you don't want to impact durability.

(*) probably, because that's one of the issues that causes debate in other
datacentre platforms, such as Google Omega: do you want max cluster
utilisation vs max determinism of workload?

If someone were to do IOP throttling in the 3.x+ timeline,

   1. It needs clear use cases, YARN containers being #1 for me
   2. We'd have to look at all the research done on this in the past to see
   what works and what doesn't

Andrew, what citations of relevance do you have?

-steve



Re: HDFS read/write data throttling

Posted by lohit <lo...@gmail.com>.
2013/11/11 Andrew Wang <an...@cloudera.com>

> Hey Lohit,
>
> This is an interesting topic, and something I actually worked on in grad
> school before coming to Cloudera. It'd help if you could outline some of
> your usecases and how per-FileSystem throttling would help. For what I was
> doing, it made more sense to throttle on the DN side since you have a
> better view over all the I/O happening on the system, and you have
> knowledge of different volumes so you can set limits per-disk. This still
> isn't 100% reliable though since normally a portion of each disk is used
> for MR scratch space, which the DN doesn't have control over. I tried
> playing with thread I/O priorities here, but didn't see much improvement.
> Maybe the newer cgroups stuff can help out.
>

Thanks. Yes, we also thought about having something on the DataNode. That
would also mean one could easily throttle clients who access the cluster
from outside, for example distcp or hftp copies. Clients need not worry
about throttle configs, and each cluster can control how much throughput
can be achieved. We do want to have something like this.

>
> I'm sure per-FileSystem throttling will have some benefits (and probably be
> easier than some DN-side implementation) but again, it'd help to better
> understand the problem you are trying to solve.
>

One idea was flexibility for clients to override and set a value of their
own. On a trusted cluster we could allow clients to go beyond the default
value for some usecases. Alternatively, we also thought about having a
default value and a max value, where clients could change the default but
not go beyond the max. Another problem with a DN-side config is having
different values for different clients and easily changing those for
selective clients.
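
To make the default-plus-max idea concrete, here is a minimal sketch of the
client-side clamping, with hypothetical config keys (neither key exists in
Hadoop today):

import org.apache.hadoop.conf.Configuration;

public class ThrottlePolicy {
    // Hypothetical keys, for illustration only.
    static final String REQUESTED = "client.read.bandwidth.mbpersec";
    static final String CLUSTER_MAX = "cluster.read.bandwidth.max.mbpersec";

    // Honour the client's requested rate, but never beyond the cluster max.
    static int effectiveMbPerSec(Configuration conf) {
        int requested = conf.getInt(REQUESTED, 20); // cluster default, e.g. 20MB/s
        int max = conf.getInt(CLUSTER_MAX, 100);    // hard cap, e.g. 100MB/s
        return Math.min(requested, max);
    }
}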

As Haosong also suggested, we could wrap FSDataOutputStream/FSDataInputStream
with a ThrottledInputStream. But we might have to be careful about any code
which uses FileSystem APIs accidentally throttling itself (like reducer
copy, distributed cache and such...); a sketch of the wrapping follows.
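
For what the wrapping could look like, here is a minimal sketch of a
rate-limiting stream (not the distcp class itself) that sleeps whenever the
observed rate runs ahead of the cap:

import java.io.IOException;
import java.io.InputStream;

// Minimal sketch: caps read bandwidth by sleeping when ahead of budget.
public class RateLimitedInputStream extends InputStream {
    private final InputStream in;
    private final long maxBytesPerSec;
    private final long startMs = System.currentTimeMillis();
    private long bytesRead = 0;

    public RateLimitedInputStream(InputStream in, long maxBytesPerSec) {
        this.in = in;
        this.maxBytesPerSec = maxBytesPerSec;
    }

    // Sleep until the average rate since open drops below the cap.
    private void throttle() throws IOException {
        long elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        while (bytesRead * 1000 / elapsedMs > maxBytesPerSec) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while throttling", e);
            }
            elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        }
    }

    @Override
    public int read() throws IOException {
        throttle();
        int b = in.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        throttle();
        int n = in.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

An FSDataInputStream is an InputStream, so this wraps HDFS reads directly;
the accidental-self-throttling risk above is exactly that framework code
opening files through the same FileSystem would get wrapped too.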



> Best,
> Andrew



-- 
Have a Nice Day!
Lohit

Re: HDFS read/write data throttling

Posted by Andrew Wang <an...@cloudera.com>.
Hey Lohit,

This is an interesting topic, and something I actually worked on in grad
school before coming to Cloudera. It'd help if you could outline some of
your usecases and how per-FileSystem throttling would help. For what I was
doing, it made more sense to throttle on the DN side since you have a
better view over all the I/O happening on the system, and you have
knowledge of different volumes so you can set limits per-disk. This still
isn't 100% reliable though since normally a portion of each disk is used
for MR scratch space, which the DN doesn't have control over. I tried
playing with thread I/O priorities here, but didn't see much improvement.
Maybe the newer cgroups stuff can help out.

I'm sure per-FileSystem throttling will have some benefits (and probably be
easier than some DN-side implementation) but again, it'd help to better
understand the problem you are trying to solve.

Best,
Andrew



Re: HDFS read/write data throttling

Posted by Haosong Huang <ha...@gmail.com>.
Hi, lohit. There is a class named
ThrottledInputStream<http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java>
in hadoop-distcp; you could check it out and find more details.
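
For illustration, wrapping an HDFS stream with it could look like the
sketch below (assuming the hadoop-distcp jar on the classpath and the
two-argument (InputStream, long maxBytesPerSec) constructor shown in the
linked source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.util.ThrottledInputStream;

public class ThrottledReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Cap this one stream at 20MB/s; other streams are unaffected.
        ThrottledInputStream in = new ThrottledInputStream(
                fs.open(new Path("/path/to/file.txt")), 20L * 1024 * 1024);
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // consume data at no more than roughly 20MB/s
        }
        in.close();
    }
}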

In addition to this, I am working on achieving resource control (including
CPU, network and disk IO) in the JVM. But my implementation depends on
cgroups, which only run on Linux. I will push my library (java-cgroup) to
GitHub in the next several months. If you are interested in it, please give
me any advice and help me improve it. :-)





-- 
Best Regards,
Haosdent Huang

Re: HDFS read/write data throttling

Posted by lohit <lo...@gmail.com>.
Hi Adam,

Thanks for the reply. The changes I was referring to were in the
FileSystem.java layer, which should not affect HDFS replication/NameNode
operations. To give a better idea, this would affect clients something like
this:

Configuration conf = new Configuration();
conf.setInt("read.bandwidth.mbpersec", 20); // 20MB/s
FileSystem fs = FileSystem.get(conf);

FSDataInputStream fis = fs.open(new Path("/path/to/file.txt"));
fis.read(); // <-- reads on this stream would be capped at 20MB/s







-- 
Have a Nice Day!
Lohit

Re: HDFS read/write data throttling

Posted by Adam Muise <am...@hortonworks.com>.
See https://issues.apache.org/jira/browse/HDFS-3475

Please note that this has had many unexpected impacts on workloads. Be
careful, and be mindful of your DataNode memory and network capacity.







-- 
   Adam Muise        Solutions Engineer
   Phone:   416-417-4037
   Email:   amuise@hortonworks.com
   Website: http://www.hortonworks.com/
