
Posted to hdfs-dev@hadoop.apache.org by Konstantin Shvachko <sh...@gmail.com> on 2018/12/13 18:09:50 UTC

[Result] [VOTE] Merge HDFS-12943 branch to trunk - Consistent Reads from Standby

This vote failed due to Daryn Sharp's veto.
The concern is being addressed by HDFS-13873. I will start a new vote once
this is committed.

Note for Daryn. Your non-responsive handling of the veto sets a bad
precedent and is a bad example of communication on the lists from a
respected member of this community. Please check your availability for
follow-up discussions if you choose to get involved with important decisions.

On Fri, Dec 7, 2018 at 4:10 PM Konstantin Shvachko <sh...@gmail.com>
wrote:

> Hi Daryn,
>
> Wanted to back up Chen's earlier response to your concerns about rotating
> calls in the call queue.
> Our design
> 1. directly targets the livelock problem by rejecting calls on the
> Observer that are not likely to be answered in a timely manner: HDFS-13873.
> 2. The call queue rotation is only done on Observers, and never on the
> active NN, so it stays free of attacks like the one you suggest.
>
> If this is a satisfactory mitigation for the problem, could you please
> reconsider your -1, so that people can continue voting on this thread?
>
> Thanks,
> --Konst
>
> On Thu, Dec 6, 2018 at 10:38 AM Daryn Sharp <da...@oath.com> wrote:
>
>> -1 pending additional info.  After a cursory scan, I have serious
>> concerns regarding the design.  This seems like a feature that should have
>> been purely implemented in hdfs w/o touching the common IPC layer.
>>
>> The biggest issue is the alignment context.  Its purpose appears to be
>> to allow handlers to reinsert calls back into the call queue.  That's
>> completely unacceptable.  A buggy or malicious client can easily cause
>> livelock in the IPC layer with handlers only looping on calls that never
>> satisfy the condition.  Why is this not implemented via RetriableExceptions?
>>
>> On Thu, Dec 6, 2018 at 1:24 AM Yongjun Zhang <yz...@cloudera.com.invalid>
>> wrote:
>>
>>> Great work guys.
>>>
>>> Wonder if we can elaborate on the impact of not having #2 fixed, and why
>>> #2 is not needed for the feature to be complete?
>>> 2. Need to fix automatic failover with ZKFC. Currently it doesn't
>>> know about ObserverNodes, trying to convert them to SBNs.
>>>
>>> Thanks.
>>> --Yongjun
>>>
>>>
>>> On Wed, Dec 5, 2018 at 5:27 PM Konstantin Shvachko <shv.hadoop@gmail.com> wrote:
>>>
>>> > Hi Hadoop developers,
>>> >
>>> > I would like to propose merging the feature branch HDFS-12943 for
>>> > Consistent Reads from Standby Node to trunk. The feature is intended to
>>> > scale read RPC workloads. On large clusters reads comprise 95% of all
>>> > RPCs to the NameNode. We should be able to accommodate higher overall
>>> > RPC workloads (up to 4x by some estimates) by adding multiple
>>> > ObserverNodes.
>>> >
>>> > The main functionality has been implemented; see sub-tasks of
>>> > HDFS-12943.
>>> > We followed up with the test plan. Testing was done on two independent
>>> > clusters (see HDFS-14058 and HDFS-14059) with security enabled.
>>> > We ran standard HDFS commands, MR jobs, and admin commands including
>>> > manual failover.
>>> > We know of one cluster running this feature in production.
>>> >
>>> > There are a few outstanding issues:
>>> > 1. Need to provide proper documentation - a user guide for the new
>>> > feature
>>> > 2. Need to fix automatic failover with ZKFC. Currently it doesn't
>>> > know about ObserverNodes, trying to convert them to SBNs.
>>> > 3. Scale testing and performance fine-tuning
>>> > 4. As testing progresses, we continue fixing non-critical bugs like
>>> > HDFS-14116.
>>> >
>>> > I attached a unified patch to the umbrella jira for the review and
>>> > Jenkins build.
>>> > Please vote on this thread. The vote will run for 7 days until Wed Dec
>>> > 12.
>>> >
>>> > Thanks,
>>> > --Konstantin
>>> >
>>>
>>
>>
>> --
>>
>> Daryn
>>
>

Re: [Result] [VOTE] Merge HDFS-12943 branch to trunk - Consistent Reads from Standby

Posted by Vinod Kumar Vavilapalli <vi...@apache.org>.

Agree, it isn't productive this way.

I can't seem to find it, but was there a DISCUSS thread for this branch-merge? I usually recommend addressing issues on a DISCUSS thread instead of fighting things over a VOTE.

+Vinod


---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org
