Posted to user@flink.apache.org by Robert Metzger <rm...@apache.org> on 2017/04/12 15:35:30 UTC

Re: Flink job on secure Yarn fails after many hours

Niels, are you still facing this issue?

As far as I understood it, the security changes in Flink 1.2.0 use a new
Kerberos mechanism that allows infinite token renewal.

On Thu, Mar 17, 2016 at 7:30 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Niels,
>
> Thanks for the feedback. As far as I know, Hadoop deliberately
> defaults to a maximum delegation token lifetime of one week. Have
> you tried increasing the maximum token lifetime, or was that not an
> option?
>
> Why do you use a while loop? Would it be possible to use the
> Yarn failover mechanism which starts a new ApplicationMaster and
> resubmits the job?
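>
> For reference, a minimal sketch of the YARN side of such a failover
> setup, assuming default settings otherwise. The value 4 is an
> arbitrary example. As far as I know, Flink's own
> yarn.application-attempts setting in flink-conf.yaml is capped by
> this value. Whether a restarted ApplicationMaster obtains fresh
> tokens is a separate question that this sketch does not address.
>
> ```xml
> <!-- yarn-site.xml: upper bound on ApplicationMaster attempts per application -->
> <!-- The value 4 is an illustrative assumption, not a recommendation. -->
> <property>
>   <name>yarn.resourcemanager.am.max-attempts</name>
>   <value>4</value>
> </property>
> ```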
>
> Thanks,
> Max
>
>
> On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > Hi,
> >
> > In my environment doing the "proxy" thing didn't work.
> > With a token expiry of 168 hours (1 week) the job consistently terminates
> > at exactly (within a margin of 10 seconds) 173.5 hours.
> > So far we have not been able to solve this problem.
> >
> > Our teams now simply assume the thing fails once in a while and have an
> > automatic restart feature (i.e. shell script with a while true loop).
> > The best guess at a root cause is this
> > https://issues.apache.org/jira/browse/HDFS-9276
> >
> > If you have a real solution or a reference to a related bug report to
> > this problem then please share!
> >
> > Niels Basjes
> >
> >
> >
> > On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
> > <th...@ericsson.com> wrote:
> >>
> >> Hi Max,
> >>
> >> I will try these workarounds.
> >> Thanks
> >>
> >> Thomas
> >>
> >> ________________________________________
> >> From: Maximilian Michels [mxm@apache.org]
> >> Sent: Tuesday, March 15, 2016 16:51
> >> To: user@flink.apache.org
> >> Cc: Niels Basjes
> >> Subject: Re: Flink job on secure Yarn fails after many hours
> >>
> >> Hi Thomas,
> >>
> >> Niels (CC) and I found out that you need at least Hadoop version 2.6.1
> >> to properly run Kerberos applications on Hadoop clusters. Earlier
> >> versions have critical bugs in the internal security token handling
> >> that may expire a token although it is still valid.
> >>
> >> That said, Hadoop has another limitation: the maximum internal token
> >> lifetime is one week. To work around this limit, you have two
> >> options:
> >>
> >> a) increase the maximum token lifetime
> >>
> >> In yarn-site.xml:
> >>
> >> <property>
> >>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
> >>   <value>9223372036854775807</value>
> >> </property>
> >>
> >> In hdfs-site.xml:
> >>
> >> <property>
> >>   <name>dfs.namenode.delegation.token.max-lifetime</name>
> >>   <value>9223372036854775807</value>
> >> </property>
> >>
> >>
> >> b) set up the Yarn ResourceManager as a proxy for the HDFS Namenode:
> >>
> >> From
> >> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
> >>
> >> "You can work around this by configuring the ResourceManager as a
> >> proxy user for the corresponding HDFS NameNode so that the
> >> ResourceManager can request new tokens when the existing ones are past
> >> their maximum lifetime."
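> >>
> >> A minimal sketch of what option b) might look like, assuming the
> >> ResourceManager runs as the user "yarn" and that the wildcard
> >> values are acceptable in your environment (both are assumptions;
> >> restrict hosts and groups where possible). Please check the linked
> >> page for the exact steps for your distribution.
> >>
> >> ```xml
> >> <!-- yarn-site.xml: allow the ResourceManager to request new tokens as a proxy user -->
> >> <property>
> >>   <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
> >>   <value>true</value>
> >> </property>
> >>
> >> <!-- core-site.xml on the NameNode: let the "yarn" user (assumed name) act as a proxy user -->
> >> <property>
> >>   <name>hadoop.proxyuser.yarn.hosts</name>
> >>   <value>*</value>
> >> </property>
> >> <property>
> >>   <name>hadoop.proxyuser.yarn.groups</name>
> >>   <value>*</value>
> >> </property>
> >> ```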
> >>
> >> @Niels: Could you comment on what worked best for you?
> >>
> >> Best,
> >> Max
> >>
> >>
> >> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
> >> <th...@ericsson.com> wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> >
> >> >
> >> > We are now facing the same problem in our Flink applications, launched
> >> > using YARN.
> >> >
> >> > I just want to know whether there is any update on this exception.
> >> >
> >> >
> >> >
> >> > Thanks
> >> >
> >> >
> >> >
> >> > Thomas
> >> >
> >> >
> >> >
> >> > ________________________________
> >> >
> >> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes
> >> > [Niels@basjes.nl]
> >> > Sent: Friday, December 4, 2015 10:40
> >> > To: user@flink.apache.org
> >> > Subject: Re: Flink job on secure Yarn fails after many hours
> >> >
> >> > Hi Maximilian,
> >> >
> >> > I just downloaded the version from your Google Drive and used that to
> >> > run my test topology that accesses HBase.
> >> > I deliberately started it twice to double the chance to run into this
> >> > situation.
> >> >
> >> > I'll keep you posted.
> >> >
> >> > Niels
> >> >
> >> >
> >> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org>
> >> > wrote:
> >> >>
> >> >> Hi Niels,
> >> >>
> >> >> Just got back from our CI. The build above would fail with a
> >> >> Checkstyle error. I corrected that. Also I have built the binaries for
> >> >> your Hadoop version 2.6.0.
> >> >>
> >> >> Binaries:
> >> >>
> >> >>
> >> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
> >> >>
> >> >> Thanks,
> >> >> Max
> >> >>
> >> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
> >> >> >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
> >> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, stopping process...
> >> >> >>>> >> >> > 21:30:28,286 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> >> >> >>>> >> >> > - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
> >> >> >>>> >> >> >
> >> >> >>>> >> >> >
> >> >> >>>> >> >> > --
> >> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
> >> >> >>>> >> >> >
> >> >> >>>> >> >> > Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.

Hi,

No, this issue is now gone for us.
The fix in 1.2.0 ensures that we are now able to run jobs on our cluster
beyond the 7-day limit.

Niels



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes