Posted to dev@flink.apache.org by 王 杰 <ja...@Outlook.com> on 2021/01/25 02:32:32 UTC

Re: [DISCUSS] Support obtaining Hive delegation tokens when submitting application to Yarn

Hi Till,

Sorry for the late response; I did some investigation into Spark. Spark adopts the SPI approach to obtain delegation tokens for different components. It has a HadoopDelegationTokenManager.scala<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala> that manages all Hadoop delegation tokens, including obtaining and renewing them.

When the HadoopDelegationTokenManager initializes, it uses ServiceLoader to load all HadoopDelegationTokenProvider implementations contributed by the different connectors. For Hive, the provider implementation is HiveDelegationTokenProvider, which lives in Spark's Hive module rather than in core.
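
To make this concrete, here is a rough Java sketch of what a similar SPI could look like on the Flink side. The interface and class names below are made up for illustration (they are neither Flink's nor Spark's actual API); the point is only how ServiceLoader lets connector modules contribute providers without the core submission code referencing any Hive classes:

import java.util.ServiceLoader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;

/**
 * Hypothetical SPI: implemented by connector modules (e.g. the Hive connector)
 * and registered under META-INF/services, so core code never depends on Hive.
 */
interface DelegationTokenProvider {
    /** Short service name, e.g. "hive" or "hbase". */
    String serviceName();

    /** Whether tokens are needed for this deployment (security enabled, service configured). */
    boolean tokensRequired(Configuration hadoopConf);

    /** Obtain delegation tokens and add them to the given credentials. */
    void obtainTokens(Configuration hadoopConf, Credentials credentials) throws Exception;
}

/** Hypothetical manager: called on the client before submitting to YARN. */
final class DelegationTokenManager {
    static Credentials obtainAllTokens(Configuration hadoopConf) throws Exception {
        Credentials credentials = new Credentials();
        // ServiceLoader discovers every provider implementation on the classpath.
        for (DelegationTokenProvider provider : ServiceLoader.load(DelegationTokenProvider.class)) {
            if (provider.tokensRequired(hadoopConf)) {
                provider.obtainTokens(hadoopConf, credentials);
            }
        }
        return credentials;
    }
}

A Hive provider would then ship inside the Hive connector jar together with a META-INF/services entry naming the implementation class, and the HiveConf handling would stay inside the connector.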

Thanks,
Jie


On 2021/01/13 08:51:29, Till Rohrmann <tr...@apache.org> wrote:
> Hi Jie Wang,
>
> thanks for starting this discussion. To me the SPI approach sounds better
> because it is not as brittle as using reflection. Concerning the
> configuration, we could think about introducing some Hive-specific
> configuration options which allow us to specify these paths. How are other
> projects that integrate with Hive solving this problem?
>
> Cheers,
> Till
>
> On Tue, Jan 12, 2021 at 4:13 PM 王 杰 <ja...@outlook.com> wrote:
>
> > Hi everyone,
> >
> > Currently, the Hive delegation token is not obtained when Flink submits an
> > application in YARN mode using the kinit approach. The ticket is
> > https://issues.apache.org/jira/browse/FLINK-20714. I'd like to start a
> > discussion about how to support this feature.
> >
> > I see two options:
> > 1. Use reflection to construct a Hive client and obtain the token, just as
> > the org.apache.flink.yarn.Utils.obtainTokenForHBase implementation does.
> > 2. Introduce a pluggable delegation token provider via SPI. The provider
> > could be placed in connector-related code, so reflection is not needed and
> > the approach is more extensible.
> >
> >
> >
> > Both options have to handle how to specify the HiveConf to use. In the Hive
> > connector, users can specify both hiveConfDir and hadoopConfDir when
> > creating a HiveCatalog, and the hadoopConfDir may not be the same as the
> > Hadoop configuration used in HadoopModule.
> >
> > Looking forward to your suggestions.
> >
> > --
> > Best regards!
> > Jie Wang
> >
> >
>

Re: [DISCUSS] Support obtaining Hive delegation tokens when submitting application to Yarn

Posted by Jack W <ja...@gmail.com>.
Hi Rui,

I agree with you that we can implement pluggable DT providers first; I have created a new ticket to track it: https://issues.apache.org/jira/browse/FLINK-21232.

Spark's HadoopDelegationTokenManager can run on both the client side and the driver (application master) side. On the client side, it is used to obtain tokens when users authenticate with a keytab or `kinit` (credential cache); on the driver side, it is used to obtain and renew DTs. Some background helps to explain this. Currently, Flink distributes the keytab to the JobManager and TaskManagers, and the Kerberos credentials are renewed from the keytab on each of them. Spark takes a different approach: it ships the keytab only to the driver, and the driver uses this keytab to renew all delegation tokens periodically and then distributes the renewed tokens to the executors. In this way, Spark reduces the load on the KDC. You can refer to this doc for details: https://docs.google.com/document/d/10V7LiNlUJKeKZ58mkR7oVv1t6BrC6TZi3FGf2Dm6-i8/edit
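
For illustration only, a minimal Java sketch of that driver-side renewal loop (class and helper names are made up; Spark's real logic lives in HadoopDelegationTokenManager.scala): the driver logs in from the keytab, periodically obtains fresh tokens, and pushes them to the workers, which never see the keytab itself.

import java.security.PrivilegedExceptionAction;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

/** Hypothetical sketch of a driver-side token renewer; not Spark's or Flink's actual code. */
final class KeytabTokenRenewer {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(String principal, String keytabPath, long renewIntervalMillis) throws Exception {
        // Only the driver holds the keytab; executors/TaskManagers never receive it.
        UserGroupInformation ugi =
                UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath);

        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Re-login if the Kerberos TGT is close to expiry, then mint fresh tokens.
                ugi.checkTGTAndReloginFromKeytab();
                Credentials freshTokens =
                        ugi.doAs((PrivilegedExceptionAction<Credentials>) this::obtainNewTokens);
                distributeToWorkers(freshTokens);
            } catch (Exception e) {
                // In a real implementation: log the failure and retry on the next tick.
            }
        }, 0, renewIntervalMillis, TimeUnit.MILLISECONDS);
    }

    private Credentials obtainNewTokens() {
        // Placeholder: ask each connector's delegation token provider for new tokens.
        return new Credentials();
    }

    private void distributeToWorkers(Credentials credentials) {
        // Placeholder: serialize the tokens and send them to executors/TaskManagers over RPC.
    }
}

The design point is the one described above: KDC load stays low because only one process authenticates with the keytab, while the workers just receive short-lived tokens.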

Thanks,
Jie

On 2021/01/27 03:33:37, Rui Li <li...@gmail.com> wrote: 
> Hi Jie,
> 
> Thanks for the investigation. I think we can first implement pluggable DT
> providers, and add renewal capabilities incrementally. I'm also curious where
> Spark runs its HadoopDelegationTokenManager when renewal is enabled. Since
> HadoopDelegationTokenManager seems to need access to the keytab to create new
> tokens, does that mean it can only run on the client side?
> 
> On Mon, Jan 25, 2021 at 10:32 AM 王 杰 <ja...@outlook.com> wrote:
> 
> > Hi Till,
> >
> > Sorry for the late response; I did some investigation into Spark. Spark
> > adopts the SPI approach to obtain delegation tokens for different components.
> > It has a HadoopDelegationTokenManager.scala<
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala>
> > that manages all Hadoop delegation tokens, including obtaining and renewing
> > them.
> >
> > When the HadoopDelegationTokenManager initializes, it uses ServiceLoader to
> > load all HadoopDelegationTokenProvider implementations contributed by the
> > different connectors. For Hive, the provider implementation is
> > HiveDelegationTokenProvider, which lives in Spark's Hive module rather than
> > in core.
> >
> > Thanks,
> > Jie
> >
> >
> > On 2021/01/13 08:51:29, Till Rohrmann <trohrmann@apache.org> wrote:
> > > Hi Jie Wang,
> > >
> > > thanks for starting this discussion. To me the SPI approach sounds better
> > > because it is not as brittle as using reflection. Concerning the
> > > configuration, we could think about introducing some Hive-specific
> > > configuration options which allow us to specify these paths. How are other
> > > projects that integrate with Hive solving this problem?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Jan 12, 2021 at 4:13 PM 王 杰 <jackwangcs@outlook.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Currently, the Hive delegation token is not obtained when Flink submits
> > > > an application in YARN mode using the kinit approach. The ticket is
> > > > https://issues.apache.org/jira/browse/FLINK-20714. I'd like to start a
> > > > discussion about how to support this feature.
> > > >
> > > > I see two options:
> > > > 1. Use reflection to construct a Hive client and obtain the token, just
> > > > as the org.apache.flink.yarn.Utils.obtainTokenForHBase implementation
> > > > does.
> > > > 2. Introduce a pluggable delegation token provider via SPI. The provider
> > > > could be placed in connector-related code, so reflection is not needed
> > > > and the approach is more extensible.
> > > >
> > > >
> > > >
> > > > Both options have to handle how to specify the HiveConf to use. In the
> > > > Hive connector, users can specify both hiveConfDir and hadoopConfDir when
> > > > creating a HiveCatalog, and the hadoopConfDir may not be the same as the
> > > > Hadoop configuration used in HadoopModule.
> > > >
> > > > Looking forward to your suggestions.
> > > >
> > > > --
> > > > Best regards!
> > > > Jie Wang
> > > >
> > > >
> > >
> >
> 
> 
> -- 
> Best regards!
> Rui Li
> 

Re: [DISCUSS] Support obtaining Hive delegation tokens when submitting application to Yarn

Posted by Rui Li <li...@gmail.com>.
Hi Jie,

Thanks for the investigation. I think we can first implement pluggable DT
providers, and add renewal capabilities incrementally. I'm also curious where
Spark runs its HadoopDelegationTokenManager when renewal is enabled. Since
HadoopDelegationTokenManager seems to need access to the keytab to create new
tokens, does that mean it can only run on the client side?

On Mon, Jan 25, 2021 at 10:32 AM 王 杰 <ja...@outlook.com> wrote:

> Hi Till,
>
> Sorry for the late response; I did some investigation into Spark. Spark
> adopts the SPI approach to obtain delegation tokens for different components.
> It has a HadoopDelegationTokenManager.scala<
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala>
> that manages all Hadoop delegation tokens, including obtaining and renewing
> them.
>
> When the HadoopDelegationTokenManager initializes, it uses ServiceLoader to
> load all HadoopDelegationTokenProvider implementations contributed by the
> different connectors. For Hive, the provider implementation is
> HiveDelegationTokenProvider, which lives in Spark's Hive module rather than
> in core.
>
> Thanks,
> Jie
>
>
> On 2021/01/13 08:51:29, Till Rohrmann <trohrmann@apache.org> wrote:
> > Hi Jie Wang,
> >
> > thanks for starting this discussion. To me the SPI approach sounds better
> > because it is not as brittle as using reflection. Concerning the
> > configuration, we could think about introducing some Hive-specific
> > configuration options which allow us to specify these paths. How are other
> > projects that integrate with Hive solving this problem?
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 12, 2021 at 4:13 PM 王 杰 <jackwangcs@outlook.com> wrote:
> >
> > > Hi everyone,
> > >
> > > Currently, the Hive delegation token is not obtained when Flink submits
> > > an application in YARN mode using the kinit approach. The ticket is
> > > https://issues.apache.org/jira/browse/FLINK-20714. I'd like to start a
> > > discussion about how to support this feature.
> > >
> > > I see two options:
> > > 1. Use reflection to construct a Hive client and obtain the token, just
> > > as the org.apache.flink.yarn.Utils.obtainTokenForHBase implementation
> > > does.
> > > 2. Introduce a pluggable delegation token provider via SPI. The provider
> > > could be placed in connector-related code, so reflection is not needed
> > > and the approach is more extensible.
> > >
> > >
> > >
> > > Both options have to handle how to specify the HiveConf to use. In the
> > > Hive connector, users can specify both hiveConfDir and hadoopConfDir when
> > > creating a HiveCatalog, and the hadoopConfDir may not be the same as the
> > > Hadoop configuration used in HadoopModule.
> > >
> > > Looking forward to your suggestions.
> > >
> > > --
> > > Best regards!
> > > Jie Wang
> > >
> > >
> >
>


-- 
Best regards!
Rui Li