Posted to user@accumulo.apache.org by James Srinivasan <ja...@gmail.com> on 2017/05/11 19:14:30 UTC

Fwd: ClientConfiguration using Kerberos & MapReduce

Hi,

I am attempting to add Kerberos support to GeoMesa's [1] MapReduce
code, running on Accumulo. The current code calls:

ConfiguratorBase.setZooKeeperInstance(classOf[AccumuloInputFormat],
conf, instance, zookeepers)

where conf is the generic Hadoop configuration read from XML files,
and instance & zookeepers are the Accumulo instance name and Zookeeper
address(es) respectively.

This in turn calls:

org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setZooKeeperInstance(implementingClass,
conf, new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers));

Which creates an Accumulo ClientConfiguration with the default
settings, specifying the Accumulo instance name and the Zookeeper
address(es). This is fine for non-Kerberized usage, but with Kerberos
I need to ensure withSasl is added to the ClientConfiguration,
otherwise it will try to connect without SASL and annoyingly hang.
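
For reference, the client configuration I'm after looks something like
this (just a sketch; instance and zookeepers as above, with
withSasl(true) being the crucial extra call for Kerberos):

  val clientConf = new ClientConfiguration()
    .withInstance(instance)
    .withZkHosts(zookeepers)
    .withSasl(true)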

ConfiguratorBase has no other overloads of setZooKeeperInstance, so I
don't see how this would ever work with Kerberos. It is marked as
deprecated, which points me to AccumuloInputFormat, but I'm a little
confused as to how this API relates to ConfiguratorBase.

Any help is much appreciated,

James



[1] https://github.com/locationtech/geomesa

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
>> Brilliant - that's it! Now my custom InputFormat is working.
> If you're interested and have the ability to cut/paste your code, updating
> the Accumulo examples[1] with more of the nitty-gritty on using
> Kerberos+MapReduce would be great! You definitely have the hard-stuff done
> :)
>
> [1] https://accumulo.apache.org/1.8/examples/

I would...but it's Scala. The most interesting bit is here:

https://github.com/jrs53/geomesa/blob/17a3da56a041f5dde5d61d57fc3ca92dff0a1dc0/geomesa-accumulo/geomesa-accumulo-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/GeoMesaAccumuloInputFormat.scala#L176-L215

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.
On 6/8/17 4:10 PM, James Srinivasan wrote:
> [snip]
>> https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L485-L500
>>
>> This pulls the "DelegationTokenStub" out of the InputFormat and creates a
>> real Accumulo AuthenticationToken (which you can use with a Connector
>> per-usual).
> 
> Brilliant - that's it! Now my custom InputFormat is working.
> 
> Thanks ever so much for your help - it is hugely appreciated!
> 
> James
> 

Fantastic. Glad to hear it!

If you're interested and have the ability to cut/paste your code, 
updating the Accumulo examples[1] with more of the nitty-gritty on using 
Kerberos+MapReduce would be great! You definitely have the hard-stuff 
done :)

[1] https://accumulo.apache.org/1.8/examples/

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
[snip]
> https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L485-L500
>
> This pulls the "DelegationTokenStub" out of the InputFormat and creates a
> real Accumulo AuthenticationToken (which you can use with a Connector
> per-usual).

Brilliant - that's it! Now my custom InputFormat is working.

Thanks ever so much for your help - it is hugely appreciated!

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.
On 6/7/17 3:54 PM, James Srinivasan wrote:
> [snip]
>>> Fortunately I found this:
>>>
>>> https://github.com/apache/hive/blob/master/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/mr/HiveAccumuloTableInputFormat.java
>>>
>>> Is it a good example of Accumulo + MapReduce that I can copy?
>> That one is definitely over-kill. There's a bit of reflection in there to
>> work around older versions of Accumulo. However, it should be an example of
>> something that does work with Kerberos authentication.
>> Also, take note that Hive uses the InputFormat regardless of the execution
>> engine (local, MapReduce, Tez, etc). There are some comments to that effect
>> in the code. You can likely simplify those methods/blocks as well :)
> 
> I think those are two things I'll need to handle at some point anyway.
> I think I'm setting all the AccumuloInputFormat statics correctly, and
> I see the DelegationToken in my job's and context's credentials.
> However, my custom InputFormat's createRecordReader function needs to
> connect to Accumulo to get some config. Am I right in thinking I need
> to convert the Hadoop-wrapped token (kind=ACCUMULO_AUTH_TOKEN) into an
> Accumulo DelegationToken to create my connector? If so, how do I do
> that?

Yes, you need to deserialize the AuthenticationToken from the 
InputSplit. You can look back into the AccumuloInputFormat 
implementation to see how this is done:

https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/client/mapreduce/AbstractInputFormat.java#L515-L518

calls

https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/client/mapreduce/AbstractInputFormat.java#L240-L243

calls

https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L485-L500

This pulls the "DelegationTokenStub" out of the InputFormat and creates 
a real Accumulo AuthenticationToken (which you can use with a Connector 
per-usual).
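
In Scala, the unwrap step in your createRecordReader would look roughly
like this (a sketch using the impl-package classes linked above;
"instance" and "principal" stand in for however you already build your
Connector):

  val wrapped = ConfiguratorBase.getAuthenticationToken(
    classOf[AccumuloInputFormat], context.getConfiguration)
  // swaps the DelegationTokenStub for the real token held in the
  // job's credentials
  val token = ConfiguratorBase.unwrapAuthenticationToken(context, wrapped)
  val connector = instance.getConnector(principal, token)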

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
[snip]
>> Fortunately I found this:
>>
>> https://github.com/apache/hive/blob/master/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/mr/HiveAccumuloTableInputFormat.java
>>
>> Is it a good example of Accumulo + MapReduce that I can copy?
> That one is definitely over-kill. There's a bit of reflection in there to
> work around older versions of Accumulo. However, it should be an example of
> something that does work with Kerberos authentication.
> Also, take note that Hive uses the InputFormat regardless of the execution
> engine (local, MapReduce, Tez, etc). There are some comments to that effect
> in the code. You can likely simplify those methods/blocks as well :)

I think those are two things I'll need to handle at some point anyway.
I think I'm setting all the AccumuloInputFormat statics correctly, and
I see the DelegationToken in my job's and context's credentials.
However, my custom InputFormat's createRecordReader function needs to
connect to Accumulo to get some config. Am I right in thinking I need
to convert the Hadoop-wrapped token (kind=ACCUMULO_AUTH_TOKEN) into an
Accumulo DelegationToken to create my connector? If so, how do I do
that?

Thanks very much,

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.

On 5/28/17 12:13 PM, James Srinivasan wrote:
> [snip]
>>> I can't call AccumuloInputFormat.setConnectorInfo again since it has
>>> already been called, and I presume adding the serialised token to the
>>> Configuration would be insecure?
>> Yeah, the configuration can't protect sensitive information. MapReduce/YARN
>> has special handling to make sure those tokens serialized in the Job's
>> credentials are only readable by you (the job submitter).
>>
>> The thing I don't entirely follow is how you've gotten into this situation
>> to begin with. The adding of the delegation tokens to the Job's credentials
>> should be done by Accumulo's MR code on your behalf (just like it's
>> obtaining the delegation token, it would automatically add it to the job for
>> ya).
>>
>> Any chance you can provide an end-to-end example? I am also pretty
>> Spark-ignorant -- so maybe I just don't understand what is possible and what
>> isn't..
> 
> Hmm, after further investigation concentrating on just MapReduce (and
> not Spark), it seems the GeoMesaAccumuloInputFormat class might need
> more significant work than the simple s/PasswordToken/KerberosToken
> substitution I got away with previously. For example, sending an
> Accumulo password in the Hadoop conf probably isn't ideal either.
> 
> Fortunately I found this:
> 
> https://github.com/apache/hive/blob/master/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/mr/HiveAccumuloTableInputFormat.java
> 
> Is it a good example of Accumulo + MapReduce that I can copy?
> 
> Thanks,
> 
> James
> 

That one is definitely over-kill. There's a bit of reflection in there 
to work around older versions of Accumulo. However, it should be an 
example of something that does work with Kerberos authentication.

Also, take note that Hive uses the InputFormat regardless of the 
execution engine (local, MapReduce, Tez, etc). There are some comments 
to that effect in the code. You can likely simplify those methods/blocks 
as well :)

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
[snip]
>> I can't call AccumuloInputFormat.setConnectorInfo again since it has
>> already been called, and I presume adding the serialised token to the
>> Configuration would be insecure?
> Yeah, the configuration can't protect sensitive information. MapReduce/YARN
> has special handling to make sure those tokens serialized in the Job's
> credentials are only readable by you (the job submitter).
>
> The thing I don't entirely follow is how you've gotten into this situation
> to begin with. The adding of the delegation tokens to the Job's credentials
> should be done by Accumulo's MR code on your behalf (just like it's
> obtaining the delegation token, it would automatically add it to the job for
> ya).
>
> Any chance you can provide an end-to-end example? I am also pretty
> Spark-ignorant -- so maybe I just don't understand what is possible and what
> isn't..

Hmm, after further investigation concentrating on just MapReduce (and
not Spark), it seems the GeoMesaAccumuloInputFormat class might need
more significant work than the simple s/PasswordToken/KerberosToken
substitution I got away with previously. For example, sending an
Accumulo password in the Hadoop conf probably isn't ideal either.

Fortunately I found this:

https://github.com/apache/hive/blob/master/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/mr/HiveAccumuloTableInputFormat.java

Is it a good example of Accumulo + MapReduce that I can copy?

Thanks,

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.

James Srinivasan wrote:
>>> Delegation tokens are serialized into the Job's "credentials" section and
>>> distributed securely that way.
>> Ah, that's my problem. Will probably have to update the GeoMesa code
>> to wok with Jobs rather than Configurations, so that the Credentials
>> aren't lost.
>
> Hmm, not so easy it seems. My callstack which triggers the exception
> when the credentials are missing from the Job is this:
>
> java.lang.NullPointerException
>    at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.unwrapAuthenticationToken(ConfiguratorBase.java:493)
>    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:390)
>    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:668)
>    at org.locationtech.geomesa.jobs.mapreduce.GeoMesaAccumuloInputFormat.getSplits(GeoMesaAccumuloInputFormat.scala:174)
>    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
> ...
>
> Now org.apache.spark.rdd.NewHadoopRDD.getPartitions does this:
>
>    val jobContext = new JobContextImpl(_conf, jobId)
>
> So it doesn't seem to support supplying Jobs (and hence tokens), just Configurations.
>
> I can't call AccumuloInputFormat.setConnectorInfo again since it has
> already been called, and I presume adding the serialised token to the
> Configuration would be insecure?

Yeah, the configuration can't protect sensitive information. 
MapReduce/YARN has special handling to make sure those tokens serialized 
in the Job's credentials are only readable by you (the job submitter).

The thing I don't entirely follow is how you've gotten into this 
situation to begin with. The adding of the delegation tokens to the 
Job's credentials should be done by Accumulo's MR code on your behalf 
(just like it's obtaining the delegation token, it would automatically 
add it to the job for ya).

Any chance you can provide an end-to-end example? I am also pretty 
Spark-ignorant -- so maybe I just don't understand what is possible and 
what isn't..

> Yours in puzzlement,
>
> James


Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
>> Delegation tokens are serialized into the Job's "credentials" section and
>> distributed securely that way.
> Ah, that's my problem. Will probably have to update the GeoMesa code
>> to work with Jobs rather than Configurations, so that the Credentials
> aren't lost.

Hmm, not so easy it seems. My callstack which triggers the exception
when the credentials are missing from the Job is this:

java.lang.NullPointerException
  at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.unwrapAuthenticationToken(ConfiguratorBase.java:493)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:390)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:668)
  at org.locationtech.geomesa.jobs.mapreduce.GeoMesaAccumuloInputFormat.getSplits(GeoMesaAccumuloInputFormat.scala:174)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
...

Now org.apache.spark.rdd.NewHadoopRDD.getPartitions does this:

  val jobContext = new JobContextImpl(_conf, jobId)

So it doesn't seem to support supplying Jobs (and hence tokens), just Configurations.

I can't call AccumuloInputFormat.setConnectorInfo again since it has
already been called, and I presume adding the serialised token to the
Configuration would be insecure?

Yours in puzzlement,

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
> Delegation tokens are serialized into the Job's "credentials" section and
> distributed securely that way.

Ah, that's my problem. Will probably have to update the GeoMesa code
to work with Jobs rather than Configurations, so that the Credentials
aren't lost.

Thanks!

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.
James Srinivasan wrote:
> However, I seem to get this when trying to use the DelegationToken:
>
> scala>  rdd.count()
> 17/05/19 21:30:55  INFO UserGroupInformation: Login successful for user
> accumulo-wink@VBOX.LOCAL  using keytab file
> /tmp/accumulo.headless.keytab
> java.lang.NullPointerException
>    at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.unwrapAuthenticationToken(ConfiguratorBase.java:493)
>    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:390)
>    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:668)
>    at org.locationtech.geomesa.jobs.mapreduce.GeoMesaAccumuloInputFormat.getSplits(GeoMesaAccumuloInputFormat.scala:174)
>    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
>
> Looking over the code, I can't see an obvious reason it would be null
> on those lines. Any help is much appreciated!

Delegation tokens are serialized into the Job's "credentials" section 
and distributed securely that way.

When your job needs to construct its input splits, it first needs to
pull the delegation token out of the Job. For whatever reason, the
serialized DelegationToken we expected to pull out of the Job's
credentials is invalid/malformed.

Perhaps in your copying of the Configuration, you're blowing away 
something? I'm not sure.
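
One quick sanity check is to dump what's actually in the Job's
credentials before getSplits runs, e.g. (a sketch):

  import scala.collection.JavaConverters._
  job.getCredentials.getAllTokens.asScala.foreach { t =>
    println(s"kind=${t.getKind} service=${t.getService}")
  }

You should see a token of kind ACCUMULO_AUTH_TOKEN in there.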

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
Hi Josh, thanks for the help!

> To your original question: you'd want to look at the method,
>
> `AccumuloInputFormat.setConnectorInfo(Job, String, AuthenticationToken)`

I found the functions to call pretty quickly; it was just how to
actually call them that was puzzling me, since my existing code uses
Configurations. I've settled on creating a Job from my Configuration,
invoking the new API calls, then overwriting my prior Configuration
with the new values from my Job, since the Job class doesn't modify its
incoming Configuration in-place (contrary to my initial assumption).
This makes me feel slightly icky...
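
Concretely, the workaround looks roughly like this (a sketch;
kerberosToken is the KerberosToken you mention):

  import scala.collection.JavaConverters._

  val job = new Job(conf)  // Job wraps a *copy* of conf
  AbstractInputFormat.setConnectorInfo(job, principal, kerberosToken)
  // copy the updated entries back into my original Configuration
  job.getConfiguration.asScala.foreach(e => conf.set(e.getKey, e.getValue))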

> (the implementation is actually on AbstractInputFormat if you're curious..)

Yup, and I call it on AbstractInputFormat directly, since Scala can't
invoke inherited Java statics via the subclass

> You would construct a KerberosToken via normal methods (Instance +
> ClientConfiguration) and pass that to this method. When you do this, the
> implementation automatically fetches delegation tokens for you (tl;dr on
> delegation tokens: short-lived password sufficient to identify you that
> prevents us from having to distribute your Kerberos credentials across the
> cluster).

Yup, that part seems to work fine:

scala> val rdd = spatialRDDProvider.rdd(new Configuration, sc, params, q)
17/05/19 21:30:49 INFO UserGroupInformation: Login successful for user
accumulo-wink@VBOX.LOCAL using keytab file
/tmp/accumulo.headless.keytab
17/05/19 21:30:49 INFO UserGroupInformation: Login successful for user
accumulo-wink@VBOX.LOCAL using keytab file
/tmp/accumulo.headless.keytab
17/05/19 21:30:50 INFO ENGINE: dataFileCache open start
17/05/19 21:30:51 INFO AccumuloInputFormat: Received KerberosToken,
attempting to fetch DelegationToken
17/05/19 21:30:52 INFO MemoryStore: Block broadcast_0 stored as values
in memory (estimated size 384.8 KB, free 365.9 MB)
17/05/19 21:30:52 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 28.1 KB, free 365.9 MB)
17/05/19 21:30:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory on 192.168.85.100:39803 (size: 28.1 KB, free: 366.3 MB)
17/05/19 21:30:52 INFO SparkContext: Created broadcast 0 from
newAPIHadoopRDD at AccumuloSpatialRDDProvider.scala:130
17/05/19 21:30:52 INFO GeoMesaSparkKryoRegistratorEndpoint$:
kryo-schema rpc endpoint registered on driver 192.168.85.100:35861
rdd: org.locationtech.geomesa.spark.SpatialRDD = SpatialRDD[2] at RDD
at GeoMesaSpark.scala:58

However, I seem to get this when trying to use the DelegationToken:

scala> rdd.count()
17/05/19 21:30:55 INFO UserGroupInformation: Login successful for user
accumulo-wink@VBOX.LOCAL using keytab file
/tmp/accumulo.headless.keytab
java.lang.NullPointerException
  at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.unwrapAuthenticationToken(ConfiguratorBase.java:493)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:390)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:668)
  at org.locationtech.geomesa.jobs.mapreduce.GeoMesaAccumuloInputFormat.getSplits(GeoMesaAccumuloInputFormat.scala:174)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)

Looking over the code, I can't see an obvious reason it would be null
on those lines. Any help is much appreciated!

James

Re: ClientConfiguration using Kerberos & MapReduce

Posted by Josh Elser <jo...@gmail.com>.
James Srinivasan wrote:
>> [snip]
>> ConfiguratorBase has no other overloads of setZooKeeperInstance, so I
>> don't see how this would ever work with Kerberos. It is marked as
>> deprecated, which points me to AccumuloInputFormat, but I'm a little
>> confused as to how this API relates to ConfiguratorBase.
>
> So I'm now comparing (for example):
>
> ConfiguratorBase.setConnectorInfo(classOf[AccumuloInputFormat], conf,
> username, password)
> (old API)
>
> with
>
> AbstractInputFormat.setConnectorInfo(new Job(conf), username, password)
> (new API)
>
> The difference is that the old API operates directly on the
> Configuration and updates it in-place, whereas my way of calling the
> new API seems to create a copy of the Configuration and leave the
> original untouched. This leaves me with a problem of having to merge
> the old and new Configurations - surely there must be a better way?
>
> Thanks,
>
> James

(whoops, forgot to respond to your first message)

You've definitely stumbled onto a very painful corner of our public API. 
We got screwed (essentially) by introducing a bunch of code that wasn't 
really meant to be public API (stable) while trying to consolidate our 
implementation between the Hadoop mapred and mapreduce API calls and the 
InputFormat/OutputFormat for each. Anyways!

To your original question: you'd want to look at the method,

`AccumuloInputFormat.setConnectorInfo(Job, String, AuthenticationToken)` 
(the implementation is actually on AbstractInputFormat if you're curious..)

You would construct a KerberosToken via normal methods (Instance + 
ClientConfiguration) and pass that to this method. When you do this, the 
implementation automatically fetches delegation tokens for you (tl;dr on 
delegation tokens: short-lived password sufficient to identify you that 
prevents us from having to distribute your Kerberos credentials across 
the cluster).
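
In Scala that looks something like this (a sketch; the principal is
illustrative, and note the static method has to be called on
AbstractInputFormat, where it is defined):

  val token = new KerberosToken()  // uses the current Kerberos login
  val job = new Job(conf)
  AbstractInputFormat.setConnectorInfo(job, "user@EXAMPLE.COM", token)
  // the InputFormat trades the KerberosToken for a DelegationToken
  // and stores it in the Job's credentials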

Fair warning: you'll need to make sure you grant your user the
permission to obtain delegation tokens (System.OBTAIN_DELEGATION_TOKEN),
otherwise you'll get a permission error from the Master when the
Input/OutputFormat asks for one on your behalf.
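
Granting that permission in the Accumulo shell looks something like
this (a sketch, with an illustrative principal):

  root@instance> grant System.OBTAIN_DELEGATION_TOKEN -s -u user@EXAMPLE.COM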

Re: ClientConfiguration using Kerberos & MapReduce

Posted by James Srinivasan <ja...@gmail.com>.
> [snip]
> ConfiguratorBase has no other overloads of setZooKeeperInstance, so I
> don't see how this would ever work with Kerberos. It is marked as
> deprecated, which points me to AccumuloInputFormat, but I'm a little
> confused as to how this API relates to ConfiguratorBase.

So I'm now comparing (for example):

ConfiguratorBase.setConnectorInfo(classOf[AccumuloInputFormat], conf,
username, password)
(old API)

with

AbstractInputFormat.setConnectorInfo(new Job(conf), username, password)
(new API)

The difference is that the old API operates directly on the
Configuration and updates it in-place, whereas my way of calling the
new API seems to create a copy of the Configuration and leave the
original untouched. This leaves me with a problem of having to merge
the old and new Configurations - surely there must be a better way?

Thanks,

James