Posted to dev@airflow.apache.org by Bolke de Bruin <bd...@gmail.com> on 2018/08/02 17:51:14 UTC

Re: Kerberos and Airflow

Hi Dan,

I discussed this a little bit with one of the security architects here. We
think that you can strike a fair trade-off between security and usability by
having a kind of manifest accompany the DAG you are submitting. This manifest
can then specify what the generated tasks/dags are allowed to do and what
metadata to provide to them. We could also let the scheduler generate hashes
per generated DAG / task and verify those against an established version
(1st run?). This limits the attack vector.
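
To make this concrete, a minimal sketch of what the manifest plus the
first-run hash registration could look like (the manifest keys, the hashed
fields, and the function names are illustrative assumptions, not existing
Airflow code):

    import hashlib
    import json

    # Hypothetical manifest shipped alongside the DAG file.
    MANIFEST = {
        "dag_id": "my_dag",
        "allowed_conn_ids": ["hive_default", "my_keytab_conn"],
        "hash_fields": ["task_id", "operator", "conn_id"],
    }

    def task_fingerprint(task_fields):
        # Stable hash over the fields the manifest declares significant.
        payload = json.dumps(
            {f: task_fields.get(f) for f in MANIFEST["hash_fields"]},
            sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def verify_task(task_fields, known_hashes):
        # First run establishes the version; later runs must match it.
        digest = task_fingerprint(task_fields)
        known = known_hashes.setdefault(task_fields["task_id"], digest)
        if known != digest:
            raise RuntimeError("task changed since registration")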

A DagSerializer would be great, but I think it solves a different issue and the above 
is somewhat simpler to implement?

Bolke

> On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> 
> *Let’s say we trust the owner field of the DAGs I think we could do the
> following.*
> *Obviously, the trusting the user part is key here. It is one of the
> reasons I was suggesting using “airflow submit” to update / add dags in
> Airflow*
> 
> 
> *This is the hard part about my question.*
> I think in a true multi-tenant environment we wouldn't be able to trust the
> user, otherwise we wouldn't necessarily even need a mapping of Airflow DAG
> users to secrets, because if we trust users to set the correct Airflow user
> for DAGs, we are basically trusting them with all of the creds the Airflow
> scheduler can access for all users anyway.
> 
> I actually had the same thought as your "airflow submit" a while ago, which
> I discussed with Alex, basically creating an API for adding DAGs instead of
> having the Scheduler parse them. FWIW I think it's superior to the git time
> machine approach because it's a more generic form of "serialization" and is
> more correct as well because the same DAG file parsed on a given git SHA
> can produce different DAGs. Let me know what you think, and maybe I can
> start a more formal design doc if you are onboard:
> 
> A user or service with an auth token sends an "airflow submit" request to a
> new kind of Dag Serialization service, along with the serialized DAG
> objects generated by parsing on the client. It's important that these
> serialized objects are declarative and not e.g. pickles, so that the
> scheduler/workers can consume them and reproducibility of the DAGs is
> guaranteed. The service will then store each generated DAG along with its
> access based on the provided token, e.g. using Ranger, and the
> scheduler/workers will use the stored DAGs for scheduling/execution.
> Operators would be deployed along with the Airflow code separately from the
> serialized DAGs.
> 
> A serialized DAG would look something like this (basically Luigi-style :)):
> MyTask - BashOperator: {
>  cmd: "sleep 1"
>  user: "Foo"
>  access: "token1", "token2"
> }
> 
> MyDAG: {
>  MyTask1 >> SomeOtherTask1
>  MyTask2 >> SomeOtherTask1
> }
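> 
> As a sketch, the same example as plain JSON (field names illustrative); being
> declarative, with no pickled callables, is what lets the scheduler and
> workers rebuild exactly the same DAG:
> 
>     import json
> 
>     serialized_dag = {
>         "dag_id": "MyDAG",
>         "tasks": {
>             "MyTask1": {"operator": "BashOperator",
>                         "args": {"bash_command": "sleep 1"},
>                         "user": "Foo", "access": ["token1", "token2"]},
>             "MyTask2": {"operator": "BashOperator",
>                         "args": {"bash_command": "sleep 1"},
>                         "user": "Foo", "access": ["token1"]},
>             "SomeOtherTask1": {"operator": "BashOperator",
>                                "args": {"bash_command": "echo done"},
>                                "user": "Foo", "access": []},
>         },
>         "edges": [["MyTask1", "SomeOtherTask1"],
>                   ["MyTask2", "SomeOtherTask1"]],
>     }
>     wire_payload = json.dumps(serialized_dag, sort_keys=True)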
> 
> Dynamic DAGs in this case would just consist of a service calling "Airflow
> Submit" that does it's own form of authentication to get access to some
> kind of tokens (or basically just forwarding the secrets the users of the
> dynamic DAG submit).
> 
> For the default Airflow implementation you can maybe just have the Dag
> Serialization server bundled with the Scheduler, with auth turned off, and
> have it periodically update the Dag Serialization store, which would emulate
> the current behavior closely.
> 
> Pros:
> 1. Consistency across running task instances in a dagrun/scheduler,
> reproducibility and auditability of DAGs
> 2. Users can control when to deploy their DAGs
> 3. Scheduler runs much faster since it doesn't have to run python files and
> e.g. make network calls
> 4. Scaling the scheduler becomes easier because a different service can be
> responsible for parsing DAGs and can be trivially scaled horizontally
> (clients are doing the parsing)
> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
> easier? e.g. can use the Scheduler itself to schedule backfills with a
> slightly modified serialized version of a DAG.
> 
> Cons:
> 1. Have to deprecate a lot of popular features, e.g. allowing custom
> callbacks in operators (e.g. on_failure), and jinja_templates
> 2. Version compatibility problems, e.g. user/service client might be
> serializing arguments for hooks/operators that have been deprecated in
> newer versions of the hooks, or the serialized DAG schema changes and old
> DAGs aren't automatically updated. Might want to have some kind of
> versioning system for serialized DAGs to at least ensure that stored DAGs
> are valid when the Scheduler/Worker/etc are upgraded, maybe something
> similar to thrift/protobuf versioning.
> 3. Additional complexity - additional service, logic on workers/scheduler
> to fetch/cache serialized DAGs efficiently, expiring/archiving old DAG
> definitions, etc
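> 
> For con 2, a sketch of a versioned envelope (the schema and names are
> assumptions, similar in spirit to thrift/protobuf versioning):
> 
>     SCHEMA_VERSION = 3
>     # One migration per version bump; identity functions for brevity.
>     MIGRATIONS = [lambda p: p, lambda p: p, lambda p: p]
> 
>     def load_serialized_dag(envelope):
>         version = envelope.get("schema_version", 1)
>         if version > SCHEMA_VERSION:
>             raise ValueError("DAG was serialized with a newer schema")
>         payload = envelope["payload"]
>         for migrate in MIGRATIONS[version:SCHEMA_VERSION]:
>             payload = migrate(payload)  # upgrade one version at a time
>         return payload
> 
>     dag = load_serialized_dag({"schema_version": 2,
>                                "payload": {"dag_id": "MyDAG"}})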
> 
> 
> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
>> Ah gotcha. That’s another issue actually (but related).
>> 
>> Let’s say we trust the owner field of the DAGs; I think we could do the
>> following. We then have a table (and interface) to tell Airflow what users
>> have access to what connections. The scheduler can then check whether the
>> task in the dag can access the conn_id it is asking for. Auto-generated dags
>> still have an owner (or should) and therefore should be fine. Some
>> integrity checking could/should be added, as we want to be sure that the
>> task we schedule is the task we launch. So a signature calculated at the
>> scheduler (or part of the DAG), sent as part of the metadata and checked by
>> the executor, is probably smart.
>> 
>> You can also make this more fancy by integrating with something like
>> Apache Ranger that allows for policy checking.
>> 
>> Obviously, the trusting the user part is key here. It is one of the
>> reasons I was suggesting using “airflow submit” to update / add dags in
>> Airflow. We could enforce authentication on the DAG. It was kind of ruled
>> out in favor of git time machines although these never happened afaik ;-).
>> 
>> BTW: I have updated my implementation with protobuf. Metadata is now
>> available at executor and task.
>> 
>> 
>>> On 29 Jul 2018, at 15:47, Dan Davydov <dd...@twitter.com.INVALID> wrote:
>>> 
>>> The concern is how to secure secrets on the scheduler such that only
>>> certain DAGs can access them, and in the case of files that create DAGs
>>> dynamically, only some set of DAGs should be able to access these secrets.
>>> 
>>> e.g. if there is a secret/keytab that can be read by DAG A generated by
>>> file X, and file X generates DAG B as well, there needs to be a scheme to
>>> stop the parsing of DAG B on the scheduler from being able to read the
>>> secret in DAG A.
>>> 
>>> Does that make sense?
>>> 
>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>>> I’m not sure what you mean. The example I created allows for dynamic DAGs,
>>>> as the scheduler obviously knows about the tasks when they are ready to be
>>>> scheduled. This isn’t any different for a static DAG or a dynamic one.
>>>> 
>>>> For Kerberos it isn’t that special. Basically a keytab is the user’s
>>>> revocable credentials in a special format. The keytab itself can be
>>>> protected by a password. So I can imagine that a connection is defined that
>>>> sets a keytab location and a password to access the keytab. The scheduler
>>>> understands this (or maybe the Connection model) and serializes and sends
>>>> it to the worker as part of the metadata. The worker then reconstructs the
>>>> keytab and issues a kinit or supplies it to the other service requiring it
>>>> (e.g. Spark).
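>>>> 
>>>> A sketch of the worker side (the connection fields and flow are
>>>> assumptions, not existing Airflow code):
>>>> 
>>>>     import base64
>>>>     import subprocess
>>>>     import tempfile
>>>> 
>>>>     # Hypothetical connection metadata as sent by the scheduler.
>>>>     conn = {"conn_id": "my_kerberos_conn",
>>>>             "principal": "someuser@EXAMPLE.COM",
>>>>             "keytab_b64": base64.b64encode(b"<keytab bytes>").decode()}
>>>> 
>>>>     # Reconstruct the keytab and obtain a ticket with kinit.
>>>>     with tempfile.NamedTemporaryFile(suffix=".keytab") as kt:
>>>>         kt.write(base64.b64decode(conn["keytab_b64"]))
>>>>         kt.flush()
>>>>         subprocess.run(["kinit", "-kt", kt.name, conn["principal"]],
>>>>                        check=True)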
>>>> 
>>>> * Obviously the worker and scheduler need to communicate over SSL.
>>>> * There is a challenge at the worker level. Credentials are secured
>>>> against other users, but are readable by the owning user. So imagine 2 DAGs
>>>> from two different users with different connections without sudo
>>>> configured. If they end up at the same worker and DAG 2 is malicious, it
>>>> could read files and memory created by DAG 1. This is the reason why using
>>>> environment variables is NOT safe (DAG 2 could read /proc/<pid>/environ).
>>>> To mitigate this we probably need to PIPE the data to the task’s STDIN (a
>>>> sketch follows below this list). It won’t solve the issue but will make it
>>>> harder, as now it will only be in memory.
>>>> * The reconstructed keytab (or the initialized version) can be stored in,
>>>> most likely, the process-keyring (
>>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html). As mentioned
>>>> earlier this poses a challenge for Java applications that cannot read from
>>>> this location (keytab and ccache). Writing it out to the filesystem then
>>>> becomes a possibility. This is essentially the same way Spark solves it (
>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode).
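>>>> 
>>>> The STDIN hand-off as a minimal sketch (the task entrypoint name is an
>>>> assumption):
>>>> 
>>>>     import json
>>>>     import subprocess
>>>>     import sys
>>>> 
>>>>     secrets = {"conn_id": "hive_default", "password": "s3cret"}
>>>> 
>>>>     # Worker side: pipe secrets to the task process's stdin instead of
>>>>     # exporting them, so they never show up in /proc/<pid>/environ.
>>>>     proc = subprocess.Popen([sys.executable, "run_task.py"],
>>>>                             stdin=subprocess.PIPE)
>>>>     proc.communicate(json.dumps(secrets).encode())
>>>> 
>>>>     # Task side would then read them back once:
>>>>     #   secrets = json.loads(sys.stdin.readline())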
>>>> 
>>>> Why not work on this together? We need it as well. Airflow as it is now we
>>>> consider the biggest security threat, and it is really hard to secure.
>>>> The above would definitely be a serious improvement. Another step would be
>>>> to stop Tasks from accessing the Airflow DB altogether.
>>>> 
>>>> Cheers
>>>> Bolke
>>>> 
>>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>> 
>>>>> This makes sense, and thanks for putting this together. I might pick this
>>>>> up myself depending on whether we can get the rest of the multi-tenancy
>>>>> story nailed down, but I still think the tricky part is figuring out how to
>>>>> allow dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work
>>>>> with Kerberos; curious what your thoughts are there. How would secrets be
>>>>> passed securely in a multi-tenant Scheduler, starting from parsing the DAGs
>>>>> up to the executor sending them off?
>>>>> 
>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>>> Here:
>>>>>> 
>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
>>>>>> 
>>>>>> Is a working rudimentary implementation that allows securing the
>>>>>> connections (only LocalExecutor at the moment)
>>>>>> 
>>>>>> * It enforces the use of “conn_id” instead of the mix that we have now
>>>>>> * A task using “conn_id” has its connections ‘auto-registered’ (which is
>>>>>> a noop)
>>>>>> * The scheduler reads the connection information and serializes it to
>>>>>> json (which should be a different format, protobuf preferably)
>>>>>> * The scheduler then sends this info to the executor
>>>>>> * The executor puts this in the environment of the task (environment
>>>>>> most likely not secure enough for us)
>>>>>> * The BaseHook reads out this environment variable and does not need to
>>>>>> touch the database
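>>>>>> 
>>>>>> A sketch of that last step (the variable name is an assumption, and
>>>>>> json stands in for the eventual protobuf):
>>>>>> 
>>>>>>     import json
>>>>>>     import os
>>>>>> 
>>>>>>     class BaseHook:
>>>>>>         @classmethod
>>>>>>         def get_connection(cls, conn_id):
>>>>>>             # Use what the executor injected instead of the DB.
>>>>>>             conns = json.loads(
>>>>>>                 os.environ.get("AIRFLOW_CONNECTIONS", "{}"))
>>>>>>             if conn_id not in conns:
>>>>>>                 raise KeyError("connection %r not supplied" % conn_id)
>>>>>>             return conns[conn_id]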
>>>>>> 
>>>>>> The example_http_operator works; I haven’t tested any others. To make it
>>>>>> work I just adjusted the hook and operator to use “conn_id” instead
>>>>>> of the non-standard http_conn_id.
>>>>>> 
>>>>>> Makes sense?
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> * The BaseHook is adjusted to not connect to the database
>>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>> 
>>>>>>> Well, I don’t think a hook (or task) should obtain it by itself. It
>>>>>>> should be supplied. At the moment you start executing the task you cannot
>>>>>>> trust it anymore (ie. it is unmanaged / non-airflow code).
>>>>>>> 
>>>>>>> So we could change the basehook to understand supplied credentials and
>>>>>>> populate a hash with “conn_ids”. Hooks normally call
>>>>>>> BaseHook.get_connection anyway, so it shouldn’t be too hard and should in
>>>>>>> principle not require changes to the hooks themselves if they are well
>>>>>>> behaved.
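>>>>>>> 
>>>>>>> A sketch of that shape (names assumed):
>>>>>>> 
>>>>>>>     # Registry populated once from the supplied credentials, keyed by
>>>>>>>     # conn_id; get_connection consults it instead of the database.
>>>>>>>     _SUPPLIED_CONNECTIONS = {}
>>>>>>> 
>>>>>>>     def supply_credentials(connections):
>>>>>>>         # Called by the task runner with the executor's metadata.
>>>>>>>         _SUPPLIED_CONNECTIONS.update(
>>>>>>>             {c["conn_id"]: c for c in connections})
>>>>>>> 
>>>>>>>     def get_connection(conn_id):
>>>>>>>         if conn_id not in _SUPPLIED_CONNECTIONS:
>>>>>>>             raise LookupError("no credentials for %r" % conn_id)
>>>>>>>         return _SUPPLIED_CONNECTIONS[conn_id]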
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>> 
>>>>>>>> *So basically in the scheduler we parse the dag. Either from the manifest
>>>>>>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>>>>>>> know what connections and keytabs are available dag wide or per task.*
>>>>>>>> This is the hard part that I was curious about: for dynamically created
>>>>>>>> DAGs, e.g. those generated by reading tasks from a MySQL database or a
>>>>>>>> json file, there isn't a great way to do this.
>>>>>>>> 
>>>>>>>> I 100% agree with deprecating the connections table (at least for the
>>>>>>>> secure option). The main work there is rewriting all hooks to take
>>>>>>>> credentials from arbitrary data sources by allowing a customized
>>>>>>>> CredentialsReader class. Although hooks are technically private, I think a
>>>>>>>> lot of companies depend on them, so the PMC should probably discuss whether
>>>>>>>> this is an Airflow 2.0 change or not.
>>>>>>>> 
>>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Sure. In general I consider keytabs as a part of connection information.
>>>>>>>>> Connections should be secured by sending the connection information a task
>>>>>>>>> needs as part of the information the executor gets. A task should then not
>>>>>>>>> need access to the connection table in Airflow. Keytabs could then be sent
>>>>>>>>> as part of the connection information (base64 encoded) and set up by the
>>>>>>>>> executor (this key) to be readable only by the task it is launching.
>>>>>>>>> 
>>>>>>>>> So basically in the scheduler we parse the dag. Either from the manifest
>>>>>>>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>>>>>>>> know what connections and keytabs are available dag wide or per task.
>>>>>>>>> 
>>>>>>>>> The credentials and connection information are then serialized into a
>>>>>>>>> protobuf message and sent to the executor as part of the “queue” action.
>>>>>>>>> The worker then deserializes the information and makes it securely
>>>>>>>>> available to the task (which is quite hard btw).
>>>>>>>>> 
>>>>>>>>> On that last bit, making the info securely available might mean storing it
>>>>>>>>> in the Linux KEYRING (supported by python keyring). Keytabs will be tough
>>>>>>>>> to do properly due to Java not properly supporting the KEYRING, only files,
>>>>>>>>> and these are hard to make secure (due to the possibility that a process
>>>>>>>>> will list all files in /tmp and get credentials through that). Maybe
>>>>>>>>> storing the keytab with a password and having the password in the KEYRING
>>>>>>>>> might work. Something to find out.
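>>>>>>>>> 
>>>>>>>>> For the password-in-KEYRING variant, a minimal sketch with the python
>>>>>>>>> keyring library (service/user names illustrative; which backend
>>>>>>>>> keyring picks is system-dependent):
>>>>>>>>> 
>>>>>>>>>     import keyring
>>>>>>>>> 
>>>>>>>>>     # Store the keytab password in the keyring, not on disk.
>>>>>>>>>     keyring.set_password("airflow-keytab", "my_dag_user", "kt-pass")
>>>>>>>>> 
>>>>>>>>>     # Task side: fetch it to unlock the password-protected keytab.
>>>>>>>>>     password = keyring.get_password("airflow-keytab", "my_dag_user")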
>>>>>>>>> 
>>>>>>>>> B.
>>>>>>>>> 
>>>>>>>>> Sent from my iPad
>>>>>>>>> 
>>>>>>>>>> On 27 Jul 2018 at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>>> 
>>>>>>>>>> I'm curious whether you had any ideas on how to enable multi-tenancy
>>>>>>>>>> with respect to Kerberos in Airflow.
>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Cool. The doc will need some refinement, as it isn't entirely accurate. In
>>>>>>>>>>> addition we need to distinguish between Airflow as a client of kerberized
>>>>>>>>>>> services (this is what is talked about in the astronomer doc) vs
>>>>>>>>>>> kerberizing airflow itself, which the API supports.
>>>>>>>>>>> 
>>>>>>>>>>> In general, to access kerberized services (airflow as a client) one needs
>>>>>>>>>>> to start the ticket renewer with a valid keytab. For the hooks it isn't
>>>>>>>>>>> always required to change the hook to support it. Hadoop cli tools often
>>>>>>>>>>> just pick it up, as their client config is set to do so. Then another class
>>>>>>>>>>> is there for HTTP-like services which are accessed by urllib under the
>>>>>>>>>>> hood; these typically use SPNEGO. These often need to be adjusted, as it
>>>>>>>>>>> requires some urllib config. Finally, there are protocols which use SASL
>>>>>>>>>>> with kerberos, like HDFS (not webhdfs, that uses SPNEGO). These require per
>>>>>>>>>>> protocol implementations.
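>>>>>>>>>>> 
>>>>>>>>>>> For the SPNEGO case, a sketch assuming the requests-kerberos package
>>>>>>>>>>> and a valid ticket already obtained by the renewer:
>>>>>>>>>>> 
>>>>>>>>>>>     import requests
>>>>>>>>>>>     from requests_kerberos import HTTPKerberosAuth, OPTIONAL
>>>>>>>>>>> 
>>>>>>>>>>>     # SPNEGO negotiates per request using the ticket cache.
>>>>>>>>>>>     r = requests.get(
>>>>>>>>>>>         "https://webhdfs.example.com:9871/webhdfs/v1/?op=LISTSTATUS",
>>>>>>>>>>>         auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))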
>>>>>>>>>>> 
>>>>>>>>>>> Off the top of my head, we support kerberos client-side now with:
>>>>>>>>>>> 
>>>>>>>>>>> * Spark
>>>>>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>>>>>>>> implementation)
>>>>>>>>>>> * Hive (not metastore afaik)
>>>>>>>>>>> 
>>>>>>>>>>> A few things to remember:
>>>>>>>>>>> 
>>>>>>>>>>> * If a job (ie. a Spark job) will finish later than the maximum ticket
>>>>>>>>>>> lifetime you probably need to provide a keytab to said application.
>>>>>>>>>>> Otherwise you will get failures after the expiry.
>>>>>>>>>>> * A keytab (used by the renewer) holds credentials (user and pass), so
>>>>>>>>>>> jobs are executed under the keytab in use at that moment.
>>>>>>>>>>> * Securing keytabs in multi-tenant airflow is a challenge. This also goes
>>>>>>>>>>> for securing connections. This we need to fix at some point. The solution
>>>>>>>>>>> for now seems to be no multi-tenancy.
>>>>>>>>>>> 
>>>>>>>>>>> Kerberos seems harder than it is, btw. Still, we are sometimes moving away
>>>>>>>>>>> from it to OAUTH2-based authentication. This gets us closer to cloud
>>>>>>>>>>> standards (but we are on-prem).
>>>>>>>>>>> 
>>>>>>>>>>> B.
>>>>>>>>>>> 
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>> 
>>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Taylor
>>>>>>>>>>>> 
>>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a pull
>>>>>>>>>>>> request to enhance the apache airflow docs.
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks
>>>>>>>>>>>> Hitesh
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
>>>>>>>>>>>>> who've used Kerberos with Airflow on this quick guide I put together
>>>>>>>>>>>>> yesterday. It's similar to what's in the Airflow docs, but instead all on
>>>>>>>>>>>>> one page and slightly expanded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>>>>>>>> (or web version: https://www.astronomer.io/guides/kerberos/)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a hook.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a Concepts >
>>>>>>>>>>>>> Additional Functionality > Kerberos page?)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Taylor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Taylor Edmiston*
>>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor>
>>>>>>>>>>>>> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>>>>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>>>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
>>>>>>>>>>>>> <fokko@driesprong.frl> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Ry,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos,
>>>>>>>>>>>>>> and he also did the implementation for Airflow. Besides that, he also
>>>>>>>>>>>>>> worked on implementing Kerberos in Ambari. Just want to let you know.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <ry@astronomer.io> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi everyone -
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We have several bigCo's who are considering using Airflow asking about
>>>>>>>>>>>>>>> its support for Kerberos.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and will likely
>>>>>>>>>>>>>>> record a screencast on it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from organizations
>>>>>>>>>>>>>>> who are using Kerberos with Airflow. If anyone would be willing to share
>>>>>>>>>>>>>>> their experiences here, or reply to me personally, it would be greatly
>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Ry
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>>>>>>>>> <http://www.linkedin.com/in/rywalker>


Re: Kerberos and Airflow

Posted by Dan Davydov <dd...@twitter.com.INVALID>.
I look forward to reading the draft and working on it with you! Not 100%
sure I can make it to SF for the hackathon (I'm in New York now), but I can
participate remotely.



On Sat, Aug 4, 2018 at 9:30 AM Bolke de Bruin <bd...@gmail.com> wrote:

> Hi Dan,
>
> Don’t misunderstand me. I think what I proposed is complementary to the
> dag submit function. The only thing you mentioned that I don’t think is
> needed is to fully serialize up front and therefore exclude callbacks etc.
> (although there are other serialization libraries, like marshmallow, that
> might be able to do it).
>
> You are right to mention that the hashes should be calculated at submit
> time and an authorized user should be able to recalculate a hash. Another
> option could be something like https://pypi.org/project/signedimp/ which
> we could use to verify dependencies.
>
> I’ll start writing something up. We can then shoot holes in it (I think
> you have a point on the crypto) and maybe do some hacking on it. This could
> be part of the hackathon in Sept in SF; I’m sure some other people would
> have an interest in it as well.
>
> B.
>
> Sent from my iPad
>
> > On 3 Aug 2018 at 23:14, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> >
> > I designed a system similar to what you are describing which is in use at
> > Airbnb (only DAGs on a whitelist would be allowed to be merged to the git
> > repo if they used certain types of impersonation). It worked for simple use
> > cases, but the problem was that access control becomes very difficult,
> > e.g. solving the problem of which DAGs map to which manifest files, and
> > which manifest files can access which secrets.
> >
> > There is also a security risk where someone changes e.g. a python file
> > dependency of your task. Or let's say you figure out a way to block those
> > kinds of changes based on your hashing: what if there is a legitimate
> > change in a dependency and you want to recalculate the hash? Then I think
> > you go back to a solution like your proposed "airflow submit" command to
> > accomplish this.
> >
> > Additional concerns:
> > - I'm not sure I'm a fan of having the first time a scheduler parses a DAG
> > be what creates the hashes either; it feels to me like
> > encryption/hashing should be done before DAGs are even parsed by the
> > scheduler (at commit time or submit time of the DAGs)
> > - The type of the encrypted key seems kind of hacky to me, i.e. some kind
> > of custom hash based on DAG structure instead of a simple token passed in
> > by users, which has a clear separation of concerns WRT security
> > - Added complexity both to Airflow code, and to users, as they need to
> > define or customize hashing functions for DAGs to improve security
> > If we can get a reasonably secure solution then it might be a reasonable
> > trade-off considering the alternative is a major overhaul/restrictions to
> > DAGs.
> >
> > Maybe I'm missing some details that would alleviate my concerns here, and
> > a bit of a more in-depth document might help?
> >
> >
> >
> > *Also: using the Kubernetes executor combined with some of the things we
> > discussed greatly enhances the security of Airflow as the
> > environment isn’t really shared anymore.*
> > Assuming a multi-tenant scheduler, I feel the same set of hard problems
> > exists with Kubernetes, as the executor mainly just simplifies the
> > post-executor parts of task scheduling/execution, which I think you already
> > outlined a good solution for early on in this thread (passing keys from the
> > executor to workers).
> >
> > Happy to set up some time to talk real-time about this, by the way. Once we
> > iron out the details I want to implement whatever the best solution we come
> > up with is.
> >
> >> On Thu, Aug 2, 2018 at 4:13 PM Bolke de Bruin <bd...@gmail.com> wrote:
> >>
> >> You mentioned you would like to make sure that the DAG (and its tasks)
> >> runs in a confined set of settings, i.e. a given set of connections fixed
> >> at submission time, not at run time. So here we can make use of the fact
> >> that both the scheduler and the worker parse the DAG.
> >>
> >> Firstly, when the scheduler evaluates a DAG it can add an integrity check
> >> (hash) for each task. The executor can encrypt the metadata with this
> >> hash, ensuring that the structure of the DAG remained the same. It means
> >> that the task is only able to decrypt the metadata when it is able to
> >> calculate the same hash.
> >>
> >> Similarly, if the scheduler parses a DAG for the first time it can
> >> register the hashes for the tasks. It can then verify these hashes at
> >> runtime to ensure the structure of the tasks has stayed the same. In the
> >> manifest (which could even be in the DAG or part of the DAG definition)
> >> we could specify which fields would be used for hash calculation. We
> >> could even specify static hashes. This would give flexibility as to what
> >> freedom the users have in the auto-generated DAGs.
> >>
> >> Something like that?
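> >>
> >> As a sketch of the encrypt-with-structure-hash idea (assuming the
> >> cryptography package; the hashed fields are illustrative):
> >>
> >>     import base64
> >>     import hashlib
> >>     import json
> >>     from cryptography.fernet import Fernet
> >>
> >>     def structure_key(task_fields):
> >>         # Any change to these fields yields a different key, so a task
> >>         # can only decrypt if its structure matches what was hashed.
> >>         digest = hashlib.sha256(
> >>             json.dumps(task_fields, sort_keys=True).encode()).digest()
> >>         return base64.urlsafe_b64encode(digest)
> >>
> >>     fields = {"task_id": "t1", "operator": "BashOperator",
> >>               "conn_id": "hive_default"}
> >>     # Executor side: encrypt metadata under the structure-derived key.
> >>     token = Fernet(structure_key(fields)).encrypt(b'{"conn_id": "hive"}')
> >>     # Task side: recompute the key from its own parsed structure.
> >>     metadata = Fernet(structure_key(fields)).decrypt(token)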
> >>
> >> B.
> >>
> >>> On 2 Aug 2018, at 20:12, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> >>>
> >>> I'm very intrigued, and am curious how this would work in a bit more
> >>> detail, especially for dynamically created DAGs (how would static
> >>> manifests map to DAGs that are generated from rows in a MySQL table, for
> >>> example)? You could of course have something like regexes in your manifest
> >>> file, like some_dag_framework_dag_*, but then how would you make sure that
> >>> other users did not create DAGs that matched this regex?
> >>>
> >>> On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bdbruin@gmail.com
> >> <ma...@gmail.com>> wrote:
> >>>
> >>>> Hi Dan,
> >>>>
> >>>> I discussed this a little bit with one of the security architects
> here.
> >> We
> >>>> think that
> >>>> you can have a fair trade off between security and usability by having
> >>>> a kind of manifest with the dag you are submitting. This manifest can
> >> then
> >>>> specify what the generated tasks/dags are allowed to do and what
> >> metadata
> >>>> to provide to them. We could also let the scheduler generate hashes
> per
> >>>> generated
> >>>> DAG / task and verify those with an established version (1st run?).
> This
> >>>> limits the
> >>>> attack vector.
> >>>>
> >>>> A DagSerializer would be great, but I think it solves a different
> issue
> >>>> and the above
> >>>> is somewhat simpler to implement?
> >>>>
> >>>> Bolke
> >>>>
> >>>>> On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID>
> >>>> wrote:
> >>>>>
> >>>>> *Let’s say we trust the owner field of the DAGs I think we could do
> the
> >>>>> following.*
> >>>>> *Obviously, the trusting the user part is key here. It is one of the
> >>>>> reasons I was suggesting using “airflow submit” to update / add dags
> in
> >>>>> Airflow*
> >>>>>
> >>>>>
> >>>>> *This is the hard part about my question.*
> >>>>> I think in a true multi-tenant environment we wouldn't be able to
> trust
> >>>> the
> >>>>> user, otherwise we wouldn't necessarily even need a mapping of
> Airflow
> >>>> DAG
> >>>>> users to secrets, because if we trust users to set the correct
> Airflow
> >>>> user
> >>>>> for DAGs, we are basically trusting them with all of the creds the
> >>>> Airflow
> >>>>> scheduler can access for all users anyways.
> >>>>>
> >>>>> I actually had the same thought as your "airflow submit" a while ago,
> >>>> which
> >>>>> I discussed with Alex, basically creating an API for adding DAGs
> >> instead
> >>>> of
> >>>>> having the Scheduler parse them. FWIW I think it's superior to the
> git
> >>>> time
> >>>>> machine approach because it's a more generic form of "serialization"
> >> and
> >>>> is
> >>>>> more correct as well because the same DAG file parsed on a given git
> >> SHA
> >>>>> can produce different DAGs. Let me know what you think, and maybe I
> can
> >>>>> start a more formal design doc if you are onboard:
> >>>>>
> >>>>> A user or service with an auth token sends an "airflow submit"
> request
> >>>> to a
> >>>>> new kind of Dag Serialization service, along with the serialized DAG
> >>>>> objects generated by parsing on the client. It's important that these
> >>>>> serialized objects are declaritive and not e.g. pickles so that the
> >>>>> scheduler/workers can consume them and reproducability of the DAGs is
> >>>>> guaranteed. The service will then store each generated DAG along with
> >>>> it's
> >>>>> access based on the provided token e.g. using Ranger, and the
> >>>>> scheduler/workers will use the stored DAGs for scheduling/execution.
> >>>>> Operators would be deployed along with the Airflow code separately
> from
> >>>> the
> >>>>> serialized DAGs.
> >>>>>
> >>>>> A serialed DAG would look something like this (basically Luigi-style
> >> :)):
> >>>>> MyTask - BashOperator: {
> >>>>> cmd: "sleep 1"
> >>>>> user: "Foo"
> >>>>> access: "token1", "token2"
> >>>>> }
> >>>>>
> >>>>> MyDAG: {
> >>>>> MyTask1 >> SomeOtherTask1
> >>>>> MyTask2 >> SomeOtherTask1
> >>>>> }
> >>>>>
> >>>>> Dynamic DAGs in this case would just consist of a service calling
> >>>> "Airflow
> >>>>> Submit" that does it's own form of authentication to get access to
> some
> >>>>> kind of tokens (or basically just forwarding the secrets the users of
> >> the
> >>>>> dynamic DAG submit).
> >>>>>
> >>>>> For the default Airflow implementation you can maybe just have the
> Dag
> >>>>> Serialization server bundled with the Scheduler, with auth turned
> off,
> >>>> and
> >>>>> to periodically update the Dag Serialization store which would
> emulate
> >>>> the
> >>>>> current behavior closely.
> >>>>>
> >>>>> Pros:
> >>>>> 1. Consistency across running task instances in a dagrun/scheduler,
> >>>>> reproducability and auditability of DAGs
> >>>>> 2. Users can control when to deploy their DAGs
> >>>>> 3. Scheduler runs much faster since it doesn't have to run python
> files
> >>>> and
> >>>>> e.g. make network calls
> >>>>> 4. Scaling scheduler becomes easier because can have different
> service
> >>>>> responsible for parsing DAGs which can be trivially scaled
> horizontally
> >>>>> (clients are doing the parsing)
> >>>>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on
> DAGs
> >>>>> easier? e.g. can use the Scheduler itself to schedule backfills with
> a
> >>>>> slightly modified serialized version of a DAG.
> >>>>>
> >>>>> Cons:
> >>>>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
> >>>>> callbacks in operators (e.g. on_failure), and jinja_templates
> >>>>> 2. Version compatibility problems, e.g. user/service client might be
> >>>>> serializing arguments for hooks/operators that have been deprecated
> in
> >>>>> newer versions of the hooks, or the serialized DAG schema changes and
> >> old
> >>>>> DAGs aren't automatically updated. Might want to have some kind of
> >>>>> versioning system for serialized DAGs to at least ensure that stored
> >> DAGs
> >>>>> are valid when the Scheduler/Worker/etc are upgraded, maybe something
> >>>>> similar to thrift/protobuf versioning.
> >>>>> 3. Additional complexity - additional service, logic on
> >> workers/scheduler
> >>>>> to fetch/cache serialized DAGs efficiently, expiring/archiving old
> DAG
> >>>>> definitions, etc
> >>>>>
> >>>>>
> >>>>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com
> >>>> <mailto:bdbruin@gmail.com <ma...@gmail.com>>> wrote:
> >>>>>
> >>>>>> Ah gotcha. That’s another issue actually (but related).
> >>>>>>
> >>>>>> Let’s say we trust the owner field of the DAGs I think we could do
> the
> >>>>>> following. We then have a table (and interface) to tell Airflow what
> >>>> users
> >>>>>> have access to what connections. The scheduler can then check if the
> >>>> task
> >>>>>> in the dag can access the conn_id it is asking for. Auto generated
> >> dags
> >>>>>> still have an owner (or should) and therefore should be fine. Some
> >>>>>> integrity checking could/should be added as we want to be sure that
> >> the
> >>>>>> task we schedule is the task we launch. So a signature calculated at
> >> the
> >>>>>> scheduler (or part of the DAG), send as part of the metadata and
> >>>> checked by
> >>>>>> the executor is probably smart.
> >>>>>>
> >>>>>> You can also make this more fancy by integrating with something like
> >>>>>> Apache Ranger that allows for policy checking.
> >>>>>>
> >>>>>> Obviously, the trusting the user part is key here. It is one of the
> >>>>>> reasons I was suggesting using “airflow submit” to update / add dags
> >> in
> >>>>>> Airflow. We could enforce authentication on the DAG. It was kind of
> >>>> ruled
> >>>>>> out in favor of git time machines although these never happened
> afaik
> >>>> ;-).
> >>>>>>
> >>>>>> BTW: I have updated my implementation with protobuf. Metadata is now
> >>>>>> available at executor and task.
> >>>>>>
> >>>>>>
> >>>>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavydov@twitter.com.INVALID
> >> <ma...@twitter.com.INVALID>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> The concern is how to secure secrets on the scheduler such that
> only
> >>>>>>> certain DAGs can access them, and in the case of files that create
> >> DAGs
> >>>>>>> dynamically, only some set of DAGs should be able to access these
> >>>>>> secrets.
> >>>>>>>
> >>>>>>> e.g. if there is a secret/keytab that can be read by DAG A
> generated
> >> by
> >>>>>>> file X, and file X generates DAG B as well, there needs to be a
> >> scheme
> >>>> to
> >>>>>>> stop the parsing of DAG B on the scheduler from being able to read
> >> the
> >>>>>>> secret in DAG A.
> >>>>>>>
> >>>>>>> Does that make sense?
> >>>>>>>
> >>>>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com
> >> <ma...@gmail.com>
> >>>>>> <mailto:bdbruin@gmail.com <ma...@gmail.com> <mailto:
> >> bdbruin@gmail.com <ma...@gmail.com>>>> wrote:
> >>>>>>>
> >>>>>>>> I’m not sure what you mean. The example I created allows for
> dynamic
> >>>>>> DAGs,
> >>>>>>>> as the scheduler obviously knows about the tasks when they are
> ready
> >>>> to
> >>>>>> be
> >>>>>>>> scheduled.
> >>>>>>>> This isn’t any different from a static DAG or a dynamic one.
> >>>>>>>>
> >>>>>>>> For Kerberos it isnt that special. Basically a keytab are the
> >>>> revokable
> >>>>>>>> users credentials
> >>>>>>>> in a special format. The keytab itself can be protected by a
> >> password.
> >>>>>> So
> >>>>>>>> I can imagine
> >>>>>>>> that a connection is defined that sets a keytab location and
> >> password
> >>>> to
> >>>>>>>> access the keytab.
> >>>>>>>> The scheduler understands this (or maybe the Connection model) and
> >>>>>>>> serializes and sends
> >>>>>>>> it to the worker as part of the metadata. The worker then
> >> reconstructs
> >>>>>> the
> >>>>>>>> keytab and issues
> >>>>>>>> a kinit or supplies it to the other service requiring it (eg.
> Spark)
> >>>>>>>>
> >>>>>>>> * Obviously the worker and scheduler need to communicate over SSL.
> >>>>>>>> * There is a challenge at the worker level. Credentials are
> secured
> >>>>>>>> against other users, but are readable by the owning user. So
> >> imagine 2
> >>>>>> DAGs
> >>>>>>>> from two different users with different connections without sudo
> >>>>>>>> configured. If they end up at the same worker if DAG 2 is
> malicious
> >> it
> >>>>>>>> could read files and memory created by DAG 1. This is the reason
> why
> >>>>>> using
> >>>>>>>> environment variables are NOT safe (DAG 2 could read
> >>>>>> /proc/<pid>/environ).
> >>>>>>>> To mitigate this we probably need to PIPE the data to the task’s
> >>>> STDIN.
> >>>>>> It
> >>>>>>>> won’t solve the issue but will make it harder as now it will only
> be
> >>>> in
> >>>>>>>> memory.
> >>>>>>>> * The reconstructed keytab (or the initalized version) can be
> stored
> >>>> in,
> >>>>>>>> most likely, the process-keyring (
> >>>>>>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html> <
> >>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html>> <
> >>>>>>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html> <
> >>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html>> <
> >>>>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html> <
> >>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html <
> >> http://man7.org/linux/man-pages/man7/process-keyring.7.html>>>>). As
> >>>>>>>> mentioned earlier this poses a challenge for Java applications
> that
> >>>>>> cannot
> >>>>>>>> read from this location (keytab an ccache). Writing it out to the
> >>>>>>>> filesystem then becomes a possibility. This is essentially the
> same
> >>>> how
> >>>>>>>> Spark solves it (
> >>>>>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode> <
> >>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode>> <
> >>>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode> <
> >>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode>>> <
> >>>>>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode> <
> >>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode>> <
> >>>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode> <
> >>>> https://spark.apache.org/docs/latest/security.html#yarn-mode <
> >> https://spark.apache.org/docs/latest/security.html#yarn-mode>>>>).
> >>>>>>>>
> >>>>>>>> Why not work on this together? We need it as well. Airflow as it
> is
> >>>> now
> >>>>>> we
> >>>>>>>> consider the biggest security threat and it is really hard to
> secure
> >>>> it.
> >>>>>>>> The above would definitely be a serious improvement. Another step
> >>>> would
> >>>>>> be
> >>>>>>>> to stop Tasks from accessing the Airflow DB all together.
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>> Bolke
> >>>>>>>>
> >>>>>>>>> On 29 Jul 2018, at 05:36, Dan Davydov
> <ddavydov@twitter.com.INVALID
> >> <ma...@twitter.com.INVALID>
> >>>> <mailto:ddavydov@twitter.com.INVALID <mailto:
> >> ddavydov@twitter.com.INVALID>>
> >>>>>> <mailto:ddavydov@twitter.com.INVALID <mailto:
> >> ddavydov@twitter.com.INVALID> <mailto:
> >>>> ddavydov@twitter.com.INVALID <ma...@twitter.com.INVALID>>>>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> This makes sense, and thanks for putting this together. I might
> >> pick
> >>>>>> this
> >>>>>>>>> up myself depending on if we can get the rest of the
> mutli-tenancy
> >>>>>> story
> >>>>>>>>> nailed down, but I still think the tricky part is figuring out
> how
> >> to
> >>>>>>>> allow
> >>>>>>>>> dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to
> work
> >>>>>> with
> >>>>>>>>> Kerberos, curious what your thoughts are there. How would secrets
> >> be
> >>>>>>>> passed
> >>>>>>>>> securely in a multi-tenant Scheduler starting from parsing the
> DAGs
> >>>> up
> >>>>>> to
> >>>>>>>>> the executor sending them off?
> >>>>>>>>>
> >>>>>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <
> bdbruin@gmail.com
> >> <ma...@gmail.com>
> >>>> <mailto:bdbruin@gmail.com <ma...@gmail.com>>
> >>>>>> <mailto:bdbruin@gmail.com <ma...@gmail.com> <mailto:
> >> bdbruin@gmail.com <ma...@gmail.com>>>
> >>>>>>>> <mailto:bdbruin@gmail.com <ma...@gmail.com> <mailto:
> >> bdbruin@gmail.com <ma...@gmail.com>> <mailto:
> >>>> bdbruin@gmail.com <ma...@gmail.com> <mailto:
> bdbruin@gmail.com
> >> <ma...@gmail.com>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Here:
> >>>>>>>>>>
> >>>>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
> <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>> <
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>>> <
> >>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>> <
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>>>> <
> >>>>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
> <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>> <
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>>> <
> >>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>> <
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections> <
> >>>> https://github.com/bolkedebruin/airflow/tree/secure_connections <
> >> https://github.com/bolkedebruin/airflow/tree/secure_connections>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Is a working rudimentary implementation that allows securing the
> >>>>>>>>>> connections (only LocalExecutor at the moment)
> >>>>>>>>>>
> >>>>>>>>>> * It enforces the use of “conn_id” instead of the mix that we
> have
> >>>> now
> >>>>>>>>>> * A task if using “conn_id” has ‘auto-registered’ (which is a
> >> noop)
> >>>>>> its
> >>>>>>>>>> connections
> >>>>>>>>>> * The scheduler reads the connection informations and serializes
> >> it
> >>>> to
> >>>>>>>>>> json (which should be a different format, protobuf preferably)
> >>>>>>>>>> * The scheduler then sends this info to the executor
> >>>>>>>>>> * The executor puts this in the environment of the task
> >> (environment
> >>>>>>>> most
> >>>>>>>>>> likely not secure enough for us)
> >>>>>>>>>> * The BaseHook reads out this environment variable and does not
> >> need
> >>>>>> to
> >>>>>>>>>> touch the database
> >>>>>>>>>>
> >>>>>>>>>> The example_http_operator works, I havent tested any other. To
> >> make
> >>>> it
> >>>>>>>>>> work I just adjusted the hook and operator to use “conn_id”
> >> instead
> >>>>>>>>>> of the non standard http_conn_id.
> >>>>>>>>>>
> >>>>>>>>>> Makes sense?
> >>>>>>>>>>
> >>>>>>>>>> B.

Re: Kerberos and Airflow

Posted by Bolke de Bruin <bd...@gmail.com>.
Hi Dan,

Don’t misunderstand me. I think what I proposed is complementary to the dag submit function. The only part you mentioned that I don’t think is needed is fully serializing up front and therefore excluding callbacks etc. (although there are other serialization libraries, like marshmallow, that might be able to handle those).

You are right to mention that the hashes should be calculated at submit time and that an authorized user should be able to recalculate a hash. Another option could be something like https://pypi.org/project/signedimp/ which we could use to verify dependencies.

I’ll start writing something up. We can then shoot holes in it (I think you have a point on the crypto) and maybe do some hacking on it. This could be part of the hackathon in September in SF; I’m sure some other people would have an interest in it as well.

B.

Sent from my iPad

> On 3 Aug 2018, at 23:14, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> 
> I designed a system similar to what you are describing which is in use at
> Airbnb (only DAGs on a whitelist would be allowed to be merged to the git
> repo if they used certain types of impersonation). It worked for simple
> use cases, but the problem was that access control becomes very difficult,
> e.g. solving the problem of which DAGs map to which manifest files, and
> which manifest files can access which secrets.
> 
> There is also a security risk where someone changes e.g. a python file
> dependency of your task. Or let's say you figure out a way to block those
> kinds of changes based on your hashing: what if there is a legitimate
> change in a dependency and you want to recalculate the hash? Then I think
> you go back to a solution like your proposed "airflow submit" command to
> accomplish this.
> 
> Additional concerns:
> - I'm not sure if I'm a fan of the first parse of a DAG by the scheduler
> being what creates the hashes either; it feels to me like
> encryption/hashing should be done before DAGs are even parsed by the
> scheduler (at commit time or submit time of the DAGs)
> - The type of the encryption key seems kind of hacky to me, i.e. some kind
> of custom hash based on DAG structure instead of a simple token passed in
> by users, which would have a clear separation of concerns WRT security
> - Added complexity both to Airflow code, and to users, as they need to
> define or customize hashing functions for DAGs to improve security
> If we can get a reasonably secure solution then it might be a reasonable
> trade-off, considering the alternative is a major overhaul of /
> restrictions to DAGs.
> 
> Maybe I'm missing some details that would alleviate my concerns here, and
> a more in-depth document might help?
> 
> 
> 
> *Also: using the Kubernetes executor combined with some of the things we
> discussed greatly enhances the security of Airflow as the environment
> isn’t really shared anymore.*
> Assuming a multi-tenant scheduler, I feel the same set of hard problems
> exists with Kubernetes, as the executor mainly just simplifies the
> post-executor parts of task scheduling/execution, which I think you
> already outlined a good solution for early on in this thread (passing
> keys from the executor to workers).
> 
> Happy to set up some time to talk real-time about this, by the way; once
> we iron out the details I want to implement whatever the best solution we
> come up with is.
> 
>> On Thu, Aug 2, 2018 at 4:13 PM Bolke de Bruin <bd...@gmail.com> wrote:
>> 
>> You mentioned you would like to make sure that the DAG (and its tasks)
>> runs with a confined set of settings, i.e. a given set of connections
>> fixed at submission time, not at run time. Here we can make use of the
>> fact that both the scheduler and the worker parse the DAG.
>> 
>> Firstly, when the scheduler evaluates a DAG it can add an integrity check
>> (hash) for each task. The executor can encrypt the metadata with this
>> hash, ensuring that the structure of the DAG remained the same. It means
>> that the task is only able to decrypt the metadata when it is able to
>> calculate the same hash.
>> 
>> Similarly, when the scheduler parses a DAG for the first time it can
>> register the hashes for the tasks. It can then verify these hashes at
>> runtime to ensure the structure of the tasks has stayed the same. In the
>> manifest (which could even live in the DAG or be part of the DAG
>> definition) we could specify which fields would be used for hash
>> calculation. We could even specify static hashes. This would give
>> flexibility as to what freedom the users have in the auto-generated DAGs.
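>> 
>> A minimal sketch of the hash-and-encrypt part (the whitelisted field
>> names are just an example, and using Fernet from the python cryptography
>> package is an assumption, not settled design):
>> 
>> import base64, hashlib
>> from cryptography.fernet import Fernet
>> 
>> HASHED_FIELDS = ["task_id", "owner", "bash_command"]  # from the manifest
>> 
>> def task_hash(task):
>>     # stable digest over the whitelisted fields only
>>     h = hashlib.sha256()
>>     for field in HASHED_FIELDS:
>>         h.update(repr(getattr(task, field, None)).encode("utf-8"))
>>     return h.digest()
>> 
>> def encrypt_metadata(task, payload):
>>     # the 32-byte digest doubles as a Fernet key; a worker can only
>>     # decrypt payload (bytes) if it derives the same hash from the same
>>     # task structure
>>     key = base64.urlsafe_b64encode(task_hash(task))
>>     return Fernet(key).encrypt(payload)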
>> 
>> Something like that?
>> 
>> B.
>> 
>>> On 2 Aug 2018, at 20:12, Dan Davydov <dd...@twitter.com.INVALID> wrote:
>>> 
>>> I'm very intrigued, and am curious how this would work in a bit more
>>> detail, especially for dynamically created DAGs (how would static
>>> manifests map to DAGs that are generated from rows in a MySQL table, for
>>> example)? You could of course have something like regexes in your
>>> manifest file like some_dag_framework_dag_*, but then how would you make
>>> sure that other users did not create DAGs that matched this regex?
>>> 
>>> On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>>> Hi Dan,
>>>> 
>>>> I discussed this a little bit with one of the security architects here.
>>>> We think that you can have a fair trade off between security and
>>>> usability by having a kind of manifest with the dag you are submitting.
>>>> This manifest can then specify what the generated tasks/dags are allowed
>>>> to do and what metadata to provide to them. We could also let the
>>>> scheduler generate hashes per generated DAG / task and verify those with
>>>> an established version (1st run?). This limits the attack vector.
>>>> 
>>>> A DagSerializer would be great, but I think it solves a different issue
>>>> and the above is somewhat simpler to implement?
>>>> 
>>>> Bolke
>>>> 
>>>>> On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID> wrote:
>>>>> 
>>>>> *Let’s say we trust the owner field of the DAGs I think we could do the
>>>>> following.*
>>>>> *Obviously, the trusting the user part is key here. It is one of the
>>>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>>>> Airflow*
>>>>> 
>>>>> 
>>>>> *This is the hard part about my question.*
>>>>> I think in a true multi-tenant environment we wouldn't be able to trust
>>>>> the user, otherwise we wouldn't necessarily even need a mapping of
>>>>> Airflow DAG users to secrets, because if we trust users to set the
>>>>> correct Airflow user for DAGs, we are basically trusting them with all
>>>>> of the creds the Airflow scheduler can access for all users anyways.
>>>>> 
>>>>> I actually had the same thought as your "airflow submit" a while ago,
>>>>> which I discussed with Alex, basically creating an API for adding DAGs
>>>>> instead of having the Scheduler parse them. FWIW I think it's superior
>>>>> to the git time machine approach because it's a more generic form of
>>>>> "serialization" and is more correct as well because the same DAG file
>>>>> parsed on a given git SHA can produce different DAGs. Let me know what
>>>>> you think, and maybe I can start a more formal design doc if you are
>>>>> onboard:
>>>>> 
>>>>> A user or service with an auth token sends an "airflow submit" request
>>>>> to a new kind of Dag Serialization service, along with the serialized
>>>>> DAG objects generated by parsing on the client. It's important that
>>>>> these serialized objects are declarative and not e.g. pickles, so that
>>>>> the scheduler/workers can consume them and reproducibility of the DAGs
>>>>> is guaranteed. The service will then store each generated DAG along
>>>>> with its access based on the provided token e.g. using Ranger, and the
>>>>> scheduler/workers will use the stored DAGs for scheduling/execution.
>>>>> Operators would be deployed along with the Airflow code separately from
>>>>> the serialized DAGs.
>>>>> 
>>>>> A serialized DAG would look something like this (basically Luigi-style :)):
>>>>> MyTask - BashOperator: {
>>>>> cmd: "sleep 1"
>>>>> user: "Foo"
>>>>> access: "token1", "token2"
>>>>> }
>>>>> 
>>>>> MyDAG: {
>>>>> MyTask1 >> SomeOtherTask1
>>>>> MyTask2 >> SomeOtherTask1
>>>>> }
>>>>> 
>>>>> Dynamic DAGs in this case would just consist of a service calling
>>>>> "Airflow Submit" that does its own form of authentication to get access
>>>>> to some kind of tokens (or basically just forwarding the secrets the
>>>>> users of the dynamic DAG submit).
>>>>> 
>>>>> For the default Airflow implementation you can maybe just have the Dag
>>>>> Serialization server bundled with the Scheduler, with auth turned off,
>>>>> and have it periodically update the Dag Serialization store, which
>>>>> would emulate the current behavior closely.
>>>>> 
>>>>> Pros:
>>>>> 1. Consistency across running task instances in a dagrun/scheduler,
>>>>> reproducibility and auditability of DAGs
>>>>> 2. Users can control when to deploy their DAGs
>>>>> 3. Scheduler runs much faster since it doesn't have to run python files
>>>>> and e.g. make network calls
>>>>> 4. Scaling the scheduler becomes easier because a different service can
>>>>> be responsible for parsing DAGs, which can be trivially scaled
>>>>> horizontally (clients are doing the parsing)
>>>>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
>>>>> easier? e.g. can use the Scheduler itself to schedule backfills with a
>>>>> slightly modified serialized version of a DAG.
>>>>> 
>>>>> Cons:
>>>>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
>>>>> callbacks in operators (e.g. on_failure), and jinja_templates
>>>>> 2. Version compatibility problems, e.g. user/service client might be
>>>>> serializing arguments for hooks/operators that have been deprecated in
>>>>> newer versions of the hooks, or the serialized DAG schema changes and
>>>>> old DAGs aren't automatically updated. Might want to have some kind of
>>>>> versioning system for serialized DAGs to at least ensure that stored
>>>>> DAGs are valid when the Scheduler/Worker/etc are upgraded, maybe
>>>>> something similar to thrift/protobuf versioning.
>>>>> 3. Additional complexity - additional service, logic on
>>>>> workers/scheduler to fetch/cache serialized DAGs efficiently,
>>>>> expiring/archiving old DAG definitions, etc
>>>>> 
>>>>> 
>>>>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>>> Ah gotcha. That’s another issue actually (but related).
>>>>>> 
>>>>>> Let’s say we trust the owner field of the DAGs. I think we could do the
>>>>>> following. We then have a table (and interface) to tell Airflow which
>>>>>> users have access to which connections. The scheduler can then check if
>>>>>> the task in the dag can access the conn_id it is asking for. Auto
>>>>>> generated dags still have an owner (or should) and therefore should be
>>>>>> fine. Some integrity checking could/should be added, as we want to be
>>>>>> sure that the task we schedule is the task we launch. So a signature
>>>>>> calculated at the scheduler (or part of the DAG), sent as part of the
>>>>>> metadata and checked by the executor, is probably smart.
>>>>>> 
>>>>>> You can also make this more fancy by integrating with something like
>>>>>> Apache Ranger that allows for policy checking.
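>>>>>> 
>>>>>> The ownership check itself could be as small as this (the
>>>>>> user-to-connection table and the conn_ids attribute are hypothetical):
>>>>>> 
>>>>>> # allowed_connections would be backed by the new table/interface
>>>>>> allowed_connections = {"etl_team": {"hive_default", "http_default"}}
>>>>>> 
>>>>>> def scheduler_permits(task):
>>>>>>     # refuse to schedule a task asking for a conn_id its owner lacks
>>>>>>     allowed = allowed_connections.get(task.owner, set())
>>>>>>     return all(conn_id in allowed for conn_id in task.conn_ids)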
>>>>>> 
>>>>>> Obviously, the trusting-the-user part is key here. It is one of the
>>>>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>>>>> Airflow. We could enforce authentication on the DAG. It was kind of
>>>>>> ruled out in favor of git time machines, although these never happened
>>>>>> afaik ;-).
>>>>>> 
>>>>>> BTW: I have updated my implementation with protobuf. Metadata is now
>>>>>> available at executor and task.
>>>>>> 
>>>>>> 
>>>>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>> 
>>>>>>> The concern is how to secure secrets on the scheduler such that only
>>>>>>> certain DAGs can access them, and in the case of files that create
>>>>>>> DAGs dynamically, only some set of DAGs should be able to access
>>>>>>> these secrets.
>>>>>>> 
>>>>>>> e.g. if there is a secret/keytab that can be read by DAG A generated
>>>>>>> by file X, and file X generates DAG B as well, there needs to be a
>>>>>>> scheme to stop the parsing of DAG B on the scheduler from being able
>>>>>>> to read the secret in DAG A.
>>>>>>> 
>>>>>>> Does that make sense?
>>>>>>> 
>>>>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I’m not sure what you mean. The example I created allows for dynamic
>>>>>>>> DAGs, as the scheduler obviously knows about the tasks when they are
>>>>>>>> ready to be scheduled. This isn’t any different from a static DAG or
>>>>>>>> a dynamic one.
>>>>>>>> 
>>>>>>>> For Kerberos it isn’t that special. Basically a keytab is the
>>>>>>>> revokable user’s credentials in a special format. The keytab itself
>>>>>>>> can be protected by a password. So I can imagine that a connection is
>>>>>>>> defined that sets a keytab location and password to access the
>>>>>>>> keytab. The scheduler understands this (or maybe the Connection
>>>>>>>> model) and serializes and sends it to the worker as part of the
>>>>>>>> metadata. The worker then reconstructs the keytab and issues a kinit
>>>>>>>> or supplies it to the other service requiring it (eg. Spark)
>>>>>>>> 
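>>>>>>>> A rough sketch of that worker-side flow (the keytab_b64 field and the
>>>>>>>> helper names are assumptions, not existing Airflow APIs):
>>>>>>>> 
>>>>>>>> import base64, os, subprocess, tempfile
>>>>>>>> 
>>>>>>>> def materialize_keytab(extra):
>>>>>>>>     # write the base64-encoded keytab to a file only this user can read
>>>>>>>>     fd, path = tempfile.mkstemp(suffix=".keytab")
>>>>>>>>     os.fchmod(fd, 0o600)
>>>>>>>>     with os.fdopen(fd, "wb") as f:
>>>>>>>>         f.write(base64.b64decode(extra["keytab_b64"]))
>>>>>>>>     return path
>>>>>>>> 
>>>>>>>> def kinit(principal, keytab_path):
>>>>>>>>     # obtain a ticket for the task's principal from the keytab
>>>>>>>>     subprocess.check_call(["kinit", "-kt", keytab_path, principal])
>>>>>>>> 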
>>>>>>>> * Obviously the worker and scheduler need to communicate over SSL.
>>>>>>>> * There is a challenge at the worker level. Credentials are secured
>>>>>>>> against other users, but are readable by the owning user. So imagine
>>>>>>>> 2 DAGs from two different users with different connections without
>>>>>>>> sudo configured. If they end up at the same worker and DAG 2 is
>>>>>>>> malicious, it could read files and memory created by DAG 1. This is
>>>>>>>> the reason why using environment variables is NOT safe (DAG 2 could
>>>>>>>> read /proc/<pid>/environ). To mitigate this we probably need to PIPE
>>>>>>>> the data to the task’s STDIN (see the sketch after this list). It
>>>>>>>> won’t solve the issue but will make it harder, as now it will only be
>>>>>>>> in memory.
>>>>>>>> * The reconstructed keytab (or the initialized version) can be stored
>>>>>>>> in, most likely, the process-keyring
>>>>>>>> (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
>>>>>>>> mentioned earlier this poses a challenge for Java applications that
>>>>>>>> cannot read from this location (keytab and ccache). Writing it out to
>>>>>>>> the filesystem then becomes a possibility. This is essentially the
>>>>>>>> same way Spark solves it
>>>>>>>> (https://spark.apache.org/docs/latest/security.html#yarn-mode).
>>>>>>>> 
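>>>>>>>> The STDIN idea could look roughly like this on the executor side (a
>>>>>>>> sketch only; how the task reads its STDIN at startup is left open):
>>>>>>>> 
>>>>>>>> import json, subprocess
>>>>>>>> 
>>>>>>>> def launch_task(argv, credentials):
>>>>>>>>     # pipe the secrets over STDIN instead of the environment, so they
>>>>>>>>     # never show up in /proc/<pid>/environ
>>>>>>>>     proc = subprocess.Popen(argv, stdin=subprocess.PIPE)
>>>>>>>>     proc.stdin.write(json.dumps(credentials).encode("utf-8"))
>>>>>>>>     proc.stdin.close()
>>>>>>>>     return proc
>>>>>>>> 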
>>>>>>>> Why not work on this together? We need it as well. Airflow as it is
>>>>>>>> now we consider the biggest security threat, and it is really hard to
>>>>>>>> secure it. The above would definitely be a serious improvement.
>>>>>>>> Another step would be to stop Tasks from accessing the Airflow DB
>>>>>>>> altogether.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Bolke
>>>>>>>> 
>>>>>>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>> 
>>>>>>>>> This makes sense, and thanks for putting this together. I might pick
>>>>>>>>> this up myself depending on whether we can get the rest of the
>>>>>>>>> multi-tenancy story nailed down, but I still think the tricky part
>>>>>>>>> is figuring out how to allow dynamic DAGs (e.g. DAGs created from
>>>>>>>>> rows in a Mysql table) to work with Kerberos; curious what your
>>>>>>>>> thoughts are there. How would secrets be passed securely in a
>>>>>>>>> multi-tenant Scheduler, starting from parsing the DAGs up to the
>>>>>>>>> executor sending them off?
>>>>>>>>> 
>>>>>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Here:
>>>>>>>>>> 
>>>>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
>>>>>>>>>> 
>>>>>>>>>> Is a working rudimentary implementation that allows securing the
>>>>>>>>>> connections (only LocalExecutor at the moment):
>>>>>>>>>> 
>>>>>>>>>> * It enforces the use of “conn_id” instead of the mix that we have now
>>>>>>>>>> * A task using “conn_id” has ‘auto-registered’ (which is a noop) its
>>>>>>>>>> connections
>>>>>>>>>> * The scheduler reads the connection information and serializes it
>>>>>>>>>> to json (which should be a different format, protobuf preferably)
>>>>>>>>>> * The scheduler then sends this info to the executor
>>>>>>>>>> * The executor puts this in the environment of the task (environment
>>>>>>>>>> most likely not secure enough for us)
>>>>>>>>>> * The BaseHook reads out this environment variable and does not need
>>>>>>>>>> to touch the database
>>>>>>>>>> 
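>>>>>>>>>> In hook terms that last step could look roughly like this (the
>>>>>>>>>> environment variable name is made up; the branch may differ):
>>>>>>>>>> 
>>>>>>>>>> import json, os
>>>>>>>>>> from airflow.models import Connection
>>>>>>>>>> 
>>>>>>>>>> def get_connection(conn_id):
>>>>>>>>>>     # prefer the serialized connections the executor injected
>>>>>>>>>>     raw = os.environ.get("AIRFLOW__SECURED_CONNECTIONS")
>>>>>>>>>>     if raw is None:
>>>>>>>>>>         raise RuntimeError("no connections supplied to this task")
>>>>>>>>>>     fields = json.loads(raw)[conn_id]  # host, login, password, ...
>>>>>>>>>>     return Connection(conn_id=conn_id, **fields)
>>>>>>>>>> 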
>>>>>>>>>> The example_http_operator works, I haven’t tested any other. To make
>>>>>>>>>> it work I just adjusted the hook and operator to use “conn_id”
>>>>>>>>>> instead of the non-standard http_conn_id.
>>>>>>>>>> 
>>>>>>>>>> Makes sense?
>>>>>>>>>> 
>>>>>>>>>> B.
>>>>>>>>>> 
>>>>>>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Well, I don’t think a hook (or task) should obtain it by itself. It
>>>>>>>>>>> should be supplied. At the moment you start executing the task you
>>>>>>>>>>> cannot trust it anymore (ie. it is unmanaged / non airflow code).
>>>>>>>>>>> 
>>>>>>>>>>> So we could change the basehook to understand supplied credentials
>>>>>>>>>>> and populate a hash with “conn_ids”. Hooks normally call
>>>>>>>>>>> BaseHook.get_connection anyway, so it shouldn’t be too hard and
>>>>>>>>>>> should in principle not require changes to the hooks themselves if
>>>>>>>>>>> they are well behaved.
>>>>>>>>>>> 
>>>>>>>>>>> B.
>>>>>>>>>>> 
>>>>>>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> *So basically in the scheduler we parse the dag. Either from the
>>>>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
>>>>>>>>>>>> auto register?) we know what connections and keytabs are available
>>>>>>>>>>>> dag wide or per task.*
>>>>>>>>>>>> This is the hard part that I was curious about; for dynamically
>>>>>>>>>>>> created DAGs, e.g. those generated by reading tasks in a MySQL
>>>>>>>>>>>> database or a json file, there isn’t a great way to do this.
>>>>>>>>>>>> 
>>>>>>>>>>>> I 100% agree with deprecating the connections table (at least for
>>>>>>>>>>>> the secure option). The main work there is rewriting all hooks to
>>>>>>>>>>>> take credentials from arbitrary data sources by allowing a
>>>>>>>>>>>> customized CredentialsReader class. Although hooks are technically
>>>>>>>>>>>> private, I think a lot of companies depend on them, so the PMC
>>>>>>>>>>>> should probably discuss whether this is an Airflow 2.0 change or
>>>>>>>>>>>> not.
>>>>>>>>>>>> 
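>>>>>>>>>>>> Such a CredentialsReader might be as simple as this (names and the
>>>>>>>>>>>> env-var source are illustrative only):
>>>>>>>>>>>> 
>>>>>>>>>>>> import json, os
>>>>>>>>>>>> from abc import ABC, abstractmethod
>>>>>>>>>>>> 
>>>>>>>>>>>> class CredentialsReader(ABC):
>>>>>>>>>>>>     @abstractmethod
>>>>>>>>>>>>     def get_credentials(self, conn_id):
>>>>>>>>>>>>         """Return a dict of connection fields for conn_id."""
>>>>>>>>>>>> 
>>>>>>>>>>>> class EnvCredentialsReader(CredentialsReader):
>>>>>>>>>>>>     def get_credentials(self, conn_id):
>>>>>>>>>>>>         # read whatever the executor serialized into the env
>>>>>>>>>>>>         data = json.loads(os.environ["AIRFLOW__SECURED_CONNECTIONS"])
>>>>>>>>>>>>         return data[conn_id]
>>>>>>>>>>>> 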
>>>>>>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sure. In general I consider keytabs as a part of connection
>>>>>>>>>>>>> information. Connections should be secured by sending the
>>>>>>>>>>>>> connection information a task needs as part of the information
>>>>>>>>>>>>> the executor gets. A task should then not need access to the
>>>>>>>>>>>>> connection table in Airflow. Keytabs could then be sent as part
>>>>>>>>>>>>> of the connection information (base64 encoded) and set up by the
>>>>>>>>>>>>> executor (this key) to be read-only to the task it is launching.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So basically in the scheduler we parse the dag. Either from the
>>>>>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
>>>>>>>>>>>>> auto register?) we know what connections and keytabs are
>>>>>>>>>>>>> available dag wide or per task.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The credentials and connection information are then serialized
>>>>>>>>>>>>> into a protobuf message and sent to the executor as part of the
>>>>>>>>>>>>> “queue” action. The worker then deserializes the information and
>>>>>>>>>>>>> makes it securely available to the task (which is quite hard btw).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On that last bit, making the info securely available might mean
>>>>>>>>>>>>> storing it in the Linux KEYRING (supported by python keyring).
>>>>>>>>>>>>> Keytabs will be tough to do properly due to Java not properly
>>>>>>>>>>>>> supporting KEYRING and only files, and these are hard to make
>>>>>>>>>>>>> secure (due to the possibility a process will list all files in
>>>>>>>>>>>>> /tmp and get credentials through that). Maybe storing the keytab
>>>>>>>>>>>>> with a password and having the password in the KEYRING might
>>>>>>>>>>>>> work. Something to find out.
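>>>>>>>>>>>>> 
>>>>>>>>>>>>> The password-in-KEYRING part could be as simple as this (service
>>>>>>>>>>>>> and user names made up; the python keyring package picks the OS
>>>>>>>>>>>>> backend):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> import keyring
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # stash the keytab password outside the filesystem
>>>>>>>>>>>>> keyring.set_password("airflow-keytabs", "task_user", "s3cret")
>>>>>>>>>>>>> # later, in the worker, retrieve it to unlock the keytab
>>>>>>>>>>>>> password = keyring.get_password("airflow-keytabs", "task_user")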
>>>>>>>>>>>>> 
>>>>>>>>>>>>> B.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from my iPad
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm curious if you had any ideas on how to enable multi-tenancy
>>>>>>>>>>>>>> with respect to Kerberos in Airflow.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
>>>>>>>>>>>>>>> accurate. In addition we need to separate between Airflow as a
>>>>>>>>>>>>>>> client of kerberized services (this is what is talked about in
>>>>>>>>>>>>>>> the astronomer doc) vs kerberizing airflow itself, which the
>>>>>>>>>>>>>>> API supports.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In general, to access kerberized services (airflow as a client)
>>>>>>>>>>>>>>> one needs to start the ticket renewer with a valid keytab. For
>>>>>>>>>>>>>>> the hooks it isn't always required to change the hook to
>>>>>>>>>>>>>>> support it. Hadoop cli tools often just pick it up, as their
>>>>>>>>>>>>>>> client config is set to do so. Then there is another class of
>>>>>>>>>>>>>>> HTTP-like services which are accessed by urllib under the hood;
>>>>>>>>>>>>>>> these typically use SPNEGO. These often need to be adjusted, as
>>>>>>>>>>>>>>> it requires some urllib config. Finally, there are protocols
>>>>>>>>>>>>>>> which use SASL with kerberos, like HDFS (not webhdfs, that uses
>>>>>>>>>>>>>>> SPNEGO). These require per-protocol implementations.
>>>>>>>>>>>>>>> 
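>>>>>>>>>>>>>>> For the SPNEGO class, one common way to do it from python looks
>>>>>>>>>>>>>>> like this (requests-kerberos is an assumption; a given hook may
>>>>>>>>>>>>>>> sit on plain urllib instead):
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> import requests
>>>>>>>>>>>>>>> from requests_kerberos import HTTPKerberosAuth, OPTIONAL
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> # the ticket renewer must already hold a valid ticket (keytab)
>>>>>>>>>>>>>>> resp = requests.get(
>>>>>>>>>>>>>>>     "https://kerberized-service.example.com/api",  # made-up URL
>>>>>>>>>>>>>>>     auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> 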
>>>>>>>>>>>>>>> From the top of my head we support kerberos client side now
>>>>>>>>>>>>>>> with:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> * Spark
>>>>>>>>>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>>>>>>>>>>>> implementation)
>>>>>>>>>>>>>>> * Hive (not metastore afaik)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> A few things to remember:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> * If a job (ie. a Spark job) will finish later than the maximum
>>>>>>>>>>>>>>> ticket lifetime you probably need to provide a keytab to said
>>>>>>>>>>>>>>> application. Otherwise you will get failures after the expiry.
>>>>>>>>>>>>>>> * A keytab (used by the renewer) is credentials (user and pass),
>>>>>>>>>>>>>>> so jobs are executed under the keytab in use at that moment
>>>>>>>>>>>>>>> * Securing keytabs in a multi-tenant airflow is a challenge.
>>>>>>>>>>>>>>> This also goes for securing connections. This we need to fix at
>>>>>>>>>>>>>>> some point. The solution for now seems to be no multi tenancy.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Kerberos seems harder than it is btw. Still, we are sometimes
>>>>>>>>>>>>>>> moving away from it to OAUTH2-based authentication. This gets
>>>>>>>>>>>>>>> us closer to cloud standards (but we are on prem)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> B.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Taylor
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
>>>>>>>>>>>>>>>> pull request to enhance the apache airflow docs.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>> Hitesh
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or
>>>>>>>>>>>>>>>>> others who've used Kerberos with Airflow on this quick guide
>>>>>>>>>>>>>>>>> I put together yesterday. It's similar to what's in the
>>>>>>>>>>>>>>>>> Airflow docs but instead all on one page and slightly
>>>>>>>>>>>>>>>>> expanded.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>>>>>>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
>>>>>>>>>>>>>>>>> Kerberize a hook.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe
>>>>>>>>>>>>>>>>> a Concepts > Additional Functionality > Kerberos page?)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Taylor
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> *Taylor Edmiston*
>>>>>>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor> | Stack Overflow <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Ry,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with
>>>>>>>>>>>>>>>>>> Kerberos and he also did the implementation for Airflow.
>>>>>>>>>>>>>>>>>> Besides that, he also worked on implementing Kerberos in
>>>>>>>>>>>>>>>>>> Ambari. Just want to let you know.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 11:03 PM, Ry Walker <ry@astronomer.io> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi everyone -
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We have several bigCo's who are considering using Airflow
>>>>>>>>>>>>>>>>>>> asking about its support for Kerberos.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and
>>>>>>>>>>>>>>>>>>> will likely record a screencast on it.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>>>>>>>>>>>>> organizations who are using Kerberos with Airflow. If
>>>>>>>>>>>>>>>>>>> anyone would be willing to share their experiences here,
>>>>>>>>>>>>>>>>>>> or reply to me personally, it would be greatly appreciated!
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -Ry
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>>>>>>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> |
>>>>>>>>>>>>>>>>>>> LinkedIn <http://www.linkedin.com/in/rywalker>

Re: Kerberos and Airflow

Posted by Dan Davydov <dd...@twitter.com.INVALID>.
I designed a system similar to what you are describing which is in use at
Airbnb (only DAGs on a whitelist would be allowed to be merged to the git
repo if they used certain types of impersonation). It worked for simple use
cases, but access control becomes very difficult, e.g. determining which
DAGs map to which manifest files, and which manifest files can access which
secrets.

There is also a security risk where someone changes e.g. a Python file
dependency of your task. Even if you figure out a way to block those kinds
of changes based on your hashing, what happens when there is a legitimate
change in a dependency and you want to recalculate the hash? Then I think
you are back to a solution like your proposed "airflow submit" command to
accomplish this.

Additional concerns:
- I'm not sure I'm a fan of the first time a scheduler parses a DAG being
what creates the hashes either; it feels to me like encryption/hashing
should be done before DAGs are even parsed by the scheduler (at commit time
or submit time of the DAGs)
- The type of the encrypted key seems kind of hacky to me, i.e. some kind of
custom hash based on DAG structure instead of a simple token passed in by
users, which has a clear separation of concerns WRT security
- Added complexity, both to Airflow code and to users, as they need to
define or customize hashing functions for DAGs to improve security
If we can get a reasonably secure solution then it might be a reasonable
trade-off considering the alternative is a major overhaul of/restrictions
on DAGs.

Maybe I'm missing some details that would alleviate my concerns here; a
more in-depth document might help?
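
To make the hashing discussion concrete, here is a minimal sketch of the
kind of per-task fingerprint being debated. The manifest shape and the
function names are illustrative assumptions, not existing Airflow APIs:

    import hashlib
    import json

    def task_fingerprint(task, hash_fields):
        # Stable digest over only the fields the manifest whitelists.
        payload = {f: repr(getattr(task, f, None)) for f in sorted(hash_fields)}
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    # First parse: the scheduler registers the fingerprint as the trusted
    # version. Later parses: a mismatch blocks scheduling; but this is
    # exactly where a legitimate dependency change forces a re-registration
    # path, i.e. something like "airflow submit" again.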



*Also: using the Kubernetes executor combined with some of the things we
discussed greatly enhances the security of Airflow as the environment isn’t
really shared anymore.*
Assuming a multi-tenant scheduler, I feel the same set of hard problems
exists with Kubernetes, as the executor mainly just simplifies the
post-executor parts of task scheduling/execution, which I think you already
outlined a good solution for early on in this thread (passing keys from the
executor to workers).
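
For reference, a minimal sketch of the executor-to-worker handoff you
outlined earlier in the thread: piping serialized connection metadata over
STDIN instead of the environment, since anything in the environment leaks
via /proc/<pid>/environ. The function name and payload shape here are
illustrative assumptions:

    import json
    import subprocess

    def launch_task(argv, connections):
        # Spawn the task process with a pipe for its secrets.
        proc = subprocess.Popen(argv, stdin=subprocess.PIPE)
        # The metadata lives only in the pipe and the child's memory,
        # never in the environment table or on disk.
        proc.stdin.write(json.dumps(connections).encode("utf-8"))
        proc.stdin.close()
        return proc.wait()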

Happy to set up some time to talk real-time about this by the way; once we
iron out the details, I want to implement whatever the best solution we
come up with is.

On Thu, Aug 2, 2018 at 4:13 PM Bolke de Bruin <bd...@gmail.com> wrote:

> You mentioned you would like to make sure that the DAG (and its tasks)
> runs in a confined set of settings, i.e.
> a given set of connections fixed at submission time, not at run time. So here we
> can make use of the fact that both the scheduler
> and the worker parse the DAG.
>
> Firstly, when scheduler evaluates a DAG it can add an integrity check
> (hash) for each task. The executor can encrypt the
> metadata with this hash ensuring that the structure of the DAG remained
> the same. It means that the task is only
> able to decrypt the metadata when it is able to calculate the same hash.
>
> Similarly, if the scheduler parses a DAG for the first time it can
> register the hashes for the tasks. It can then verify these hashes
> at runtime to ensure the structure of the tasks has stayed the same. In
> the manifest (which could even be in the DAG or
> part of the DAG definition) we could specify which fields would be used
> for hash calculation. We could even specify
> static hashes. This would give flexibility as to what freedom the users
> have in the auto-generated DAGs.
>
> Something like that?
>
> B.
>
> > On 2 Aug 2018, at 20:12, Dan Davydov <dd...@twitter.com.INVALID>
> wrote:
> >
> > I'm very intrigued, and am curious how this would work in a bit more
> > detail, especially for dynamically created DAGs (how would static
> manifests
> > map to DAGs that are generated from rows in a MySQL table for example)?
> You
> > could of course have something like regexes in your manifest file like
> > some_dag_framework_dag_*, but then how would you make sure that other
> users
> > did not create DAGs that matched this regex?
> >
> > On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >
> >> Hi Dan,
> >>
> >> I discussed this a little bit with one of the security architects here.
> We
> >> think that
> >> you can have a fair trade off between security and usability by having
> >> a kind of manifest with the dag you are submitting. This manifest can
> then
> >> specify what the generated tasks/dags are allowed to do and what
> metadata
> >> to provide to them. We could also let the scheduler generate hashes per
> >> generated
> >> DAG / task and verify those with an established version (1st run?). This
> >> limits the
> >> attack vector.
> >>
> >> A DagSerializer would be great, but I think it solves a different issue
> >> and the above
> >> is somewhat simpler to implement?
> >>
> >> Bolke
> >>
> >>> On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID>
> >> wrote:
> >>>
> >>> *Let’s say we trust the owner field of the DAGs I think we could do the
> >>> following.*
> >>> *Obviously, the trusting the user part is key here. It is one of the
> >>> reasons I was suggesting using “airflow submit” to update / add dags in
> >>> Airflow*
> >>>
> >>>
> >>> *This is the hard part about my question.*
> >>> I think in a true multi-tenant environment we wouldn't be able to trust
> >> the
> >>> user, otherwise we wouldn't necessarily even need a mapping of Airflow
> >> DAG
> >>> users to secrets, because if we trust users to set the correct Airflow
> >> user
> >>> for DAGs, we are basically trusting them with all of the creds the
> >> Airflow
> >>> scheduler can access for all users anyways.
> >>>
> >>> I actually had the same thought as your "airflow submit" a while ago,
> >> which
> >>> I discussed with Alex, basically creating an API for adding DAGs
> instead
> >> of
> >>> having the Scheduler parse them. FWIW I think it's superior to the git
> >> time
> >>> machine approach because it's a more generic form of "serialization"
> and
> >> is
> >>> more correct as well because the same DAG file parsed on a given git
> SHA
> >>> can produce different DAGs. Let me know what you think, and maybe I can
> >>> start a more formal design doc if you are onboard:
> >>>
> >>> A user or service with an auth token sends an "airflow submit" request
> >> to a
> >>> new kind of Dag Serialization service, along with the serialized DAG
> >>> objects generated by parsing on the client. It's important that these
> >>> serialized objects are declarative and not e.g. pickles so that the
> >>> scheduler/workers can consume them and reproducibility of the DAGs is
> >>> guaranteed. The service will then store each generated DAG along with
> >> its
> >>> access based on the provided token e.g. using Ranger, and the
> >>> scheduler/workers will use the stored DAGs for scheduling/execution.
> >>> Operators would be deployed along with the Airflow code separately from
> >> the
> >>> serialized DAGs.
> >>>
> >>> A serialized DAG would look something like this (basically Luigi-style
> :)):
> >>> MyTask - BashOperator: {
> >>> cmd: "sleep 1"
> >>> user: "Foo"
> >>> access: "token1", "token2"
> >>> }
> >>>
> >>> MyDAG: {
> >>> MyTask1 >> SomeOtherTask1
> >>> MyTask2 >> SomeOtherTask1
> >>> }
> >>>
> >>> Dynamic DAGs in this case would just consist of a service calling
> >> "Airflow
> >>> Submit" that does its own form of authentication to get access to some
> >>> kind of tokens (or basically just forwarding the secrets the users of
> the
> >>> dynamic DAG submit).
> >>>
> >>> For the default Airflow implementation you can maybe just have the Dag
> >>> Serialization server bundled with the Scheduler, with auth turned off,
> >> and
> >>> to periodically update the Dag Serialization store which would emulate
> >> the
> >>> current behavior closely.
> >>>
> >>> Pros:
> >>> 1. Consistency across running task instances in a dagrun/scheduler,
> >>> reproducibility and auditability of DAGs
> >>> 2. Users can control when to deploy their DAGs
> >>> 3. Scheduler runs much faster since it doesn't have to run python files
> >> and
> >>> e.g. make network calls
> >>> 4. Scaling scheduler becomes easier because can have different service
> >>> responsible for parsing DAGs which can be trivially scaled horizontally
> >>> (clients are doing the parsing)
> >>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
> >>> easier? e.g. can use the Scheduler itself to schedule backfills with a
> >>> slightly modified serialized version of a DAG.
> >>>
> >>> Cons:
> >>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
> >>> callbacks in operators (e.g. on_failure), and jinja_templates
> >>> 2. Version compatibility problems, e.g. user/service client might be
> >>> serializing arguments for hooks/operators that have been deprecated in
> >>> newer versions of the hooks, or the serialized DAG schema changes and
> old
> >>> DAGs aren't automatically updated. Might want to have some kind of
> >>> versioning system for serialized DAGs to at least ensure that stored
> DAGs
> >>> are valid when the Scheduler/Worker/etc are upgraded, maybe something
> >>> similar to thrift/protobuf versioning.
> >>> 3. Additional complexity - additional service, logic on
> workers/scheduler
> >>> to fetch/cache serialized DAGs efficiently, expiring/archiving old DAG
> >>> definitions, etc
> >>>
> >>>
> >>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>
> >>>> Ah gotcha. That’s another issue actually (but related).
> >>>>
> >>>> Let’s say we trust the owner field of the DAGs, I think we could do the
> >>>> following. We then have a table (and interface) to tell Airflow what
> >>>> users have access to what connections. The scheduler can then check if
> >>>> the task in the dag can access the conn_id it is asking for. Auto
> >>>> generated dags still have an owner (or should) and therefore should be
> >>>> fine. Some integrity checking could/should be added as we want to be
> >>>> sure that the task we schedule is the task we launch. So a signature
> >>>> calculated at the scheduler (or part of the DAG), sent as part of the
> >>>> metadata and checked by the executor, is probably smart.
> >>>>
> >>>> You can also make this more fancy by integrating with something like
> >>>> Apache Ranger that allows for policy checking.
> >>>>
> >>>> Obviously, the trusting-the-user part is key here. It is one of the
> >>>> reasons I was suggesting using “airflow submit” to update / add dags
> >>>> in Airflow. We could enforce authentication on the DAG. It was kind of
> >>>> ruled out in favor of git time machines, although these never happened
> >>>> afaik ;-).
> >>>>
> >>>> BTW: I have updated my implementation with protobuf. Metadata is now
> >>>> available at executor and task.
> >>>>
> >>>>
> >>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>
> >>>>> The concern is how to secure secrets on the scheduler such that only
> >>>>> certain DAGs can access them, and in the case of files that create
> >>>>> DAGs dynamically, only some set of DAGs should be able to access
> >>>>> these secrets.
> >>>>>
> >>>>> e.g. if there is a secret/keytab that can be read by DAG A generated
> >>>>> by file X, and file X generates DAG B as well, there needs to be a
> >>>>> scheme to stop the parsing of DAG B on the scheduler from being able
> >>>>> to read the secret in DAG A.
> >>>>>
> >>>>> Does that make sense?
> >>>>>
> >>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>
> >>>>>> I’m not sure what you mean. The example I created allows for dynamic
> >>>>>> DAGs, as the scheduler obviously knows about the tasks when they are
> >>>>>> ready to be scheduled. This isn’t any different from a static DAG or
> >>>>>> a dynamic one.
> >>>>>>
> >>>>>> For Kerberos it isn't that special. Basically a keytab is the
> >>>>>> revocable user's credentials in a special format. The keytab itself
> >>>>>> can be protected by a password. So I can imagine that a connection is
> >>>>>> defined that sets a keytab location and password to access the
> >>>>>> keytab. The scheduler understands this (or maybe the Connection
> >>>>>> model) and serializes and sends it to the worker as part of the
> >>>>>> metadata. The worker then reconstructs the keytab and issues a kinit
> >>>>>> or supplies it to the other service requiring it (eg. Spark)
> >>>>>>
> >>>>>> * Obviously the worker and scheduler need to communicate over SSL.
> >>>>>> * There is a challenge at the worker level. Credentials are secured
> >>>>>> against other users, but are readable by the owning user. So imagine
> >>>>>> 2 DAGs from two different users with different connections without
> >>>>>> sudo configured. If they end up at the same worker and DAG 2 is
> >>>>>> malicious, it could read files and memory created by DAG 1. This is
> >>>>>> the reason why using environment variables is NOT safe (DAG 2 could
> >>>>>> read /proc/<pid>/environ).
> >>>>>> To mitigate this we probably need to PIPE the data to the task’s
> >>>>>> STDIN. It won’t solve the issue but will make it harder as now it
> >>>>>> will only be in memory.
> >>>>>> * The reconstructed keytab (or the initialized version) can be
> >>>>>> stored in, most likely, the process-keyring (
> >>>>>> http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
> >>>>>> mentioned earlier this poses a challenge for Java applications that
> >>>> cannot
> >>>>>> read from this location (keytab an ccache). Writing it out to the
> >>>>>> filesystem then becomes a possibility. This is essentially the same
> >> how
> >>>>>> Spark solves it (
> >>>>>> https://spark.apache.org/docs/latest/security.html#yarn-mode).
> >>>>>>
> >>>>>> Why not work on this together? We need it as well. We consider
> >>>>>> Airflow as it is now to be the biggest security threat, and it is
> >>>>>> really hard to secure. The above would definitely be a serious
> >>>>>> improvement. Another step would be to stop Tasks from accessing the
> >>>>>> Airflow DB altogether.
> >>>>>>
> >>>>>> Cheers
> >>>>>> Bolke
> >>>>>>
> >>>>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>>>
> >>>>>>> This makes sense, and thanks for putting this together. I might pick
> >>>>>>> this up myself depending on whether we can get the rest of the
> >>>>>>> multi-tenancy story nailed down, but I still think the tricky part is
> >>>>>>> figuring out how to allow dynamic DAGs (e.g. DAGs created from rows
> >>>>>>> in a Mysql table) to work with Kerberos; curious what your thoughts
> >>>>>>> are there. How would secrets be passed securely in a multi-tenant
> >>>>>>> Scheduler, starting from parsing the DAGs up to the executor sending
> >>>>>>> them off?
> >>>>>>>
> >>>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Here:
> >>>>>>>>
> >>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
> >>>>>>>>
> >>>>>>>> Is a working rudimentary implementation that allows securing the
> >>>>>>>> connections (only LocalExecutor at the moment)
> >>>>>>>>
> >>>>>>>> * It enforces the use of “conn_id” instead of the mix that we have
> >> now
> >>>>>>>> * A task using “conn_id” has its connections ‘auto-registered’
> >>>>>>>> (which is a noop)
> >>>>>>>> * The scheduler reads the connection information and serializes it
> >>>>>>>> to json (which should be a different format, protobuf preferably)
> >>>>>>>> * The scheduler then sends this info to the executor
> >>>>>>>> * The executor puts this in the environment of the task
> (environment
> >>>>>> most
> >>>>>>>> likely not secure enough for us)
> >>>>>>>> * The BaseHook reads out this environment variable and does not
> need
> >>>> to
> >>>>>>>> touch the database
> >>>>>>>>
> >>>>>>>> The example_http_operator works, I haven't tested any others. To
> >>>>>>>> make it work I just adjusted the hook and operator to use “conn_id”
> >>>>>>>> instead of the non-standard http_conn_id.
> >>>>>>>>
> >>>>>>>> Makes sense?
> >>>>>>>>
> >>>>>>>> B.
> >>>>>>>>
> >>>>>>>> * The BaseHook is adjusted to not connect to the database
> >>>>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Well, I don’t think a hook (or task) should obtain it by itself.
> >>>>>>>>> It should be supplied.
> >>>>>>>>> At the moment you start executing the task you cannot trust it
> >>>> anymore
> >>>>>>>> (ie. it is unmanaged
> >>>>>>>>> / non airflow code).
> >>>>>>>>>
> >>>>>>>>> So we could change the basehook to understand supplied credentials
> >>>>>>>>> and populate a hash with “conn_ids”. Hooks normally call
> >>>>>>>>> BaseHook.get_connection anyway, so it shouldn't be too hard and
> >>>>>>>>> should in principle not require changes to the hooks themselves if
> >>>>>>>>> they are well behaved.
> >>>>>>>>>
> >>>>>>>>> B.
> >>>>>>>>>
> >>>>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>>>>>>
> >>>>>>>>>> *So basically in the scheduler we parse the dag. Either from the
> >>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
> >>>>>>>>>> auto register?) we know what connections and keytabs are available
> >>>>>>>>>> dag wide or per task.*
> >>>>>>>>>> This is the hard part that I was curious about: for dynamically
> >>>>>>>>>> created DAGs, e.g. those generated by reading tasks in a MySQL
> >>>>>>>>>> database or a json file, there isn't a great way to do this.
> >>>>>>>>>>
> >>>>>>>>>> I 100% agree with deprecating the connections table (at least for
> >>>>>>>>>> the secure option). The main work there is rewriting all hooks to
> >>>>>>>>>> take credentials from arbitrary data sources by allowing a
> >>>>>>>>>> customized CredentialsReader class. Although hooks are technically
> >>>>>>>>>> private, I think a lot of companies depend on them so the PMC
> >>>>>>>>>> should probably discuss if this is an Airflow 2.0 change or not.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sure. In general I consider keytabs as a part of connection
> >>>>>>>>>>> information. Connections should be secured by sending the
> >>>>>>>>>>> connection information a task needs as part of the information
> >>>>>>>>>>> the executor gets. A task should then not need access to the
> >>>>>>>>>>> connection table in Airflow. Keytabs could then be sent as part
> >>>>>>>>>>> of the connection information (base64 encoded) and set up by the
> >>>>>>>>>>> executor (this key) to be readable only by the task it is
> >>>>>>>>>>> launching.
> >>>>>>>>>>>
> >>>>>>>>>>> So basically in the scheduler we parse the dag. Either from the
> >>>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
> >>>>>>>>>>> auto register?) we know what connections and keytabs are
> >>>>>>>>>>> available dag wide or per task.
> >>>>>>>>>>>
> >>>>>>>>>>> The credentials and connection information are then serialized
> >>>>>>>>>>> into a protobuf message and sent to the executor as part of the
> >>>>>>>>>>> “queue” action. The worker then deserializes the information and
> >>>>>>>>>>> makes it securely available to the task (which is quite hard
> >>>>>>>>>>> btw).
> >>>>>>>>>>>
> >>>>>>>>>>> On that last bit, making the info securely available might be
> >>>>>>>>>>> storing it in the Linux KEYRING (supported by python keyring).
> >>>>>>>>>>> Keytabs will be tough to do properly due to Java supporting only
> >>>>>>>>>>> files, not the KEYRING, and files are hard to make secure (due to
> >>>>>>>>>>> the possibility that a process will list all files in /tmp and
> >>>>>>>>>>> get credentials through that). Maybe storing the keytab with a
> >>>>>>>>>>> password and having the password in the KEYRING might work.
> >>>>>>>>>>> Something to find out.
> >>>>>>>>>>>
> >>>>>>>>>>> B.
> >>>>>>>>>>>
> >>>>>>>>>>> Verstuurd vanaf mijn iPad
> >>>>>>>>>>>
> >>>>>>>>>>>> On 27 Jul 2018 at 22:04, Dan Davydov
> >>>>>>>>>>>> <ddavydov@twitter.com.INVALID> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm curious if you had any ideas to enable multi-tenancy
> >>>>>>>>>>>> with respect to Kerberos in Airflow.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
> >>>>>>>>>>>>> accurate. In addition we need to separate between Airflow as a
> >>>>>>>>>>>>> client of kerberized services (this is what is talked about in
> >>>>>>>>>>>>> the astronomer doc) vs kerberizing airflow itself, which the
> >>>>>>>>>>>>> API supports.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In general to access kerberized services (airflow as a client)
> >>>>>>>>>>>>> one needs to start the ticket renewer with a valid keytab. For
> >>>>>>>>>>>>> the hooks it isn't always required to change the hook to
> >>>>>>>>>>>>> support it. Hadoop cli tools often just pick it up as their
> >>>>>>>>>>>>> client config is set to do so. Then another class is there for
> >>>>>>>>>>>>> HTTP-like services which are accessed by urllib under the hood;
> >>>>>>>>>>>>> these typically use SPNEGO. These often need to be adjusted as
> >>>>>>>>>>>>> it requires some urllib config. Finally, there are protocols
> >>>>>>>>>>>>> which use SASL with kerberos, like HDFS (not webhdfs, that uses
> >>>>>>>>>>>>> SPNEGO). These require per-protocol implementations.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From the top of my head we support kerberos client side now
> >>>>>>>>>>>>> with:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * Spark
> >>>>>>>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
> >>>>>>>>>>>>> implementation)
> >>>>>>>>>>>>> * Hive (not metastore afaik)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> A few things to remember:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * If a job (ie. Spark job) will finish later than the maximum
> >>>>>>>>>>>>> ticket lifetime you probably need to provide a keytab to said
> >>>>>>>>>>>>> application.
> >>>>>>>>>>>>> Otherwise you will get failures after the expiry.
> >>>>>>>>>>>>> * A keytab (used by the renewer) is a set of credentials (user
> >>>>>>>>>>>>> and pass), so jobs are executed under the keytab in use at that
> >>>>>>>>>>>>> moment
> >>>>>>>>>>>>> * Securing keytabs in multi tenancy airflow is a challenge.
> >>>>>>>>>>>>> This also goes for securing connections. This we need to fix at
> >>>>>>>>>>>>> some point. Solution for now seems to be no multi tenancy.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Kerberos seems harder than it is btw. Still, we are sometimes
> >>>>>>>>>>>>> moving away from it to OAUTH2 based authentication. This gets
> >>>>>>>>>>>>> us closer to cloud standards (but we are on prem)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> B.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Taylor
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
> >>>>>>>>>>>>>> pull request to enhance the apache airflow docs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> thanks
> >>>>>>>>>>>>>> Hitesh
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston
> >>>>>>>>>>>>>>> <tedmiston@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or
> >>>>>>>>>>>>>>> others who've used Kerberos with Airflow on this quick guide I
> >>>>>>>>>>>>>>> put together yesterday. It's similar to what's in the Airflow
> >>>>>>>>>>>>>>> docs but instead all on one page and slightly expanded.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
> >>>>>>>>>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
> >>>>>>>>>>>>>>> Kerberize a hook.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
> >>>>>>>>>>>>>>> Concepts > Additional Functionality > Kerberos page?)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Taylor
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Taylor Edmiston*
> >>>>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor> | Stack Overflow <https://stackoverflow.com/users/149428/taylor-edmiston>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
> >>>>>>>>>>>>>>> <fokko@driesprong.frl> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Ry,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with
> >>>>>>>>>>>>>>>> Kerberos and he also did the implementation for Airflow.
> >>>>>>>>>>>>>>>> Besides that, he also worked on implementing Kerberos in
> >>>>>>>>>>>>>>>> Ambari. Just want to let you know.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers, Fokko
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 11:03 PM, Ry Walker
> >>>>>>>>>>>>>>>> <ry@astronomer.io> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi everyone -
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We have several bigCo's who are considering using Airflow
> >>>>>>>>>>>>>>>>> asking about its support for Kerberos.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and
> >>>>>>>>>>>>>>>>> will likely record a screencast on it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from
> >>>>>>>>>>>>>>>>> organizations who are using Kerberos with Airflow. If anyone
> >>>>>>>>>>>>>>>>> would be willing to share their experiences here, or reply
> >>>>>>>>>>>>>>>>> to me personally, it would be greatly appreciated!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -Ry
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/
> <http://www.astronomer.io/> <
> >> http://www.astronomer.io/ <http://www.astronomer.io/>> <
> >>>> http://www.astronomer.io/ <http://www.astronomer.io/> <
> http://www.astronomer.io/ <http://www.astronomer.io/>>> <
> >>>>>> http://www.astronomer.io/ <http://www.astronomer.io/> <
> http://www.astronomer.io/ <http://www.astronomer.io/>> <
> >> http://www.astronomer.io/ <http://www.astronomer.io/> <
> http://www.astronomer.io/ <http://www.astronomer.io/>>>>> |
> >>>>>>>>>>>>>>>> 513.417.2163 |
> >>>>>>>>>>>>>>>>> @rywalker <http://twitter.com/rywalker <
> http://twitter.com/rywalker> <
> >> http://twitter.com/rywalker <http://twitter.com/rywalker>> <
> >>>> http://twitter.com/rywalker <http://twitter.com/rywalker> <
> http://twitter.com/rywalker <http://twitter.com/rywalker>>> <
> >>>>>> http://twitter.com/rywalker <http://twitter.com/rywalker> <
> http://twitter.com/rywalker <http://twitter.com/rywalker>> <
> >> http://twitter.com/rywalker <http://twitter.com/rywalker> <
> http://twitter.com/rywalker <http://twitter.com/rywalker>>>>> | LinkedIn
> >>>>>>>>>>>>>>>>> <http://www.linkedin.com/in/rywal
> <http://www.linkedin.com/in/rywalker>

Re: Kerberos and Airflow

Posted by Bolke de Bruin <bd...@gmail.com>.
You mentioned you would like to make sure that the DAG (and its tasks) runs with a confined set of settings, i.e.
a given set of connections fixed at submission time, not at run time. Here we can make use of the fact that both the scheduler
and the worker parse the DAG. 

Firstly, when the scheduler evaluates a DAG it can add an integrity check (hash) for each task. The executor can encrypt the
metadata with this hash, ensuring that the structure of the DAG remains the same: the task is only
able to decrypt the metadata when it is able to calculate the same hash.

Similarly, when the scheduler parses a DAG for the first time it can register the hashes for its tasks. It can then verify these hashes
at runtime to ensure the structure of the tasks has stayed the same. In the manifest (which could even live in the DAG or be
part of the DAG definition) we could specify which fields are used for the hash calculation. We could even specify
static hashes. This would give flexibility as to how much freedom users have in the auto-generated DAGs.
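
Roughly, in code (a sketch only; the field list and the use of SHA-256/Fernet
are illustrative choices, not a worked-out design):

import base64
import hashlib
import json

from cryptography.fernet import Fernet

# Fields the manifest could declare as part of a task's identity
# (names are made up for this sketch).
HASHED_FIELDS = ["task_id", "operator", "owner", "conn_id"]

def task_hash(task_def):
    # Stable hash over the manifest-selected fields of a task.
    canonical = json.dumps({f: task_def.get(f) for f in HASHED_FIELDS},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).digest()

def encrypt_metadata(task_def, metadata):
    # Scheduler/executor side: derive the encryption key from the task hash.
    key = base64.urlsafe_b64encode(task_hash(task_def))
    return Fernet(key).encrypt(json.dumps(metadata).encode())

def decrypt_metadata(task_def, blob):
    # Worker side: decryption only succeeds if the task parsed at the worker
    # produces the same hash (otherwise Fernet raises InvalidToken).
    key = base64.urlsafe_b64encode(task_hash(task_def))
    return json.loads(Fernet(key).decrypt(blob))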

Something like that?

B.

> On 2 Aug 2018, at 20:12, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> 
> I'm very intrigued, and am curious how this would work in a bit more
> detail, especially for dynamically created DAGs (how would static manifests
> map to DAGs that are generated from rows in a MySQL table for example)? You
> could of course have something like regexes in your manifest file like
> some_dag_framework_dag_*, but then how would you make sure that other users
> did not create DAGs that matched this regex?
> 
> On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
>> Hi Dan,
>> 
>> I discussed this a little bit with one of the security architects here. We
>> think that
>> you can have a fair trade off between security and usability by having
>> a kind of manifest with the dag you are submitting. This manifest can then
>> specify what the generated tasks/dags are allowed to do and what metadata
>> to provide to them. We could also let the scheduler generate hashes per
>> generated
>> DAG / task and verify those with an established version (1st run?). This
>> limits the
>> attack vector.
>> 
>> A DagSerializer would be great, but I think it solves a different issue
>> and the above
>> is somewhat simpler to implement?
>> 
>> Bolke
>> 
>>> On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID> wrote:
>>> 
>>> *Let’s say we trust the owner field of the DAGs I think we could do the
>>> following.*
>>> *Obviously, the trusting the user part is key here. It is one of the
>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>> Airflow*
>>> 
>>> 
>>> *This is the hard part about my question.*
>>> I think in a true multi-tenant environment we wouldn't be able to trust
>> the
>>> user, otherwise we wouldn't necessarily even need a mapping of Airflow
>> DAG
>>> users to secrets, because if we trust users to set the correct Airflow
>> user
>>> for DAGs, we are basically trusting them with all of the creds the
>> Airflow
>>> scheduler can access for all users anyways.
>>> 
>>> I actually had the same thought as your "airflow submit" a while ago,
>> which
>>> I discussed with Alex, basically creating an API for adding DAGs instead
>> of
>>> having the Scheduler parse them. FWIW I think it's superior to the git
>> time
>>> machine approach because it's a more generic form of "serialization" and
>> is
>>> more correct as well because the same DAG file parsed on a given git SHA
>>> can produce different DAGs. Let me know what you think, and maybe I can
>>> start a more formal design doc if you are onboard:
>>> 
>>> A user or service with an auth token sends an "airflow submit" request
>> to a
>>> new kind of Dag Serialization service, along with the serialized DAG
>>> objects generated by parsing on the client. It's important that these
>>> serialized objects are declarative and not e.g. pickles so that the
>>> scheduler/workers can consume them and reproducibility of the DAGs is
>>> guaranteed. The service will then store each generated DAG along with
>> its
>>> access based on the provided token e.g. using Ranger, and the
>>> scheduler/workers will use the stored DAGs for scheduling/execution.
>>> Operators would be deployed along with the Airflow code separately from
>> the
>>> serialized DAGs.
>>> 
>>> A serialized DAG would look something like this (basically Luigi-style :)):
>>> MyTask - BashOperator: {
>>> cmd: "sleep 1"
>>> user: "Foo"
>>> access: "token1", "token2"
>>> }
>>> 
>>> MyDAG: {
>>> MyTask1 >> SomeOtherTask1
>>> MyTask2 >> SomeOtherTask1
>>> }
>>> 
>>> Dynamic DAGs in this case would just consist of a service calling
>> "Airflow
>>> Submit" that does it's own form of authentication to get access to some
>>> kind of tokens (or basically just forwarding the secrets the users of the
>>> dynamic DAG submit).
>>> 
>>> For the default Airflow implementation you can maybe just have the Dag
>>> Serialization server bundled with the Scheduler, with auth turned off,
>> and
>>> to periodically update the Dag Serialization store which would emulate
>> the
>>> current behavior closely.
>>> 
>>> Pros:
>>> 1. Consistency across running task instances in a dagrun/scheduler,
>>> reproducibility and auditability of DAGs
>>> 2. Users can control when to deploy their DAGs
>>> 3. Scheduler runs much faster since it doesn't have to run python files
>> and
>>> e.g. make network calls
>>> 4. Scaling scheduler becomes easier because can have different service
>>> responsible for parsing DAGs which can be trivially scaled horizontally
>>> (clients are doing the parsing)
>>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
>>> easier? e.g. can use the Scheduler itself to schedule backfills with a
>>> slightly modified serialized version of a DAG.
>>> 
>>> Cons:
>>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
>>> callbacks in operators (e.g. on_failure), and jinja_templates
>>> 2. Version compatibility problems, e.g. user/service client might be
>>> serializing arguments for hooks/operators that have been deprecated in
>>> newer versions of the hooks, or the serialized DAG schema changes and old
>>> DAGs aren't automatically updated. Might want to have some kind of
>>> versioning system for serialized DAGs to at least ensure that stored DAGs
>>> are valid when the Scheduler/Worker/etc are upgraded, maybe something
>>> similar to thrift/protobuf versioning.
>>> 3. Additional complexity - additional service, logic on workers/scheduler
>>> to fetch/cache serialized DAGs efficiently, expiring/archiving old DAG
>>> definitions, etc
>>> 
>>> 
>>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>>> Ah gotcha. That’s another issue actually (but related).
>>>> 
>>>> Let’s say we trust the owner field of the DAGs, I think we could do the
>>>> following. We then have a table (and interface) to tell Airflow what
>> users
>>>> have access to what connections. The scheduler can then check if the
>> task
>>>> in the dag can access the conn_id it is asking for. Auto generated dags
>>>> still have an owner (or should) and therefore should be fine. Some
>>>> integrity checking could/should be added as we want to be sure that the
>>>> task we schedule is the task we launch. So a signature calculated at the
>>>> scheduler (or part of the DAG), sent as part of the metadata and checked by
>>>> the executor is probably smart.
>>>> 
>>>> You can also make this more fancy by integrating with something like
>>>> Apache Ranger that allows for policy checking.
>>>> 
>>>> Obviously, the trusting the user part is key here. It is one of the
>>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>>> Airflow. We could enforce authentication on the DAG. It was kind of
>> ruled
>>>> out in favor of git time machines although these never happened afaik
>> ;-).
>>>> 
>>>> BTW: I have updated my implementation with protobuf. Metadata is now
>>>> available at executor and task.
>>>> 
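The scheduler-side check itself could be very small (sketch; the mapping and
names below are made up, in practice this would be the table plus interface):

USER_CONNECTIONS = {
    "etl_team": {"hive_default", "my_mysql"},
    "ml_team": {"s3_models"},
}

def can_schedule(task_owner, conn_id):
    # Scheduler-side check before queueing a task that asks for conn_id.
    return conn_id in USER_CONNECTIONS.get(task_owner, set())
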
>>>> 
>>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>> 
>>>>> The concern is how to secure secrets on the scheduler such that only
>>>>> certain DAGs can access them, and in the case of files that create DAGs
>>>>> dynamically, only some set of DAGs should be able to access these
>>>> secrets.
>>>>> 
>>>>> e.g. if there is a secret/keytab that can be read by DAG A generated by
>>>>> file X, and file X generates DAG B as well, there needs to be a scheme
>> to
>>>>> stop the parsing of DAG B on the scheduler from being able to read the
>>>>> secret in DAG A.
>>>>> 
>>>>> Does that make sense?
>>>>> 
>>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>>> I’m not sure what you mean. The example I created allows for dynamic
>>>> DAGs,
>>>>>> as the scheduler obviously knows about the tasks when they are ready
>> to
>>>> be
>>>>>> scheduled.
>>>>>> This isn’t any different from a static DAG or a dynamic one.
>>>>>> 
>>>>>> For Kerberos it isn't that special. Basically a keytab is the revocable
>>>>>> user's credentials
>>>>>> in a special format. The keytab itself can be protected by a password.
>>>> So
>>>>>> I can imagine
>>>>>> that a connection is defined that sets a keytab location and password
>> to
>>>>>> access the keytab.
>>>>>> The scheduler understands this (or maybe the Connection model) and
>>>>>> serializes and sends
>>>>>> it to the worker as part of the metadata. The worker then reconstructs
>>>> the
>>>>>> keytab and issues
>>>>>> a kinit or supplies it to the other service requiring it (e.g. Spark)
>>>>>> 
>>>>>> * Obviously the worker and scheduler need to communicate over SSL.
>>>>>> * There is a challenge at the worker level. Credentials are secured
>>>>>> against other users, but are readable by the owning user. So imagine 2 DAGs
>>>>>> from two different users with different connections without sudo
>>>>>> configured. If they end up at the same worker and DAG 2 is malicious, it
>>>>>> could read files and memory created by DAG 1. This is the reason why using
>>>>>> environment variables is NOT safe (DAG 2 could read /proc/<pid>/environ).
>>>>>> To mitigate this we probably need to PIPE the data to the task’s STDIN. It
>>>>>> won’t solve the issue but will make it harder, as the data will now only be in
>>>>>> memory.
>>>>>> * The reconstructed keytab (or the initialized version) can be stored in,
>>>>>> most likely, the process-keyring
>>>>>> (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
>>>>>> mentioned earlier this poses a challenge for Java applications that cannot
>>>>>> read from this location (keytab and ccache). Writing it out to the
>>>>>> filesystem then becomes a possibility. This is essentially the same way
>>>>>> Spark solves it
>>>>>> (https://spark.apache.org/docs/latest/security.html#yarn-mode).
>>>>>> 
>>>>>> Why not work on this together? We need it as well. Airflow as it is now we
>>>>>> consider the biggest security threat, and it is really hard to secure.
>>>>>> The above would definitely be a serious improvement. Another step would be
>>>>>> to stop Tasks from accessing the Airflow DB altogether.
>>>>>> 
>>>>>> Cheers
>>>>>> Bolke
>>>>>> 
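In code, the STDIN hand-off could be as simple as this (sketch; JSON for
brevity where the actual branch uses protobuf):

import json
import subprocess
import sys

def launch_task(cmd, connections):
    # Worker side: pipe the serialized connection info to the task's STDIN
    # so it never appears in the environment or on the filesystem.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    proc.stdin.write(json.dumps(connections).encode())
    proc.stdin.close()  # EOF tells the task the payload is complete
    return proc.wait()

def read_connections():
    # Task side: read the credentials once from stdin at startup.
    return json.loads(sys.stdin.read())
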
>>>>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>> 
>>>>>>> This makes sense, and thanks for putting this together. I might pick this
>>>>>>> up myself depending on whether we can get the rest of the multi-tenancy story
>>>>>>> nailed down, but I still think the tricky part is figuring out how to
>>>>>> allow
>>>>>>> dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work
>>>> with
>>>>>>> Kerberos, curious what your thoughts are there. How would secrets be
>>>>>> passed
>>>>>>> securely in a multi-tenant Scheduler starting from parsing the DAGs
>> up
>>>> to
>>>>>>> the executor sending them off?
>>>>>>> 
>>>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Here:
>>>>>>>> 
>>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
>>>>>>>> 
>>>>>>>> Is a working rudimentary implementation that allows securing the
>>>>>>>> connections (only LocalExecutor at the moment)
>>>>>>>> 
>>>>>>>> * It enforces the use of “conn_id” instead of the mix that we have
>> now
>>>>>>>> * A task using “conn_id” has ‘auto-registered’ its connections (which is
>>>>>>>> a noop)
>>>>>>>> * The scheduler reads the connection information and serializes it to
>>>>>>>> JSON (which should be a different format, protobuf preferably)
>>>>>>>> * The scheduler then sends this info to the executor
>>>>>>>> * The executor puts this in the environment of the task (environment
>>>>>> most
>>>>>>>> likely not secure enough for us)
>>>>>>>> * The BaseHook reads out this environment variable and does not need
>>>> to
>>>>>>>> touch the database
>>>>>>>> 
>>>>>>>> The example_http_operator works; I haven't tested any others. To make it
>>>>>>>> work I just adjusted the hook and operator to use “conn_id” instead
>>>>>>>> of the non-standard http_conn_id.
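
The BaseHook side of this could look roughly like the following (sketch; the
environment variable name and JSON shape are made up):

import json
import os

class BaseHook(object):
    @classmethod
    def get_connection(cls, conn_id):
        # Read the executor-provided connection bundle instead of the DB.
        # "AIRFLOW_SECURE_CONNECTIONS" is an illustrative name only.
        bundle = json.loads(os.environ.get("AIRFLOW_SECURE_CONNECTIONS", "{}"))
        if conn_id not in bundle:
            raise ValueError("conn_id %s was not supplied to this task" % conn_id)
        return bundle[conn_id]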
>>>>>>>> 
>>>>>>>> Makes sense?
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> * The BaseHook is adjusted to not connect to the database
>>>>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Well, I don’t think a hook (or task) should obtain it by itself. It
>>>>>>>>> should be supplied.
>>>>>>>>> At the moment you start executing the task you cannot trust it anymore
>>>>>>>>> (i.e. it is unmanaged
>>>>>>>>> / non-Airflow code).
>>>>>>>>> 
>>>>>>>>> So we could change the basehook to understand supplied credentials
>>>> and
>>>>>>>> populate
>>>>>>>>> a hash keyed by “conn_id”. Hooks normally call BaseHook.get_connection
>>>>>>>>> anyway, so
>>>>>>>>> it shouldn’t be too hard and should in principle not require changes to
>>>>>>>>> the hooks
>>>>>>>>> themselves if they are well behaved.
>>>>>>>>> 
>>>>>>>>> B.
>>>>>>>>> 
>>>>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>>> 
>>>>>>>>>> *So basically in the scheduler we parse the dag. Either from the
>>>>>>>> manifest
>>>>>>>>>> (new) or from smart parsing (probably harder, maybe some auto
>>>>>>>> register?) we
>>>>>>>>>> know what connections and keytabs are available dag wide or per
>>>> task.*
>>>>>>>>>> This is the hard part that I was curious about: for dynamically created
>>>>>>>>>> DAGs, e.g. those generated by reading tasks from a MySQL database or a
>>>>>>>>>> JSON file, there isn't a great way to do this.
>>>>>>>>>> 
>>>>>>>>>> I 100% agree with deprecating the connections table (at least for
>>>> the
>>>>>>>>>> secure option). The main work there is rewriting all hooks to take
>>>>>>>>>> credentials from arbitrary data sources by allowing a customized
>>>>>>>>>> CredentialsReader class. Although hooks are technically private, I
>>>>>>>> think a
>>>>>>>>>> lot of companies depend on them so the PMC should probably discuss
>>>> if
>>>>>>>> this
>>>>>>>>>> is an Airflow 2.0 change or not.
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sure. In general I consider keytabs as a part of connection
>>>>>>>>>>> information.
>>>>>>>>>>> Connections should be secured by sending the connection information a
>>>>>>>>>>> task
>>>>>>>>>>> needs as part of the information the executor gets. A task should then
>>>>>>>>>>> not need
>>>>>>>>>>> access to the connection table in Airflow. Keytabs could then be sent
>>>>>>>>>>> as
>>>>>>>>>>> part of the connection information (base64 encoded) and set up by the
>>>>>>>>>>> executor to be readable only by the task it is launching.
>>>>>>>>>>> 
>>>>>>>>>>> So basically in the scheduler we parse the dag. Either from the
>>>>>>>> manifest
>>>>>>>>>>> (new) or from smart parsing (probably harder, maybe some auto
>>>>>>>> register?) we
>>>>>>>>>>> know what connections and keytabs are available dag wide or per
>>>> task.
>>>>>>>>>>> 
>>>>>>>>>>> The credentials and connection information are then serialized into a
>>>>>>>>>>> protobuf message and sent to the executor as part of the “queue”
>>>>>>>>>>> action.
>>>>>>>>>>> The worker then deserializes the information and makes it
>> securely
>>>>>>>>>>> available to the task (which is quite hard btw).
>>>>>>>>>>> 
>>>>>>>>>>> On that last bit: making the info securely available might mean storing
>>>>>>>>>>> it in
>>>>>>>>>>> the Linux KEYRING (supported by python keyring). Keytabs will be tough to
>>>>>>>>>>> do properly because Java does not properly support the KEYRING, only
>>>>>>>>>>> files, and
>>>>>>>>>>> these are hard to make secure (a process could list
>>>>>>>>>>> all files in /tmp and get credentials through that). Maybe storing the
>>>>>>>>>>> keytab with a password and keeping the password in the KEYRING might work.
>>>>>>>>>>> Something to find out.
>>>>>>>>>>> 
>>>>>>>>>>> B.
>>>>>>>>>>> 
>>>>>>>>>>> Sent from my iPad
>>>>>>>>>>> 
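With the python keyring package that idea would look something like this
(sketch; the service and key names are made up):

import keyring  # uses the OS keyring backend, e.g. the Linux kernel keyring

def stash_keytab_password(task_instance_key, password):
    # Keep the password protecting the on-disk keytab in the keyring,
    # not in an environment variable or a file.
    keyring.set_password("airflow-keytab", task_instance_key, password)

def fetch_keytab_password(task_instance_key):
    return keyring.get_password("airflow-keytab", task_instance_key)
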
>>>>>>>>>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm curious if you had any ideas on how to enable
>>>>>>>>>>> multi-tenancy
>>>>>>>>>>>> with respect to Kerberos in Airflow.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
>>>>>>>>>>>>> accurate. In
>>>>>>>>>>>>> addition we need to distinguish between Airflow as a client of
>>>>>>>>>>>>> kerberized
>>>>>>>>>>>>> services (this is what the astronomer doc talks about) vs
>>>>>>>>>>>>> kerberizing Airflow itself, which the API supports.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In general to access kerberized services (airflow as a client)
>>>> one
>>>>>>>> needs
>>>>>>>>>>>>> to start the ticket renewer with a valid keytab. For the hooks
>> it
>>>>>>>> isn't
>>>>>>>>>>>>> always required to change the hook to support it. Hadoop cli
>>>> tools
>>>>>>>> often
>>>>>>>>>>>>> just pick it up as their client config is set to do so. Then another
>>>>>>>>>>> class
>>>>>>>>>>>>> is there for HTTP-like services which are accessed by urllib under the
>>>>>>>>>>>>> hood; these typically use SPNEGO. These often need to be adjusted as
>>>>>>>>>>>>> they require some urllib config.
>>>> use
>>>>>>>> SASL
>>>>>>>>>>>>> with kerberos. Like HDFS (not webhdfs, that uses SPNEGO). These
>>>>>>>> require
>>>>>>>>>>> per
>>>>>>>>>>>>> protocol implementations.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Off the top of my head we support kerberos client side now with:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * Spark
>>>>>>>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>>>>>>>>>> implementation)
>>>>>>>>>>>>> * Hive (not metastore afaik)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A few things to remember:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * If a job (e.g. a Spark job) will finish later than the maximum
>>>>>> ticket
>>>>>>>>>>>>> lifetime you probably need to provide a keytab to said
>>>> application.
>>>>>>>>>>>>> Otherwise you will get failures after the expiry.
>>>>>>>>>>>>> * A keytab (used by the renewer) contains credentials (user and pass), so
>>>>>>>>>>>>> jobs
>>>>>>>>>>>>> are executed under the keytab in use at that moment
>>>>>>>>>>>>> * Securing keytabs in multi-tenant Airflow is a challenge. This also
>>>>>>>>>>> goes
>>>>>>>>>>>>> for securing connections. This we need to fix at some point. The solution
>>>>>>>>>>> for
>>>>>>>>>>>>> now seems to be no multi-tenancy.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Kerberos seems harder than it is, btw. Still, we are sometimes moving
>>>>>>>>>>> away
>>>>>>>>>>>>> from it to OAUTH2-based authentication. This gets us closer to cloud
>>>>>>>>>>>>> standards (but we are on prem).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> B.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>> 
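For the SPNEGO class of services, client code typically ends up like this
(sketch using the requests-kerberos package; the URL is made up):

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Assumes a valid ticket cache, e.g. from the airflow kerberos ticket
# renewer or a manual `kinit -kt my.keytab user@REALM`.
response = requests.get(
    "https://webhdfs.example.com:14000/webhdfs/v1/?op=LISTSTATUS",
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
)
response.raise_for_status()
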
>>>>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Taylor
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
>>>> pull
>>>>>>>>>>>>> request
>>>>>>>>>>>>>> to enhance the apache airflow docs.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>> Hitesh
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or
>>>>>>>> others
>>>>>>>>>>>>> who've
>>>>>>>>>>>>>>> used Kerberos with Airflow on this quick guide I put together
>>>>>>>>>>> yesterday.
>>>>>>>>>>>>>>> It's similar to what's in the Airflow docs but instead all on
>>>> one
>>>>>>>> page
>>>>>>>>>>>>>>> and slightly
>>>>>>>>>>>>>>> expanded.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>>>>>>>>>> (or web version: https://www.astronomer.io/guides/kerberos/)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
>>>>>> Kerberize
>>>>>>>> a
>>>>>>>>>>>>> hook.
>>>>>>>>>>>>>>> 
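Something like this, perhaps (a sketch only; assumes a SPNEGO-style HTTP
service and the requests-kerberos package):

import requests
from airflow.hooks.base_hook import BaseHook
from requests_kerberos import HTTPKerberosAuth

class KerberizedHttpHook(BaseHook):
    # Minimal sketch of a hook authenticating via SPNEGO; relies on a
    # ticket from the `airflow kerberos` renewer being present.
    def __init__(self, conn_id="http_default"):
        self.conn = self.get_connection(conn_id)

    def run(self, endpoint):
        url = "https://%s/%s" % (self.conn.host, endpoint)
        return requests.get(url, auth=HTTPKerberosAuth())
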
>>>>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>>>>>>> Concepts >
>>>>>>>>>>>>>>> Additional Functionality > Kerberos page?)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Taylor
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Taylor Edmiston*
>>>>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor> | Stack Overflow <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Ry,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with
>>>>>>>>>>>>>>>> Kerberos and
>>>>>>>>>>>>>>>> he also did the implementation for Airflow. Besides that, he also
>>>>>>>>>>>>>>>> worked on
>>>>>>>>>>>>>>>> implementing Kerberos in Ambari. Just want to let you know.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <ry@astronomer.io> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi everyone -
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We have several bigCos who are considering using Airflow asking
>>>>>>>>>>>>>>>>> about its
>>>>>>>>>>>>>>>>> support for Kerberos.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and will
>>>>>>>>>>>>>>>>> likely record a
>>>>>>>>>>>>>>>>> screencast on it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>>>>>>>>>>> organizations
>>>>>>>>>>>>>>>>> who
>>>>>>>>>>>>>>>>> are using Kerberos with Airflow. If anyone would be willing to
>>>>>>>>>>>>>>>>> share their
>>>>>>>>>>>>>>>>> experiences here, or reply to me personally, it would be greatly
>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -Ry
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> | 513.417.2163 |
>>>>>>>>>>>>>>>>> @rywalker <http://twitter.com/rywalker> | LinkedIn <http://www.linkedin.com/in/rywalker>


Re: Kerberos and Airflow

Posted by Dan Davydov <dd...@twitter.com.INVALID>.
I'm very intrigued, and am curious how this would work in a bit more
detail, especially for dynamically created DAGs (how would static manifests
map to DAGs that are generated from rows in a MySQL table for example)? You
could of course have something like regexes in your manifest file like
some_dag_framework_dag_*, but then how would you make sure that other users
did not create DAGs that matched this regex?
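
To make the question concrete, a hypothetical manifest plus check (format and
names are made up):

import fnmatch

# Each team declares the dag_id patterns it owns and the secrets those
# DAGs may use.
MANIFESTS = [
    {"owner": "team_a",
     "dag_pattern": "some_dag_framework_dag_*",
     "secrets": ["team_a_keytab"]},
]

def secrets_for(dag_id, submitting_owner):
    for m in MANIFESTS:
        if fnmatch.fnmatch(dag_id, m["dag_pattern"]):
            # The open question: what stops another user from submitting a
            # DAG whose id matches team_a's pattern in the first place?
            if submitting_owner != m["owner"]:
                raise PermissionError("dag_id matches a pattern owned by " +
                                      m["owner"])
            return m["secrets"]
    return []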

On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bd...@gmail.com> wrote:

> Hi Dan,
>
> I discussed this a little bit with one of the security architects here. We
> think that
> you can have a fair trade off between security and usability by having
> a kind of manifest with the dag you are submitting. This manifest can then
> specify what the generated tasks/dags are allowed to do and what metadata
> to provide to them. We could also let the scheduler generate hashes per
> generated
> DAG / task and verify those with an established version (1st run?). This
> limits the
> attack vector.
>
> A DagSerializer would be great, but I think it solves a different issue
> and the above
> is somewhat simpler to implement?
>
> Bolke
>
> > On 29 Jul 2018, at 23:47, Dan Davydov <dd...@twitter.com.INVALID> wrote:
> >
> > *Let’s say we trust the owner field of the DAGs I think we could do the
> > following.*
> > *Obviously, the trusting the user part is key here. It is one of the
> > reasons I was suggesting using “airflow submit” to update / add dags in
> > Airflow*
> >
> >
> > *This is the hard part about my question.*
> > I think in a true multi-tenant environment we wouldn't be able to trust
> the
> > user, otherwise we wouldn't necessarily even need a mapping of Airflow
> DAG
> > users to secrets, because if we trust users to set the correct Airflow
> user
> > for DAGs, we are basically trusting them with all of the creds the
> Airflow
> > scheduler can access for all users anyways.
> >
> > I actually had the same thought as your "airflow submit" a while ago,
> which
> > I discussed with Alex, basically creating an API for adding DAGs instead
> of
> > having the Scheduler parse them. FWIW I think it's superior to the git
> time
> > machine approach because it's a more generic form of "serialization" and
> is
> > more correct as well because the same DAG file parsed on a given git SHA
> > can produce different DAGs. Let me know what you think, and maybe I can
> > start a more formal design doc if you are onboard:
> >
> > A user or service with an auth token sends an "airflow submit" request
> to a
> > new kind of Dag Serialization service, along with the serialized DAG
> > objects generated by parsing on the client. It's important that these
> > serialized objects are declarative and not e.g. pickles so that the
> > scheduler/workers can consume them and reproducibility of the DAGs is
> > guaranteed. The service will then store each generated DAG along with
> its
> > access based on the provided token e.g. using Ranger, and the
> > scheduler/workers will use the stored DAGs for scheduling/execution.
> > Operators would be deployed along with the Airflow code separately from
> the
> > serialized DAGs.
> >
> > A serialized DAG would look something like this (basically Luigi-style :)):
> > MyTask - BashOperator: {
> >  cmd: "sleep 1"
> >  user: "Foo"
> >  access: "token1", "token2"
> > }
> >
> > MyDAG: {
> >  MyTask1 >> SomeOtherTask1
> >  MyTask2 >> SomeOtherTask1
> > }
> >
> > Dynamic DAGs in this case would just consist of a service calling
> "Airflow
> > Submit" that does it's own form of authentication to get access to some
> > kind of tokens (or basically just forwarding the secrets the users of the
> > dynamic DAG submit).
> >
> > For the default Airflow implementation you can maybe just have the Dag
> > Serialization server bundled with the Scheduler, with auth turned off,
> and
> > to periodically update the Dag Serialization store which would emulate
> the
> > current behavior closely.
> >
> > Pros:
> > 1. Consistency across running task instances in a dagrun/scheduler,
> > reproducibility and auditability of DAGs
> > 2. Users can control when to deploy their DAGs
> > 3. Scheduler runs much faster since it doesn't have to run python files
> and
> > e.g. make network calls
> > 4. Scaling scheduler becomes easier because can have different service
> > responsible for parsing DAGs which can be trivially scaled horizontally
> > (clients are doing the parsing)
> > 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
> > easier? e.g. can use the Scheduler itself to schedule backfills with a
> > slightly modified serialized version of a DAG.
> >
> > Cons:
> > 1. Have to deprecate a lot of popular features, e.g. allowing custom
> > callbacks in operators (e.g. on_failure), and jinja_templates
> > 2. Version compatibility problems, e.g. user/service client might be
> > serializing arguments for hooks/operators that have been deprecated in
> > newer versions of the hooks, or the serialized DAG schema changes and old
> > DAGs aren't automatically updated. Might want to have some kind of
> > versioning system for serialized DAGs to at least ensure that stored DAGs
> > are valid when the Scheduler/Worker/etc are upgraded, maybe something
> > similar to thrift/protobuf versioning.
> > 3. Additional complexity - additional service, logic on workers/scheduler
> > to fetch/cache serialized DAGs efficiently, expiring/archiving old DAG
> > definitions, etc
> >
> >
> > On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >
> >> Ah gotcha. That’s another issue actually (but related).
> >>
> >> Let’s say we trust the owner field of the DAGs, I think we could do the
> >> following. We then have a table (and interface) to tell Airflow what
> users
> >> have access to what connections. The scheduler can then check if the
> task
> >> in the dag can access the conn_id it is asking for. Auto generated dags
> >> still have an owner (or should) and therefore should be fine. Some
> >> integrity checking could/should be added as we want to be sure that the
> >> task we schedule is the task we launch. So a signature calculated at the
> >> scheduler (or part of the DAG), send as part of the metadata and
> checked by
> >> the executor is probably smart.
> >>
> >> You can also make this more fancy by integrating with something like
> >> Apache Ranger that allows for policy checking.
> >>
> >> Obviously, the trusting the user part is key here. It is one of the
> >> reasons I was suggesting using “airflow submit” to update / add dags in
> >> Airflow. We could enforce authentication on the DAG. It was kind of
> ruled
> >> out in favor of git time machines although these never happened afaik
> ;-).
> >>
> >> BTW: I have updated my implementation with protobuf. Metadata is now
> >> available at executor and task.
> >>
> >>
> >>> On 29 Jul 2018, at 15:47, Dan Davydov <dd...@twitter.com.INVALID>
> >> wrote:
> >>>
> >>> The concern is how to secure secrets on the scheduler such that only
> >>> certain DAGs can access them, and in the case of files that create DAGs
> >>> dynamically, only some set of DAGs should be able to access these
> >> secrets.
> >>>
> >>> e.g. if there is a secret/keytab that can be read by DAG A generated by
> >>> file X, and file X generates DAG B as well, there needs to be a scheme
> to
> >>> stop the parsing of DAG B on the scheduler from being able to read the
> >>> secret in DAG A.
> >>>
> >>> Does that make sense?
> >>>
> >>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>
> >>>> I’m not sure what you mean. The example I created allows for dynamic
> >> DAGs,
> >>>> as the scheduler obviously knows about the tasks when they are ready
> to
> >> be
> >>>> scheduled.
> >>>> This isn’t any different from a static DAG or a dynamic one.
> >>>>
> >>>> For Kerberos it isn't that special. Basically a keytab is the revocable
> >>>> user's credentials
> >>>> in a special format. The keytab itself can be protected by a password.
> >> So
> >>>> I can imagine
> >>>> that a connection is defined that sets a keytab location and password
> to
> >>>> access the keytab.
> >>>> The scheduler understands this (or maybe the Connection model) and
> >>>> serializes and sends
> >>>> it to the worker as part of the metadata. The worker then reconstructs
> >> the
> >>>> keytab and issues
> >>>> a kinit or supplies it to the other service requiring it (e.g. Spark)
> >>>>
> >>>> * Obviously the worker and scheduler need to communicate over SSL.
> >>>> * There is a challenge at the worker level. Credentials are secured
> >>>> against other users, but are readable by the owning user. So imagine 2 DAGs
> >>>> from two different users with different connections without sudo
> >>>> configured. If they end up at the same worker and DAG 2 is malicious, it
> >>>> could read files and memory created by DAG 1. This is the reason why using
> >>>> environment variables is NOT safe (DAG 2 could read /proc/<pid>/environ).
> >>>> To mitigate this we probably need to PIPE the data to the task’s STDIN. It
> >>>> won’t solve the issue but will make it harder, as the data will now only be in
> >>>> memory.
> >>>> * The reconstructed keytab (or the initialized version) can be stored in,
> >>>> most likely, the process-keyring
> >>>> (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
> >>>> mentioned earlier this poses a challenge for Java applications that cannot
> >>>> read from this location (keytab and ccache). Writing it out to the
> >>>> filesystem then becomes a possibility. This is essentially the same way
> >>>> Spark solves it
> >>>> (https://spark.apache.org/docs/latest/security.html#yarn-mode).
> >>>>
> >>>> Why not work on this together? We need it as well. Airflow as it is now we
> >>>> consider the biggest security threat, and it is really hard to secure.
> >>>> The above would definitely be a serious improvement. Another step would be
> >>>> to stop Tasks from accessing the Airflow DB altogether.
> >>>>
> >>>> Cheers
> >>>> Bolke
> >>>>
> >>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>
> >>>>> This makes sense, and thanks for putting this together. I might pick this
> >>>>> up myself depending on whether we can get the rest of the multi-tenancy story
> >>>>> nailed down, but I still think the tricky part is figuring out how to
> >>>> allow
> >>>>> dynamic DAGs (e.g. DAGs created from rows in a Mysql table) to work
> >> with
> >>>>> Kerberos, curious what your thoughts are there. How would secrets be
> >>>> passed
> >>>>> securely in a multi-tenant Scheduler starting from parsing the DAGs
> up
> >> to
> >>>>> the executor sending them off?
> >>>>>
> >>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>
> >>>>>> Here:
> >>>>>>
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
> >>>>>>
> >>>>>> Is a working rudimentary implementation that allows securing the
> >>>>>> connections (only LocalExecutor at the moment):
> >>>>>>
> >>>>>> * It enforces the use of “conn_id” instead of the mix that we have now
> >>>>>> * A task using “conn_id” has ‘auto-registered’ (which is a noop) its
> >>>>>> connections
> >>>>>> * The scheduler reads the connection information and serializes it to
> >>>>>> json (which should be a different format, protobuf preferably)
> >>>>>> * The scheduler then sends this info to the executor
> >>>>>> * The executor puts this in the environment of the task (environment
> >>>>>> most likely not secure enough for us)
> >>>>>> * The BaseHook reads out this environment variable and does not need
> >>>>>> to touch the database
> >>>>>>
> >>>>>> The example_http_operator works, I haven’t tested any others. To make
> >>>>>> it work I just adjusted the hook and operator to use “conn_id” instead
> >>>>>> of the non-standard http_conn_id.
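> >>>>>>
> >>>>>> In code, the last two bullets might look roughly like this (sketch
> >>>>>> only; the variable name is made up):
> >>>>>>
> >>>>>> import json
> >>>>>> import os
> >>>>>>
> >>>>>> from airflow.models import Connection
> >>>>>>
> >>>>>> ENV_PREFIX = "AIRFLOW_SECURE_CONN_"  # exported by the executor
> >>>>>>
> >>>>>> def connection_from_env(conn_id):
> >>>>>>     # The executor serialized the connection to json; rebuild the
> >>>>>>     # model without ever touching the metadata database.
> >>>>>>     payload = json.loads(os.environ[ENV_PREFIX + conn_id.upper()])
> >>>>>>     return Connection(conn_id=conn_id, **payload)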
> >>>>>>
> >>>>>> Makes sense?
> >>>>>>
> >>>>>> B.
> >>>>>>
> >>>>>> * The BaseHook is adjusted to not connect to the database
> >>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Well, I don’t think a hook (or task) should obtain it by itself. It
> >>>>>>> should be supplied. At the moment you start executing the task you
> >>>>>>> cannot trust it anymore (ie. it is unmanaged / non-airflow code).
> >>>>>>>
> >>>>>>> So we could change the basehook to understand supplied credentials
> >>>>>>> and populate a hash with “conn_ids”. Hooks normally call
> >>>>>>> BaseHook.get_connection anyway, so it shouldn’t be too hard and
> >>>>>>> should in principle not require changes to the hooks themselves if
> >>>>>>> they are well behaved.
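> >>>>>>>
> >>>>>>> Sketched out, that could be as simple as this (illustrative names;
> >>>>>>> the payload shape is an assumption):
> >>>>>>>
> >>>>>>> from airflow.models import Connection
> >>>>>>>
> >>>>>>> _supplied_connections = {}  # conn_id -> Connection
> >>>>>>>
> >>>>>>> def register_supplied_connections(payloads):
> >>>>>>>     # payloads: {conn_id: {"host": ..., "login": ..., ...}},
> >>>>>>>     # handed to the task by the executor at start-up.
> >>>>>>>     for conn_id, kwargs in payloads.items():
> >>>>>>>         _supplied_connections[conn_id] = Connection(conn_id=conn_id, **kwargs)
> >>>>>>>
> >>>>>>> def get_supplied_connection(conn_id):
> >>>>>>>     # BaseHook.get_connection would consult this first and only fall
> >>>>>>>     # back to the database in the legacy (insecure) mode.
> >>>>>>>     return _supplied_connections[conn_id]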
> >>>>>>>
> >>>>>>> B.
> >>>>>>>
> >>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>>>>
> >>>>>>>> *So basically in the scheduler we parse the dag. Either from the
> >>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
> >>>>>>>> auto register?) we know what connections and keytabs are available
> >>>>>>>> dag wide or per task.*
> >>>>>>>> This is the hard part that I was curious about, for dynamically
> >>>>>>>> created DAGs, e.g. those generated by reading tasks in a MySQL
> >>>>>>>> database or a json file, there isn't a great way to do this.
> >>>>>>>>
> >>>>>>>> I 100% agree with deprecating the connections table (at least for
> >>>>>>>> the secure option). The main work there is rewriting all hooks to
> >>>>>>>> take credentials from arbitrary data sources by allowing a
> >>>>>>>> customized CredentialsReader class. Although hooks are technically
> >>>>>>>> private, I think a lot of companies depend on them so the PMC should
> >>>>>>>> probably discuss if this is an Airflow 2.0 change or not.
> >>>>>>>>
> >>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Sure. In general I consider keytabs as a part of connection
> >>>>>>>>> information. Connections should be secured by sending the
> >>>>>>>>> connection information a task needs as part of the information the
> >>>>>>>>> executor gets. A task should then not need access to the connection
> >>>>>>>>> table in Airflow. Keytabs could then be sent as part of the
> >>>>>>>>> connection information (base64 encoded) and set up by the executor
> >>>>>>>>> (this is key) to be readable only by the task it is launching.
> >>>>>>>>>
> >>>>>>>>> So basically in the scheduler we parse the dag. Either from the
> >>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
> >>>>>>>>> auto register?) we know what connections and keytabs are available
> >>>>>>>>> dag wide or per task.
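> >>>>>>>>>
> >>>>>>>>> The base64 round trip could look like this (sketch; the field name
> >>>>>>>>> and helpers are made up):
> >>>>>>>>>
> >>>>>>>>> import base64
> >>>>>>>>> import json
> >>>>>>>>> import os
> >>>>>>>>>
> >>>>>>>>> def pack_keytab(extra, keytab_path):
> >>>>>>>>>     # Embed the keytab bytes in the connection's "extra" json.
> >>>>>>>>>     with open(keytab_path, "rb") as f:
> >>>>>>>>>         extra["keytab_b64"] = base64.b64encode(f.read()).decode("ascii")
> >>>>>>>>>     return json.dumps(extra)
> >>>>>>>>>
> >>>>>>>>> def unpack_keytab(extra_json, dest):
> >>>>>>>>>     data = base64.b64decode(json.loads(extra_json)["keytab_b64"])
> >>>>>>>>>     # 0600 so only the task owner can read the reconstructed file.
> >>>>>>>>>     fd = os.open(dest, os.O_WRONLY | os.O_CREAT, 0o600)
> >>>>>>>>>     with os.fdopen(fd, "wb") as f:
> >>>>>>>>>         f.write(data)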
> >>>>>>>>>
> >>>>>>>>> The credentials and connection information are then serialized
> >>>>>>>>> into a protobuf message and sent to the executor as part of the
> >>>>>>>>> “queue” action. The worker then deserializes the information and
> >>>>>>>>> makes it securely available to the task (which is quite hard btw).
> >>>>>>>>>
> >>>>>>>>> On that last bit, making the info securely available might be
> >>>>>>>>> storing it in the Linux KEYRING (supported by python keyring).
> >>>>>>>>> Keytabs will be tough to do properly due to Java not properly
> >>>>>>>>> supporting KEYRING (only files), and files are hard to make secure
> >>>>>>>>> (due to the possibility that a process will list all files in /tmp
> >>>>>>>>> and get credentials through that). Maybe storing the keytab with a
> >>>>>>>>> password and having the password in the KEYRING might work.
> >>>>>>>>> Something to find out.
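> >>>>>>>>>
> >>>>>>>>> E.g. something along these lines (sketch, using the cryptography
> >>>>>>>>> package; where the key itself is parked is the open question above):
> >>>>>>>>>
> >>>>>>>>> from cryptography.fernet import Fernet
> >>>>>>>>>
> >>>>>>>>> def protect_keytab(keytab_bytes):
> >>>>>>>>>     # Only the small key needs KEYRING protection; the ciphertext
> >>>>>>>>>     # can sit on disk for Java consumers.
> >>>>>>>>>     key = Fernet.generate_key()
> >>>>>>>>>     return key, Fernet(key).encrypt(keytab_bytes)
> >>>>>>>>>
> >>>>>>>>> def recover_keytab(key, token):
> >>>>>>>>>     return Fernet(key).decrypt(token)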
> >>>>>>>>>
> >>>>>>>>> B.
> >>>>>>>>>
> >>>>>>>>> Sent from my iPad
> >>>>>>>>>
> >>>>>>>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I'm curious if you had any ideas on how to enable multi-tenancy
> >>>>>>>>>> with respect to Kerberos in Airflow.
> >>>>>>>>>>
> >>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
> >>>>>>>>>>> accurate. In addition we need to distinguish between Airflow as a
> >>>>>>>>>>> client of kerberized services (this is what is talked about in
> >>>>>>>>>>> the astronomer doc) vs kerberizing airflow itself, which the API
> >>>>>>>>>>> supports.
> >>>>>>>>>>>
> >>>>>>>>>>> In general, to access kerberized services (airflow as a client)
> >>>>>>>>>>> one needs to start the ticket renewer with a valid keytab. For
> >>>>>>>>>>> the hooks it isn't always required to change the hook to support
> >>>>>>>>>>> it. Hadoop cli tools often just pick it up as their client config
> >>>>>>>>>>> is set to do so. Then there is another class of HTTP-like
> >>>>>>>>>>> services which are accessed by urllib under the hood; these
> >>>>>>>>>>> typically use SPNEGO. These often need to be adjusted as it
> >>>>>>>>>>> requires some urllib config. Finally, there are protocols which
> >>>>>>>>>>> use SASL with kerberos, like HDFS (not webhdfs, that uses
> >>>>>>>>>>> SPNEGO). These require per protocol implementations.
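> >>>>>>>>>>>
> >>>>>>>>>>> For the SPNEGO class this is usually all a hook needs (sketch;
> >>>>>>>>>>> assumes requests-kerberos, a ticket cache kept fresh by the
> >>>>>>>>>>> renewer, and a made-up URL):
> >>>>>>>>>>>
> >>>>>>>>>>> import requests
> >>>>>>>>>>> from requests_kerberos import HTTPKerberosAuth, OPTIONAL
> >>>>>>>>>>>
> >>>>>>>>>>> response = requests.get(
> >>>>>>>>>>>     "https://webhdfs.example.com:50070/webhdfs/v1/?op=LISTSTATUS",
> >>>>>>>>>>>     auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
> >>>>>>>>>>> )
> >>>>>>>>>>> response.raise_for_status()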
> >>>>>>>>>>>
> >>>>>>>>>>> From the top of my head we support kerberos client side now with:
> >>>>>>>>>>>
> >>>>>>>>>>> * Spark
> >>>>>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
> >>>>>>>>>>> implementation)
> >>>>>>>>>>> * Hive (not metastore afaik)
> >>>>>>>>>>>
> >>>>>>>>>>> A few things to remember:
> >>>>>>>>>>>
> >>>>>>>>>>> * If a job (ie. Spark job) will finish later than the maximum
> >>>>>>>>>>> ticket lifetime you probably need to provide a keytab to said
> >>>>>>>>>>> application. Otherwise you will get failures after the expiry.
> >>>>>>>>>>> * A keytab (used by the renewer) is credentials (user and pass),
> >>>>>>>>>>> so jobs are executed under the keytab in use at that moment.
> >>>>>>>>>>> * Securing keytabs in multi tenancy airflow is a challenge. This
> >>>>>>>>>>> also goes for securing connections. This we need to fix at some
> >>>>>>>>>>> point. The solution for now seems to be no multi tenancy.
> >>>>>>>>>>>
> >>>>>>>>>>> Kerberos seems harder than it is btw. Still, we are sometimes
> >>>>>>>>>>> moving away from it to OAUTH2 based authentication. This gets us
> >>>>>>>>>>> closer to cloud standards (but we are on prem).
> >>>>>>>>>>>
> >>>>>>>>>>> B.
> >>>>>>>>>>>
> >>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>
> >>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hitesh@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Taylor
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
> >>>>>>>>>>>> pull request to enhance the apache airflow docs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks
> >>>>>>>>>>>> Hitesh
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmiston@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or
> >>>>>>>>>>>>> others who've used Kerberos with Airflow on this quick guide I
> >>>>>>>>>>>>> put together yesterday. It's similar to what's in the Airflow
> >>>>>>>>>>>>> docs but instead all on one page and slightly expanded.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
> >>>>>>>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
> >>>>>>>>>>>>> Kerberize a hook.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
> >>>>>>>>>>>>> Concepts > Additional Functionality > Kerberos page?)
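> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Something along these lines, perhaps (untested sketch; the
> >>>>>>>>>>>>> class is made up and it assumes requests-kerberos plus a ticket
> >>>>>>>>>>>>> cache kept fresh by `airflow kerberos`):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> import requests
> >>>>>>>>>>>>> from requests_kerberos import HTTPKerberosAuth, OPTIONAL
> >>>>>>>>>>>>> from airflow.hooks.base_hook import BaseHook
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> class KerberizedHttpHook(BaseHook):
> >>>>>>>>>>>>>     def __init__(self, http_conn_id="http_default"):
> >>>>>>>>>>>>>         self.conn = self.get_connection(http_conn_id)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>     def get(self, endpoint):
> >>>>>>>>>>>>>         url = "http://{}:{}/{}".format(
> >>>>>>>>>>>>>             self.conn.host, self.conn.port, endpoint)
> >>>>>>>>>>>>>         # SPNEGO happens via the Negotiate header under the hood.
> >>>>>>>>>>>>>         return requests.get(
> >>>>>>>>>>>>>             url, auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))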
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Taylor
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Taylor Edmiston*
> >>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor> |
> >>>>>>>>>>>>> LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor> |
> >>>>>>>>>>>>> Stack Overflow <https://stackoverflow.com/users/149428/taylor-edmiston>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fokko@driesprong.frl> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Ry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with
> >>>>>>>>>>>>>> Kerberos and he also did the implementation for Airflow.
> >>>>>>>>>>>>>> Besides that, he also worked on implementing Kerberos in
> >>>>>>>>>>>>>> Ambari. Just want to let you know.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers, Fokko
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <ry@astronomer.io> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi everyone -
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We have several bigCo's who are considering using Airflow
> >>>>>>>>>>>>>>> asking about its support for Kerberos.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, will
> >>>>>>>>>>>>>>> likely record a screencast on it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from
> >>>>>>>>>>>>>>> organizations who are using Kerberos with Airflow. If anyone
> >>>>>>>>>>>>>>> would be willing to share their experiences here, or reply to
> >>>>>>>>>>>>>>> me personally, it would be greatly appreciated!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Ry
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> | 513.417.2163 |
> >>>>>>>>>>>>>>> @rywalker <http://twitter.com/rywalker> | LinkedIn <http://www.linkedin.com/in/rywalker>
>
>

Re: Kerberos and Airflow

Posted by Bolke de Bruin <bd...@gmail.com>.
Also: using the Kubernetes executor combined with some of the things we
discussed greatly enhances the security of Airflow as the environment 
isn’t really shared anymore.

B.

> On 2 Aug 2018, at 19:51, Bolke de Bruin <bd...@gmail.com> wrote:
> 
> Hi Dan,
> 
> I discussed this a little bit with one of the security architects here. We think that 
> you can have a fair trade off between security and usability by having
> a kind of manifest with the dag you are submitting. This manifest can then 
> specify what the generated tasks/dags are allowed to do and what metadata 
> to provide to them. We could also let the scheduler generate hashes per generated
> DAG / task and verify those with an established version (1st run?). This limits the 
> attack vector.
> 
> A DagSerializer would be great, but I think it solves a different issue and the above 
> is somewhat simpler to implement?
> 
> Bolke
> 
>> On 29 Jul 2018, at 23:47, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>> 
>> *Let’s say we trust the owner field of the DAGs I think we could do the
>> following.*
>> *Obviously, the trusting the user part is key here. It is one of the
>> reasons I was suggesting using “airflow submit” to update / add dags in
>> Airflow*
>> 
>> 
>> *This is the hard part about my question.*
>> I think in a true multi-tenant environment we wouldn't be able to trust the
>> user, otherwise we wouldn't necessarily even need a mapping of Airflow DAG
>> users to secrets, because if we trust users to set the correct Airflow user
>> for DAGs, we are basically trusting them with all of the creds the Airflow
>> scheduler can access for all users anyways.
>> 
>> I actually had the same thought as your "airflow submit" a while ago, which
>> I discussed with Alex, basically creating an API for adding DAGs instead of
>> having the Scheduler parse them. FWIW I think it's superior to the git time
>> machine approach because it's a more generic form of "serialization" and is
>> more correct as well because the same DAG file parsed on a given git SHA
>> can produce different DAGs. Let me know what you think, and maybe I can
>> start a more formal design doc if you are onboard:
>> 
>> A user or service with an auth token sends an "airflow submit" request to a
>> new kind of Dag Serialization service, along with the serialized DAG
>> objects generated by parsing on the client. It's important that these
>> serialized objects are declarative and not e.g. pickles so that the
>> scheduler/workers can consume them and reproducibility of the DAGs is
>> guaranteed. The service will then store each generated DAG along with its
>> access based on the provided token e.g. using Ranger, and the
>> scheduler/workers will use the stored DAGs for scheduling/execution.
>> Operators would be deployed along with the Airflow code separately from the
>> serialized DAGs.
>> 
>> A serialized DAG would look something like this (basically Luigi-style :)):
>> MyTask - BashOperator: {
>>  cmd: "sleep 1"
>>  user: "Foo"
>>  access: "token1", "token2"
>> }
>> 
>> MyDAG: {
>>  MyTask1 >> SomeOtherTask1
>>  MyTask2 >> SomeOtherTask1
>> }
>> 
>> Dynamic DAGs in this case would just consist of a service calling "Airflow
>> Submit" that does it's own form of authentication to get access to some
>> kind of tokens (or basically just forwarding the secrets the users of the
>> dynamic DAG submit).
>> 
>> For the default Airflow implementation you can maybe just have the Dag
>> Serialization server bundled with the Scheduler, with auth turned off, and
>> to periodically update the Dag Serialization store which would emulate the
>> current behavior closely.
>> 
>> Pros:
>> 1. Consistency across running task instances in a dagrun/scheduler,
>> reproducibility and auditability of DAGs
>> 2. Users can control when to deploy their DAGs
>> 3. Scheduler runs much faster since it doesn't have to run python files and
>> e.g. make network calls
>> 4. Scaling scheduler becomes easier because can have different service
>> responsible for parsing DAGs which can be trivially scaled horizontally
>> (clients are doing the parsing)
>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
>> easier? e.g. can use the Scheduler itself to schedule backfills with a
>> slightly modified serialized version of a DAG.
>> 
>> Cons:
>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
>> callbacks in operators (e.g. on_failure), and jinja_templates
>> 2. Version compatibility problems, e.g. user/service client might be
>> serializing arguments for hooks/operators that have been deprecated in
>> newer versions of the hooks, or the serialized DAG schema changes and old
>> DAGs aren't automatically updated. Might want to have some kind of
>> versioning system for serialized DAGs to at least ensure that stored DAGs
>> are valid when the Scheduler/Worker/etc are upgraded, maybe something
>> similar to thrift/protobuf versioning.
>> 3. Additional complexity - additional service, logic on workers/scheduler
>> to fetch/cache serialized DAGs efficiently, expiring/archiving old DAG
>> definitions, etc
>> 
>> 
>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
>> 
>>> Ah gotcha. That’s another issue actually (but related).
>>> 
>>> Let’s say we trust the owner field of the DAGs I think we could do the
>>> following. We then have a table (and interface) to tell Airflow what users
>>> have access to what connections. The scheduler can then check if the task
>>> in the dag can access the conn_id it is asking for. Auto generated dags
>>> still have an owner (or should) and therefore should be fine. Some
>>> integrity checking could/should be added as we want to be sure that the
>>> task we schedule is the task we launch. So a signature calculated at the
>>> scheduler (or part of the DAG), sent as part of the metadata and checked by
>>> the executor is probably smart.
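>>> 
>>> A minimal sketch of that integrity check (names and key distribution are
>>> illustrative):
>>> 
>>> import hashlib
>>> import hmac
>>> import os
>>> 
>>> SIGNING_KEY = os.environ["AIRFLOW_TASK_SIGNING_KEY"].encode()  # shared secret
>>> 
>>> def sign_task(dag_id, task_id, conn_ids):
>>>     msg = "|".join([dag_id, task_id] + sorted(conn_ids)).encode()
>>>     return hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
>>> 
>>> def verify_task(dag_id, task_id, conn_ids, signature):
>>>     # The executor recomputes and compares in constant time.
>>>     return hmac.compare_digest(sign_task(dag_id, task_id, conn_ids), signature)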
>>> 
>>> You can also make this more fancy by integrating with something like
>>> Apache Ranger that allows for policy checking.
>>> 
>>> Obviously, the trusting the user part is key here. It is one of the
>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>> Airflow. We could enforce authentication on the DAG. It was kind of ruled
>>> out in favor of git time machines although these never happened afaik ;-).
>>> 
>>> BTW: I have updated my implementation with protobuf. Metadata is now
>>> available at executor and task.
>>> 
>>> 
>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavydov@twitter.com.INVALID> wrote:
>>>> 
>>>> The concern is how to secure secrets on the scheduler such that only
>>>> certain DAGs can access them, and in the case of files that create DAGs
>>>> dynamically, only some set of DAGs should be able to access these
>>>> secrets.
>>>> 
>>>> e.g. if there is a secret/keytab that can be read by DAG A generated by
>>>> file X, and file X generates DAG B as well, there needs to be a scheme to
>>>> stop the parsing of DAG B on the scheduler from being able to read the
>>>> secret in DAG A.
>>>> 
>>>> Does that make sense?
>>>> 
>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>> 
>>>>> I’m not sure what you mean. The example I created allows for dynamic
>>>>> DAGs, as the scheduler obviously knows about the tasks when they are
>>>>> ready to be scheduled.
>>>>> This isn’t any different from a static DAG or a dynamic one.
>>>>> 
>>>>> For Kerberos it isn’t that special. Basically a keytab is the user’s
>>>>> revocable credentials in a special format. The keytab itself can be
>>>>> protected by a password. So I can imagine that a connection is defined
>>>>> that sets a keytab location and password to access the keytab.
>>>>> The scheduler understands this (or maybe the Connection model) and
>>>>> serializes and sends it to the worker as part of the metadata. The
>>>>> worker then reconstructs the keytab and issues a kinit or supplies it
>>>>> to the other service requiring it (eg. Spark)
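>>>>> 
>>>>> On the worker that last step might look like this (sketch; error and
>>>>> cleanup handling omitted):
>>>>> 
>>>>> import subprocess
>>>>> import tempfile
>>>>> 
>>>>> def kinit_from_keytab(keytab_bytes, principal):
>>>>>     # NamedTemporaryFile is created 0600, i.e. owner-readable only.
>>>>>     with tempfile.NamedTemporaryFile(prefix="airflow_kt_", delete=False) as f:
>>>>>         f.write(keytab_bytes)
>>>>>         keytab_path = f.name
>>>>>     # Obtain a TGT for the task from the reconstructed keytab.
>>>>>     subprocess.run(["kinit", "-kt", keytab_path, principal], check=True)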
>>>>> 
>>>>> * Obviously the worker and scheduler need to communicate over SSL.
>>>>> * There is a challenge at the worker level. Credentials are secured
>>>>> against other users, but are readable by the owning user. So imagine 2
>>> DAGs
>>>>> from two different users with different connections without sudo
>>>>> configured. If they end up at the same worker and DAG 2 is malicious,
>>>>> it could read files and memory created by DAG 1. This is the reason why
>>>>> using environment variables is NOT safe (DAG 2 could read
>>>>> /proc/<pid>/environ).
>>>>> To mitigate this we probably need to PIPE the data to the task’s STDIN.
>>>>> It won’t solve the issue but will make it harder as now it will only be
>>>>> in memory.