Posted to dev@airflow.apache.org by Gurer Kiratli <gu...@airbnb.com.INVALID> on 2016/10/10 05:40:40 UTC

10/06 Airflow Contributors Meeting Notes

Hi all,

Here are the meeting notes. Please add/modify as needed.

I also have a recording of the meeting. It is in two files because of a
Webex issue that required a restart.

Video 1 <https://airbnb.box.com/s/8b2c5691ux9qmkujum813xyscyol6rod>
Video 2 <https://airbnb.box.com/s/m1uxmkzcedd79f9fwd0s50hcda6hirmc>

Cheers,

Gurer

*Attendees:*

   - Gurer Kiratli
   - Paul Yang
   - Maxime Beauchemin
   - Ben Tallman
   - George Leslie-Waksman
   - Vijay Bhat
   - Rob Froetscher
   - Xuanji Li
   - Bolke de Bruin
   - Catherine Wong
   - Sumit Maheshwari
   - Joe Schmid
   - Julia Hsieh
   - Sam


*Agenda:*

   - Airbnb Update
   - Get an Apache 1.7.3 release out
   - Eliminating cold-case PRs - So far, only Jeremiah and I have added to
     the list below. We need help from all committers:
     https://cwiki.apache.org/confluence/display/AIRFLOW/Whittling+down+PR+List
   - Roadmap



*Meeting Notes*

   - Airflow
      - 60 companies, 200 contributors, a lot of interest in meetups.
      - Growing faster than other projects like Luigi, Oozie, Azkaban.
   - Airbnb Update
      - Cgroups for containment. Bad tasks don’t take down workers.
      - Impersonation. All jobs run under a service account called
        airflow_user. This creates accounting and ownership problems.
      - Creating a new cluster in a different data center elsewhere in the
        world.
   - ING update
      - Moving towards cross data center availability. Airflow might play a
        critical role here. ReAir might be useful.
      - Cgroups is interesting.
      - Integration with legacy technologies with APIs is key.
   - Apache Release
      - Next week we are going to cut a release candidate.
      - Bolke, Max and some others have been working on going from 1.7.1.3
        to 1.7.2 by cherry-picking only the PRs related to Apache
        compliance. This is mostly to test out the process. 1.7.2 will be
        essentially the same as 1.7.1.3, so there is no real need to
        install it.
      - ETA is end of next week.
      - We will have the RC for 1.8.0. We will announce it through the
        Apache mailing list. We should have other folks test it as well.
      - We need to release every month.
      - Each organization has to have a pre-prod/staging environment.
      - We can have an Apache staging environment.
      - There are plugins and niche operators for every company. How are we
        going to handle them?
         - Maybe we can modularize the scheduler, UI etc.
         - If people want certain functionality, it can be introduced as a
           plugin or module.
         - We need to decouple components.
         - We need to design this plugin architecture, which is easier said
           than done. There are already JIRAs for this: JIRA1
           <https://issues.apache.org/jira/browse/AIRFLOW-299>, JIRA2
           <https://issues.apache.org/jira/browse/AIRFLOW-226>.
      - Airflow 2.0?
         - We might want to have a major fork out.
         - This might mean breaking backwards compatibility and
           repackaging.
         - This can keep features from going into 1.x so that they are
           slated for 2.0 instead.
         - We can break the operators into sub-packages.
         - Stateless webservers could be part of this.
         - The DSL for defining pipelines should be backwards compatible.
           Old DAGs should work.
         - We can use Git hashes to get the versions of DAGs.
         - We can have a field in the DAG that specifies which Airflow
           version the DAG was designed for.

   Cold PRs
   -

      We want more committers. More interaction gets you closer to being a
      committer.
      -

      It will be easier to review more PRs.
      -

      Lots of PRs need rebasing and would have conflicts.
      -

      IF you send a PR that is touching the core hence “dangerous” this
      would require much more scrutiny. Get buy in from the committers
      beforehand. Your work might not be committed at all. Having a
design doc is
      a good idea. Or your PR will be treated as a design doc. : ) If we trust
      you, if you have already done committed PRs before there is more
confidence
      so more chance to be reviewed. All the PRs has to be linted and testing
      needs to converged.
      -

      Testing is unstable. Travis is flapping. Some tests have some
      randomness. This has to be fixed. Cos your PR might be good but some test
      will fail, this is misleading.
      -

      We have a very limited control on the GitHub repo. Apache owns it.
      They will not give us the admin. We can’t introduce 3rd party services as
      we are not admins of the repo.
      -

      [Andrew Phillips] In Jcloud <https://github.com/jclouds/jclouds/> we
      have a GitHub-hosted mirror. We can do whatever we want in this
repo. We do
      CI thru this. We can consider this in Airflow too. "Read-only
mirror of ASF
      Git Repo for jclouds http://jclouds.apache.org/"
      -

      Maybe make a GitHub organization like Airflow-Airflow or something.
      -

      Can we have a policy on open PRs that has been open for n weeks? It
      will be closed. Maybe we can have a policy if it hits a certain
age, we ask
      for a rebase and if we don’t hear for n days we should then close.
      -

      We might be able to automate this. Rails does it in a good way. But
      at this stage we might do it manually.
      -

      Let’s have a policy and put it out in the wiki. Also clarify
      ownership of components.
      - For different requests how do we handle communication?
         - Kill JIRA. Use GitHub Issues. Apache allows it. GitHub Issues is
           easier to search.
         - Gitter is not super helpful. It’s more for ad hoc communication,
           like Slack.
         - The dev mailing list would be for generic questions and
           requests. We have to have this. Apache enforces this. All
           decisions have to be made through the Apache mailing list for
           legal reasons. Apache doesn’t want secret decisions. We can have
           a mirror Google group that is subscribed to this. We can
           possibly have multiple Apache mailing lists.
         - Mail is slower but mitigates the issue of being spread across
           different timezones.
      - We can break out tickets into Newbie tickets and Projects. Projects
        will be sponsored by
      - Differentiate the level of interest and the level of desire to
        commit.
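
The stale-PR policy floated under Cold PRs (ask for a rebase once a PR hits
a certain age, close it if nothing is heard for n days) could be sketched
roughly as below. This is only an illustration: the function name, the
parameters and the three outcome labels are made up here, and the
thresholds were left as "n" in the meeting, so they are passed in rather
than fixed.

```python
from datetime import datetime, timedelta
from typing import Optional


def stale_pr_action(
    opened_at: datetime,
    rebase_requested_at: Optional[datetime],
    now: datetime,
    max_age: timedelta,
    grace_period: timedelta,
) -> str:
    """Return "keep", "ask_rebase" or "close" for an open PR.

    Hypothetical helper, not part of Airflow or any GitHub tooling.
    `rebase_requested_at` is None when no rebase request is outstanding
    (assumption: it is reset to None whenever the author responds).
    """
    if rebase_requested_at is not None:
        # A rebase was requested; close only after the grace period passes
        # without a response.
        if now - rebase_requested_at >= grace_period:
            return "close"
        return "keep"
    # No outstanding request: ask for a rebase once the PR is old enough.
    if now - opened_at >= max_age:
        return "ask_rebase"
    return "keep"
```

A committer, or later a bot, could run this over the open PR list and act
on the result by hand, which matches the "do it manually at this stage"
suggestion.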


Q4, 2017 Vision / Roadmap

We will create a wiki page with these and see interest.


Possible Deliverables

   - Integration Testing Environment(s)
   - Modularizing Airflow
   - Containers (Docker, Kubernetes, ECS)
      - How do we handle package management for the DAG and its
        dependencies, the environment?
      - Running on your laptop is an important thing to keep. This helped
        the project a lot. This shouldn’t be let go.
   - Multitenancy
   - Security Improvements
      - UI and CLI level roles.
      - Kerberos support.
      - Putting the password information in a Vault. Currently if you have
        the key you have access to the whole vault.
      - Managing connection pools better.
      - Defining the difference between a hook and a connection.
   - Stateless Webservers
      - This can help with UI improvements.
   - UI improvements
      - Performance, e.g. better asset caching, compression and use of a
        CDN.
      - Reactify.
      - Support for large DAGs.
      - This might be dependent on the API.
   - Managing logs
   - Tighter Dev, test, deploy story
      - The local environment and production environment need to be in
        line.
      - DAG.validate method.
   - REST API
      - External interface through which other applications integrate,
        e.g. to trigger a task.
      - Internal service APIs to get execution details, abstracting the
        database.
      - It’s good to see who is interested in these features (cares about
        them) and who is driving them.
   - Hardening the scheduler. Stabilizing.
      - Documenting how the scheduler operates.
      - Running a single task continuously.
      - More visibility into why we need to restart the scheduler.
      - Clarifying the contract between the scheduler and the workers (???)
      - Stuck scheduler issues.
      - Backfill has a separate code path. It should be a flavor.
   - Better documentation
      - Onboarding documentation
      - Runbook
      - How the system works
   - Event driven scheduler
      - First task snowballing and kicking off other tasks. This reduces
        the white space between when a task is runnable and when it is
        actually run.
   - Revamp the SubDAG operator
      - So many special cases in the code base.
      - It should be handled by the scheduler.
   - Having a task name, separate from the task id. Same for the DAG name
     and DAG id.
   - Remove pickling.