Posted to dev@airflow.apache.org by Gurer Kiratli <gu...@airbnb.com.INVALID> on 2016/10/10 05:40:40 UTC
10/06 Airflow Contributors Meeting Notes
Hi all,
Here are the meeting notes. Please add/modify as needed.
I also have a recording of the meeting. It's in two files due to a Webex
issue that required a restart.
Video 1 <https://airbnb.box.com/s/8b2c5691ux9qmkujum813xyscyol6rod> , Video
2 <https://airbnb.box.com/s/m1uxmkzcedd79f9fwd0s50hcda6hirmc>
Cheers,
Gurer
*Attendees:*
- Gurer Kiratli
- Paul Yang
- Maxime Beauchemin
- Ben Tallman
- George Leslie-Waksman
- Vijay Bhat
- Rob Froetscher
- Xuanji Li
- Bolke de Bruin
- Catherine Wong
- Sumit Maheshwari
- Joe Schmid
- Julia Hsieh
- Sam
*Agenda:*
- Airbnb Update
- Get an Apache 1.7.3 release out
- Eliminating cold-case PRs - So far, only Jeremiah and I have added to
  the list below. We need help from all committers:
  https://cwiki.apache.org/confluence/display/AIRFLOW/Whittling+down+PR+List
- Roadmap
*Meeting Notes*
- Airflow
  - 60 companies, 200 contributors, a lot of interest in meetups.
  - Growing faster than other projects like Luigi, Oozie, Azkaban.
- Airbnb Update
  - Cgroups for containment. Bad tasks don’t take down workers.
  - Impersonation. All jobs currently run under a service account called
    airflow_user. This creates accounting and ownership problems.
  - Creating a new cluster in a different data center in the world.
- ING update
  - Moving towards cross data center availability. Airflow might play a
    critical role here. ReAir might be useful.
  - Cgroups is interesting.
  - Integration with legacy technologies with APIs is key.
- Apache Release
  - Next week we are going to cut a release candidate.
  - Bolke, Max, and some other folks have been working on going from
    1.7.1.3 to 1.7.2 by cherry-picking only the PRs related to Apache
    compliance. This is mostly to test out the process. 1.7.2 will
    essentially be 1.7.1.3, so there is not really a need to install it.
  - ETA is end of next week.
  - We will have the RC for 1.8.0. We will announce it through the Apache
    mailing list. We should have other folks test it as well.
  - We need to release every month.
  - Each organization has to have a pre-prod/staging environment.
  - We can have an Apache staging environment.
  - There are plugins and niche operators for every company. How are we
    going to handle them?
    - Maybe we can modularize the scheduler, UI, etc.
    - If people want certain functionality, it can be introduced as a
      plugin or module.
    - We need to decouple components.
    - We need to design this plugin architecture, which is easier said
      than done. There are already JIRAs for this: JIRA1
      <https://issues.apache.org/jira/browse/AIRFLOW-299>, JIRA2
      <https://issues.apache.org/jira/browse/AIRFLOW-226>.
- Airflow 2.0?
  - We might want to have a major fork.
  - This might mean breaking backwards compatibility and repackaging.
  - This can keep features from going into 1.x and slate them for 2.0
    instead.
  - We can break the operators into sub-packages.
  - Stateless webservers could be part of this.
  - The DSL for defining pipelines should be backwards compatible. Old
    DAGs should work.
  - We can use Git hashes to get the versions of DAGs.
  - We can have a field in the DAG that specifies which Airflow version
    this DAG was designed for.
- Cold PRs
  - We want more committers. More interaction gets you closer to being a
    committer.
  - It will be easier to review more PRs.
  - Lots of PRs need rebasing and would have conflicts.
  - If you send a PR that touches the core, and is hence “dangerous”, it
    will require much more scrutiny. Get buy-in from the committers
    beforehand. Your work might not be committed at all. Having a design
    doc is a good idea. Or your PR will be treated as a design doc. :) If
    we trust you, because you have already had PRs committed, there is
    more confidence and so a better chance of being reviewed. All PRs
    have to be linted, and testing needs to converge.
  - Testing is unstable. Travis is flapping. Some tests have some
    randomness. This has to be fixed, because your PR might be good but
    some test will fail, which is misleading.
  - We have very limited control over the GitHub repo. Apache owns it.
    They will not give us admin rights. We can’t introduce 3rd party
    services as we are not admins of the repo.
  - [Andrew Phillips] In jclouds <https://github.com/jclouds/jclouds/> we
    have a GitHub-hosted mirror. We can do whatever we want in this repo.
    We do CI through this. We can consider this in Airflow too.
    "Read-only mirror of ASF Git Repo for jclouds
    http://jclouds.apache.org/"
  - Maybe make a GitHub organization like Airflow-Airflow or something.
  - Can we have a policy that PRs that have been open for n weeks will be
    closed? Maybe the policy could be that once a PR hits a certain age,
    we ask for a rebase, and if we don’t hear back for n days we then
    close it.
  - We might be able to automate this. Rails does it in a good way. But
    at this stage we might do it manually.
  - Let’s have a policy and put it out in the wiki. Also clarify
    ownership of components.
  - For different requests, how do we handle communication?
    - Kill JIRA. Use GitHub Issues. Apache allows it. GitHub Issues is
      easier to search.
    - Gitter is not super helpful. It’s more for ad hoc communication,
      like Slack.
    - The dev mailing list would be for generic questions and requests.
      We have to have this. Apache enforces this. All decisions have to
      be made through the Apache mailing list for legal reasons. Apache
      doesn’t want secret decisions. We can have a mirror Google Group
      that is subscribed to this. We can possibly have multiple Apache
      mailing lists.
    - Mail is slower but mitigates the issue of being spread across
      different timezones.
  - We can break out tickets into Newbie tickets and Projects. Projects
    will be sponsored by
  - Differentiate the level of interest from the level of desire to
    commit.
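To make the "ask for a rebase, then close" idea for cold PRs a bit more concrete, here is a minimal Python sketch of how such a policy could be evaluated for one PR. Everything in it is an assumption for illustration: the function name, the action labels, and the 8-week / 14-day thresholds are hypothetical stand-ins for the "n weeks" and "n days" that were left open in the meeting, and it does not talk to GitHub at all.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical thresholds; the meeting left both values as "n".
MAX_AGE = timedelta(weeks=8)
GRACE_PERIOD = timedelta(days=14)


def stale_pr_action(last_activity: datetime,
                    rebase_requested_at: Optional[datetime],
                    now: datetime) -> str:
    """Return one of "none", "request_rebase", "wait", "close"."""
    # A rebase was requested and the author has not been active since.
    if rebase_requested_at is not None and last_activity <= rebase_requested_at:
        if now - rebase_requested_at >= GRACE_PERIOD:
            return "close"
        return "wait"
    # No pending request (or the author responded): check the PR's age.
    if now - last_activity >= MAX_AGE:
        return "request_rebase"
    return "none"
```

A committer (or later a bot) could run something like this over the open PR list and act on the result manually, which matches the "do it manually at this stage" note.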
Q4, 2017 Vision / Roadmap
We will create a wiki page with these and see interest.
Possible Deliverables
- Integration Testing Environment(s)
- Modularizing Airflow
- Containers (Docker, Kubernetes, ECS)
  - How do we handle package management for the DAG, its dependencies,
    and the environment?
  - Running on your laptop is an important thing to keep. This helped the
    project a lot. This shouldn’t be let go.
- Multitenancy
- Security Improvements
  - UI and CLI level roles.
  - Kerberos support.
  - Putting the password information in a Vault. Currently if you have
    the key you have access to the whole vault.
  - Managing connection pools better.
  - Defining the difference between a hook and a connection.
- Stateless Webservers
  - This can help UI improvements.
- UI improvements
  - Performance, e.g. better asset caching, compression, and use of a
    CDN.
  - Reactify.
  - Support for large DAGs.
  - This might be dependent on the API.
- Managing logs
- Tighter dev, test, deploy story
  - The local environment and production environment need to be in line.
  - DAG.validate method.
- REST API
  - An outside interface that other applications integrate with, e.g. to
    trigger a task.
  - Internal service APIs to get execution details and to abstract the
    database.
- It’s good to see who is interested in these features (cares about them)
  and who is driving them.
- Hardening the scheduler. Stabilizing.
  - Documenting how the scheduler operates.
  - Running a single task continuously.
  - More visibility into why we need to restart the scheduler.
  - Clarifying the contract between the scheduler and the workers (???)
  - Stuck scheduler issues.
  - Backfill has a separate code path. It should be a flavor.
- Better documentation
  - Onboarding documentation
  - Runbook
  - How the system works
- Event driven scheduler
  - The first task snowballing and kicking off other tasks. This reduces
    the white space between when a task is runnable and when the task is
    actually run.
- Revamp the SubDAG operator
  - So many special cases in the code base.
  - It should be handled by the scheduler.
- Having a task name, separate from task id. Same for the DAG name and
  DAG id.
- Remove pickling.
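As a rough illustration of the "event driven scheduler" item above (a finished task kicking off its downstream tasks right away instead of waiting for the next scheduling pass), here is a small self-contained Python sketch. It is not how the Airflow scheduler works today and uses no Airflow APIs; `make_event_scheduler`, `on_task_success`, and the `upstream` mapping are hypothetical names made up for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def make_event_scheduler(upstream: Dict[str, Set[str]]) -> Callable[[str], List[str]]:
    """upstream maps each task id to the set of task ids it depends on."""
    # Invert the dependency map so we can find what a finished task unblocks.
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for task, deps in upstream.items():
        for dep in deps:
            downstream[dep].add(task)

    done: Set[str] = set()

    def on_task_success(task: str) -> List[str]:
        """Record a success and return the tasks that just became runnable."""
        done.add(task)
        return sorted(
            t for t in downstream[task]
            if t not in done and upstream[t] <= done
        )

    return on_task_success
```

With `upstream = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}`, reporting success for "a" returns "b" and "c" immediately, and "d" only becomes runnable once both "b" and "c" have succeeded. The point is just that the success event itself determines what to run next, which is where the reduction in white space would come from.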