Posted to dev@airflow.apache.org by Kamil Breguła <ka...@polidea.com> on 2020/05/11 12:08:13 UTC

Re: API spec questions

Hello,

I apologize for my very negative behavior earlier; I misunderstood the rules.
I have updated the specification based on the discussion, and I hope I
have not missed any of the suggestions that were made. I tried to make
every change in a separate commit, so you can review the changes one
by one.

Here is the new PR: https://github.com/apache/airflow/pull/8721

Short summary:
* Changed the type of DAGID, DAGRunID, and TaskID. We now use strings
instead of integers.
* EventLog collection is read-only
* Used ExecutionDate in DagRuns endpoints
* Used a custom action to control task instances - removed the PATCH
/dags/{dag_id}/taskInstances/{task_id}/{execution_date} endpoint and
added PATCH /dags/{dag_id}/clearTaskInstances.
* Added endpoint - POST /dagRuns "Trigger a DAG Run"
* Added filter parameters to GET /dags/{dag_id}/taskInstances.
* Added filter parameters to GET /dags/{dag_id}/dagRuns
* Removed DELETE /taskInstances endpoint
* The connection ID is a unique key that identifies the connection
* Moved TaskInstances resource under DagRuns resource
* Fixed many typos
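
As an illustration of the new trigger endpoint above, a run could be
created with a single request. The body fields shown here are assumptions
for the sketch - the exact schema is whatever the PR defines:

POST /dagRuns
{"dag_id": "example_dag", "execution_date": "2020-05-11T00:00:00+00:00", "conf": {}}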

There is one more topic for discussion - reading resources across
multiple collections.
In other words - how do I retrieve task instances for multiple DAGs?
We have several solutions.
I) Currently, the endpoint that returns a list of task instances is
available at /dags/{dag_id}/dagRuns/{execution_date}/taskInstances.
This endpoint supports reading resources across multiple DAGs by
specifying "-" as the dag_id or the execution_date.
This is based on Google recommendations - AIP-159 [1]. I relied on
these recommendations because they are the most comprehensive study of
API design principles. Ash Berlin-Taylor rightly pointed out that this
would be a backward-incompatible change.
II) We can use a different character in the same role - '*'. This
character cannot appear in a DAG/task ID, so it is safe.
III) Ash proposed adding a separate endpoint that does not include the
DAG ID in the path - /taskInstances
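
To make the options concrete, the same query - "all task instances for any
DAG on a given date" - would look roughly like this (the filter parameter
in the last line is illustrative, not part of the current spec):

I/II) GET /dags/-/dagRuns/2020-01-01/taskInstances   (option II uses '*' instead of '-')
III)  GET /taskInstances?execution_date=2020-01-01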

If we want to choose one solution, I think it's worth looking at
what the endpoints for DAGs look like:
/dags
/dags/{dag_id}
/dags/{dag_id}/clearTaskInstances
/dags/{dag_id}/dagRuns
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
/dags/{dag_id}/dagRuns/{execution_date}
/dags/{dag_id}/structure
/dags/{dag_id}/tasks
/dags/{dag_id}/tasks/{task_id}
In my opinion, this is a clear hierarchy of resources that will make
the API easier to use. The third solution disturbs this hierarchy: the
endpoints used to retrieve lists of elements end up at the top
level.
We will have the following endpoints:
/dags
/dags/{dag_id}
/dags/{dag_id}/clearTaskInstances
/dagRuns
/dags/{dag_id}/dagRuns
/taskInstances
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
/xcomEntries
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
/dags/{dag_id}/dagRuns/{execution_date}
/dags/{dag_id}/structure
/dags/{dag_id}/tasks
/dags/{dag_id}/tasks/{task_id}

IV) Some endpoints will have similar behavior, so we can remove the
duplicates. Then we will have the following list of endpoints:
/dags
/dags/{dag_id}
/dags/{dag_id}/clearTaskInstances
/dagRuns
/taskInstances
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
/xcomEntries
/dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
/dags/{dag_id}/dagRuns/{execution_date}
/dags/{dag_id}/structure
/dags/{dag_id}/tasks
/dags/{dag_id}/tasks/{task_id}

Which solution do you like the most? I, II, III, or IV?

Best regards,
Kamil Breguła

[1] https://aip.dev/159

On Wed, Apr 15, 2020 at 8:32 PM Daniel (Daniel Lamblin ) [Data
Infrastructure] <la...@coupang.com> wrote:
>
> I'm glad the general mood is that connection_id should be unique.
> FWIW when I had multiple connections in v1.8.2 with mysql I didn't seem to be getting any randomized load balancing anyway. Then again, maybe random was just 100 selections of 1 over 2.
> There are many other ways to load balance connections, each specific to the service type, so I don't see it as Airflow's place to provide a semi-generic option for it.
>
> +1 for connection ID being unique.
>
> Pardon Outlook for mangling the links to where ConnectionID, DagID and PoolID are integers in a version of the API.
> Are we past that decision already? I'd expect to use string names.
>
> I'd also asked about DAG run ID, task ID, and finally whether there'd be an endpoint with which to clear tasks, because CRUD operations don't model the interplay of task instance, jobs, and dag run state involved.
> -Daniel
>
> On 4/14/20, 8:49 AM, "Xinbin Huang" <bi...@gmail.com> wrote:
>
>
>
>     +1 on making connection IDs unique. It's confusing to have Airflow handle
>     load balancing here.
>
>     On Tue, Apr 14, 2020, 4:59 AM Kaxil Naik <ka...@gmail.com> wrote:
>
>     > +1 to make connection ids unique
>     >
>     > On Tue, Apr 14, 2020 at 11:59 AM Jarek Potiuk <Ja...@polidea.com>
>     > wrote:
>     >
>     > > I also support making the connection ID unique.
>     > >
>     > > I've worked with a similar approach a long time ago (tens of years), and
>     > > it was fine then, when we had not yet figured out how to scale
>     > > client/server architecture and did not have all the nice infrastructure
>     > > like load balancing, cloud services, etc. I believe that in most current
>     > > services/systems, load balancing should be handled by the service itself
>     > > or some kind of proxy in between - not by the client side - in
>     > > production environments.
>     > >
>     > > It's far more robust and might provide much better control (you can
>     > > control multiple independent clients' connections and not rely on random
>     > > distribution of connections). There are scenarios where this feature
>     > > would not work well - for example, physical proximity of the workers, or
>     > > current load on different services. It is very limiting to rely on this
>     > > feature.
>     > >
>     > > On Tue, Apr 14, 2020 at 11:36 AM Kamil Breguła <
>     > kamil.bregula@polidea.com>
>     > > wrote:
>     > >
>     > > > Hello,
>     > > >
>     > > > This can cause big problems with idempotence. According to RFC-7231,
>     > > > the DELETE method should be idempotent.
>     > > >
>     > > > For example:
>     > > > If you want to delete the items with indexes 1 to 4, you would send
>     > > > the following requests:
>     > > > DELETE /connections/hdfs_default/4
>     > > > DELETE /connections/hdfs_default/3
>     > > > DELETE /connections/hdfs_default/2
>     > > > DELETE /connections/hdfs_default/1
>     > > >
>     > > > However, if the user sends these requests in a different order, they
>     > > > will only delete the first and third items.
>     > > > DELETE /connections/hdfs_default/1
>     > > > DELETE /connections/hdfs_default/2
>     > > > DELETE /connections/hdfs_default/3
>     > > > DELETE /connections/hdfs_default/4
>     > > >
>     > > > If you use asynchronous HTTP clients (popular in Node), the order of
>     > > > requests will not be kept. Simultaneous modifications will also be a
>     > > > big problem.
>     > > >
>     > > > I am also in favor of abandoning support for multiple connections per
>     > > > ID. This can be solved at a different layer.
>     > > >
>     > > > Best regards,
>     > > > Kamil
>     > > >
>     > > >
>     > > > On Thu, Apr 9, 2020 at 6:47 PM Ash Berlin-Taylor <as...@apache.org>
>     > wrote:
>     > > > >
>     > > > > They are, but we _can_ make a choice to remove that feature -- it is
>     > > > > not widely used and is confusing to many when they stumble on it.
>     > > > >
>     > > > > It's not something we should do lightly, but it is a possibility.
>     > > > >
>     > > > > I think I'm probably leaning towards the "ordinal" concept:
>     > > > >
>     > > > > /connections/hdfs_default -> list of connections with that ID
>     > > > > /connections/hdfs_default/0 first connection of that type
>     > > > >
>     > > > > Something like that.
>     > > > >
>     > > > > On Apr 9 2020, at 2:31 pm, Shaw, Damian P.
>     > > > > <da...@credit-suisse.com> wrote:
>     > > > >
>     > > > > > FYI if you look back at the thread "Re: [2.0 spring cleaning]
>     > > > > > Require unique conn_id" on 2019-04-14 you can see a message from
>     > > > > > Kevin Yang stating that this random choice of connections is a
>     > > > > > "feature" used to load balance connections in AirBnB. So users are
>     > > > > > relying on this behavior.
>     > > > > >
>     > > > > >
>     > > > > > -----Original Message-----
>     > > > > > From: Daniel (Daniel Lamblin ) [Data Infrastructure]
>     > > > > > <la...@coupang.com>
>     > > > > > Sent: Wednesday, April 8, 2020 20:01
>     > > > > > To: dev@airflow.apache.org
>     > > > > > Subject: Re: API spec questions
>     > > > > >
>     > > > > > Having been bitten by accidentally having two connections with the
>     > > > > > same name or conn_id, I'd prefer it were made unique. In my
>     > > > > > experience there's little utility in having multiple connections
>     > > > > > with the same name. Tasks that use a connection don't choose one
>     > > > > > fairly randomly; rather, they seem pretty consistent in using the
>     > > > > > one created earliest, which often has the lower integer id.
>     > > > > >
>     > > > > > Circling back to how this is used by the API: from a user
>     > > > > > perspective, the following in-path integer fields were ones I'd
>     > > > > > have expected to be strings instead:
>     > > > > > ConnectionID
>     > > >
>     > >
>     > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
>     > > > > > DAGID
>     > > >
>     > >
>     > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
>     > > > > > PoolID
>     > > >
>     > >
>     > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901
>     > > > > >
>     > > > > > Though it’s a url-encoding hassle, I also expected that DAGRunID
>     > > > > > would be more like the Run ID, e.g.
>     > > > > > "scheduled__2020-04-08T23:10:00+00:00",
>     > > > > > "manual__2020-04-08T23:00:56+00:00",
>     > > > > > "manual__2020-04-08T16:24:56.692868+00:00" ,
>     > > > > > "backfill_2020-04-08T22:05:00+00:00" etc
>     > > > > >
>     > > > > > Then TaskID is confusing to me; AFAIK the PK of task instances is
>     > > > > > (task_id, dag_id, execution_date), and the API call appears to
>     > > > > > align with that, having the pattern:
>     > > > > > /dags/{dag_id}/taskInstances/{task_id}/{execution_date}
>     > > > > > But if task_id is a numbered id, then… execution_date isn't even
>     > > > > > needed… I'm thinking it should have been a string.
>     > > > > >
>     > > > > > An aside to this, I've always wondered what happens if an
>     > > > > > externally triggered DAG Run has the same execution date as a
>     > > > > > pre-existing scheduled DAG Run. They'd have different run_ids, e.g.
>     > > > > > "scheduled__2020-04-08T23:10:00+00:00" vs
>     > > > > > "manual__2020-04-08T23:10:00+00:00", but then task instances for
>     > > > > > those runs might not be unique.
>     > > > > >
>     > > > > > Lastly, the UI and CLI operation of clearing tasks seems analogous
>     > > > > > to the delete task instance API endpoint. But probably it's not,
>     > > > > > and this could become confusing to users.
>     > > > > > There should be a non-db-model api call for clearing tasks like you
>     > > > > > can from the UI and the CLI. If I read it right, clearing does the
>     > > > > > following: it sets the state of the task instance to None, unless
>     > > > > > the state was Running, in which case it sets it and the related job
>     > > > > > ids to Shutdown. It deletes reschedules of the TI, and it sets the
>     > > > > > dag runs for those task instances back to Running.
>     > > > > > This would be a lot to do for a user using PATCH calls to change
>     > > > > > Task Instances and Dag Runs together (and there's no API for Jobs).
>     > > > > >
>     > > > > > -Daniel L.
>     > > > > > P.S. as far as renaming all parameters on operators and hooks with
>     > > > > > *_conn_id, I do not want to see that breaking change happen. But IF
>     > > > > > it HAS TO, I'm still of the opinion that the default_args use for
>     > > > > > X_conn_id is not preferable to having all operators take simpler
>     > > > > > source_conn_id and optional target_conn_id parameter names that are
>     > > > > > consistent across all operators.
>     > > > > >
>     > > > > > On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <as...@apache.org> wrote:
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > > >    To expand on the "so I think we need to do one of":
>     > > > > >
>     > > > > >
>     > > > > >    - we need to update the name of "conn_id" somehow. I can't
>     > > > > >    think of a better option, and given all the operators have
>     > > > > >    "x_conn_id" I don't think that change should be made lightly.
>     > > > > >
>     > > > > >    - make conn_id unique (this "poor" HA has been a source of
>     > > > > >    confusion in the past) and the ID we use in the API
>     > > > > >
>     > > > > >    Or a third option:
>     > > > > >
>     > > > > >    - Have the API take conn_id and either return all conns for that
>     > > > > >    conn_id, or conn_id plus an ordinal (?idx=0 type thing) to
>     > > > > >    return a single value.
>     > > > > >
>     > > > > >    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <as...@apache.org>
>     > > > wrote:
>     > > > > >
>     > > > > >    > Hi everyone,
>     > > > > >    >
>     > > > > >    > So as I mentioned in the AIP voting thread, I think we need
>     > > > > >    > to give some more thought to how we are exposing connection
>     > > > > >    > ids in the API.
>     > > > > >    >
>     > > > > >    > Right now as proposed (and merged without approval, not cool.
>     > > > > >    > The AIP we voted on did not contain a PR against
>     > > > > >    > apache/airflow.) it has an end point of
>     > > > > >    > `/connections/{connection_id}`
>     > > > > >    >
>     > > > > >    > My issue here is as I said in the previous thread: that is
>     > > > > >    > going to be mighty confusing to our users, because there is a
>     > > > > >    > "conn_id" concept that is exposed, so people are going to try
>     > > > > >    > putting "aws_default" etc in there. I don't care what the API
>     > > > > >    > spec says, I care what our users expect, and having a
>     > > > > >    > connection_id/id and a conn_id field is just confusing
>     > > > > >    >
>     > > > > >    > So I think we need to do one of:
>     > > > > >    > - we need to update the name of "conn_id" somehow, make
>     > > > > >    > conn_id unique (this "poor" HA has been a source of confusion
>     > > > > >    > in the past)
>     > > > > >    >
>     > > > > >    > There are similar problems for the DAG run -- the spec has the
>     > > > > >    > type as an integer, but everything else about the Airflow
>     > > > > >    > system uses the (unique) run_id parameter, and I would expect
>     > > > > >    > the API to use that. The autoinc. id on the run column is
>     > > > > >    > literally never used in the code base, so exposing that via
>     > > > > >    > the API seems odd.
>     > > > > >    >
>     > > > > >    > A few other smaller points:
>     > > > > >    >
>     > > > > >    > EventLog: Those are "audit"/action logs, so we probably
>     > > > > >    > shouldn't let people delete them via the API!
>     > > > > >    >
>     > > > > >    > pool_id: still an integer. It should be the "name".
>     > > > > >    >
>     > > > > >    > How should add/delete variables and connections work in the
>     > > > > >    > API with the addition of the new "Secrets Backends"?
>     > > > > >    >
>     > > > > >    > xcomValues: task_id is listed as an integer.
>     > > > > >    >
>     > > > > >    >
>     > > > > >    > -ash
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > >
>     > > >
>     > >
>     > > > > >
>     > > >
>     > >
>     > >
>     > > --
>     > >
>     > > Jarek Potiuk
>     > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>     > >
>     > > M: +48 660 796 129 <+48660796129>
>     > > [image: Polidea] <https://www.polidea.com/>
>     > >
>     >
>
>

Re: API spec questions

Posted by Kamil Breguła <ka...@polidea.com>.
On Tue, May 12, 2020 at 12:10 PM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> Hi Kamil,
>
> Thanks for re-opening this discussion!
>
> My initial complaint about "putting multiple values in a single path
> element" is that it feels messy, and confusing, and very unlike any REST
> API I've consumed or written before.
>
> However I've just realised that I may have _wildly_ misunderstood the proposal.
>
> I (somehow!) thought you were suggesting this:
>
>   GET /dags/dag_id1-dag_id2-dag_id3/dagRuns/2020-01-01T00/taskInstances
>
> But on reading Google's https://aip.dev/159 that now makes more sense,
> and that isn't what you were suggesting, but instead a single, literal
> `-` to mean "any dag id". Is this correct?
>

Yes. In detail, it means that the SQL query will not have a condition:
/dags/{dag_id}/taskInstances
SELECT * FROM task_instance WHERE dag_id = {dag_id}
/dags/~/taskInstances
SELECT * FROM task_instance
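
For illustration, a hypothetical handler sketch in Python/SQLAlchemy (the
function name and wiring are assumptions, not the actual view code from
the PR):

from airflow.models import TaskInstance

def list_task_instances(session, dag_id):
    # "~" is the wildcard path segment: skip the WHERE clause entirely
    query = session.query(TaskInstance)
    if dag_id != "~":
        query = query.filter(TaskInstance.dag_id == dag_id)
    return query.all()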

> So I think then my only ask is that we have a `dag_ids` parameter that
> we can use to filter on :)
>
>
>
>
> Additionally/relatedly, when selecting across multiple DAGs we would very
> quickly run in to the issue that was fixed in github.com/apache/airflow/pull/7364
> -- where asking for all the task stats for the DAGs on a single page
> exceeded the HTTP request line limit.
>
> i.e. this could break if the DAG ids are large, or if there is a large
> number of dags to be fetched.
>
>   GET /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
>
> How about this as an additional endpoint:
>
>   POST /dags/-/dagRuns/2020-01-01/taskInstances/list
>
>   dag_ids=...&dag_ids=...&dag_ids...
>
>
> (The /list suffix isn't needed here as there is no POST for
> taskInstances, but there is in other places, so making it explicit means
> we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
>
>

We can add it, but I'm not sure; I have the impression that this will
require a separate specification and the design of a batch
API.

Currently, if you would like to fetch resources from several
collections, you can send multiple HTTP requests:
GET /dags/DAG_A/dagRuns/2020-01-01/taskInstances
GET /dags/DAG_B/dagRuns/2020-01-01/taskInstances
GET /dags/DAG_C/dagRuns/2020-01-01/taskInstances

I would like to focus on making all operations possible, but not
necessarily making the API efficient in all cases.
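
For example, a client-side sketch in Python (the host, API prefix, and
response shape are assumptions - the payload may well be a wrapped object
rather than a plain list):

import requests

BASE = "http://localhost:8080/api/v1"
task_instances = []
for dag_id in ["DAG_A", "DAG_B", "DAG_C"]:
    resp = requests.get(f"{BASE}/dags/{dag_id}/dagRuns/2020-01-01/taskInstances")
    resp.raise_for_status()
    # assuming the endpoint returns a JSON list of task instance objects
    task_instances.extend(resp.json())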

>
>
> If it is valid to pass just one of ExecutionDateStartRange and
> ExecutionDateEndRange, could we call it ExecutionDateGTE etc. (for
> Greater-than-or-equal) so it makes more sense when a single parameter is provided.
>
>

I will change it.

>
>
> A minor style point: The filter/query string fields are all defined as
> BumpyCase, but I'd prefer us to use snake_case (`execution_date_gte`
> etc) - the only sort of applicable proposal I can find is https://aip.dev/140
>

I will change it.

> (As in I can't find anything in Google recs that talk about case for
> case of query string/filter params. I guess they suggest having it all
> as a single `?filter=` param :- https://aip.dev/160 )
>
The AIP guides discuss the latest recommendations for API design;
Google formerly used query parameters.
Example:
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/list


> -ash
>
>
>
> On May 12 2020, at 12:12 am, Jarek Potiuk <Ja...@polidea.com> wrote:
>
> > While I understand the benefits of a "clean" "REST" API where resources
> > are identified by a nice hierarchy, I do not think it is necessary to
> > follow it for this kind of query. I think the "resources" based approach
> > indeed breaks down when you want to make more complex queries and want to
> > do it efficiently, without running many API calls or returning superfluous
> > information. I think that's the reason where GraphQL shines compared to
> > REST-like APIs. I think in this case "purity" impacts performance a lot.
> >
> > I know we did not choose GraphQL (and for good reasons - its
> > complexity is not needed in our case) - but I think there are cases where
> > we want efficiency and want to limit the number of requests or superfluous
> > information retrieved.
> >
> > But looking at the balance between purity and efficiency for asking about
> > "no matter which dag" or "no matter which executionDate"  - I think the
> > AIP-159 proposal strikes a good balance. It's pretty natural, easy to read
> > and does not break the structure.
> >
> > I would not use "*" because it can be interpreted differently when
> > URL-encoded/decoded, since it is a reserved character
> > (https://tools.ietf.org/html/rfc3986#section-2.2) - and some languages
> > (apparently PHP
> > <https://stackoverflow.com/questions/6533561/urlencode-the-asterisk-star-character>)
> > might have problems with encoding it accidentally. I'd use one of the
> > unreserved ones (https://tools.ietf.org/html/rfc3986#section-2.3). If -
> > (dash) is out of the question due to compatibility, I'd use ~ (tilde).
> >
> > So if we only want to focus on this kind of query, I think option II. (but
> > with ~) is good for me.
> >
> > I have only one doubt about it. Do we want to handle other queries to
> > taskInstances across dag_id or execution_date? For example, ones
> > corresponding to DAG_ID IN (DAG_1, DAG_2, DAG_3) or DAG_ID LIKE ('DAG_%')
> > from typical SQL queries? I think that might be a useful type of query,
> > even more so for executionDate. From what I see we do not have filters for
> > those, but I wonder if this is not something we could do with additional
> > filters and still keep the nice structure?
> >
> > For example:
> >
> > /dags/-/dagRuns/{execution_date}/taskInstances?dag_id=DAG_1,DAG_2,DAG_3
> >
> > Not sure if this is something we even want to follow?
> >
> > Ash - what are your thoughts about it? Why do you have doubts about the
> > AIP-159 approach? What problems do you foresee with it?
> >
> > J.
> >
> >
> >
> > On Mon, May 11, 2020 at 7:37 PM QP Hou <qp...@scribd.com> wrote:
> >
> >> I would recommend option III.
> >>
> >> IMHO, hierarchy-based endpoints are for CRUD operations on a single
> >> entity. The flat endpoint, like /taskInstances, is for read-only
> >> multi-facet search queries.
> >>
> >> For example:
> >>
> >> * to create a resource Foo: "POST /api/parent/{parent_id}/foo"
> >> * to update a resource Foo: "PUT /api/parent/{parent_id}/foo/{foo_id}"
> >> * to get a single Foo entity: "GET /api/parent/{parent_id}/foo/{foo_id}"
> >> * to search for multiple Foo entities using multiple filters: "GET
> >> /api/foo?parent_id=id1,id2,id3&owner=user1"
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >

Re: API spec questions

Posted by Ash Berlin-Taylor <as...@apache.org>.
1. "list" is not even close to a valid execution date, so I don't see that as a clash.

2. Respond with 400 Bad Request if the dag_ids query param is provided but the URL dag id is not whatever wildcard char we pick. (Is ~ URL safe?)

3. Docs is the answer to that. "This is the same as GET, but allows for longer request bodies"

Reason I'd like such a filter is that without it our request requires N http requests and N db queries instead of 1. Even doing it as batch requests would still need N queries.
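
For comparison, a dag_ids filter lets the server answer with a single query, roughly:

SELECT * FROM task_instance
WHERE execution_date = '2020-01-01'
  AND dag_id IN ('DAG_A', 'DAG_B', 'DAG_C')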

On 12 May 2020 18:00:50 BST, "Kamil Breguła" <ka...@polidea.com> wrote:
>Hello,
>
>It depends on the specific implementation of the Batch API. The
>Microsoft API can be used with requests.
>https://docs.microsoft.com/en-us/graph/json-batching
>
>I would not like to add new endpoints to the API due to premature
>optimization. Adding a new endpoint - /dags/{dag_id}/dagRuns/list/ -
>opens up many questions.
>1. We currently have an endpoint that has a similar URL -
>/dags/{dag_id}/dagRuns/{execution_date}/ - so we have a collision.
>/dags/{dag_id}/dagRuns/list/ can be interpreted in two ways:
>1. /dags/{dag_id}/dagRuns/list/
>2. /dags/{dag_id}/dagRuns/{execution_date}/ with execution_date =
>"list"
>First, the endpoint URL pattern is matched, and then the parameters
>are verified in the next step.
>
>2. This will also make it difficult to use the API, since the DAG ID will
>appear in several places, which can cause confusion: first in the URL
>with the wildcard - "~" (dag_id) - and a second time in the request body
>(dag_ids).
>POST /dags/{dag_id}/dagRuns/list/
>
>dag_ids=DAG_A&dag_ids=DAG_B&dag_ids=DAG_C
>
>3. We will have two endpoints with similar behavior:
>GET /dags/{dag_id}/dagRuns/
>POST /dags/{dag_id}/dagRuns/list/
>This will cause confusion among users.
>
>Are there any reasons other than performance to add the dag_ids
>filter? Currently, the user can realize their use case,
>and that is probably the most important thing.
>
>Best regards
>
>On Tue, May 12, 2020 at 6:34 PM Ash Berlin-Taylor <as...@apache.org>
>wrote:
>>
>> Such a batch endpoint is much much harder for API clients to build
>requests for, and consume (you can no longer just use cURL/requests/any
>http client), so I'm not a fan of that
>>
>> On 12 May 2020 17:07:12 BST, "Kamil Breguła"
><ka...@polidea.com> wrote:
>>>
>>> On Tue, May 12, 2020 at 3:49 PM Jarek Potiuk
><Ja...@polidea.com> wrote:
>>>>
>>>>
>>>>  My 3 cents:
>>>>
>>>>
>>>>>  But on reading Google's https://aip.dev/159 that now makes more sense,
>>>>>  and that isn't what you were suggesting, but instead a single, literal
>>>>>  `-` to mean "any dag id". Is this correct?
>>>>>
>>>>
>>>>  That's also my understanding.
>>>>
>>>>
>>>>>  So I think then my only ask is that we have a `dag_ids` parameter that
>>>>>  we can use to filter on :)
>>>>>
>>>>
>>>>  Yep. This is definitely needed.
>>>>
>>>>>  Additionally/related when selecting across multiple DAGs we would very
>>>>>  quickly run in to the issue that was fixed in
>>>>>  github.com/apache/airflow/pull/7364
>>>>>  -- where asking for all the task stats for the DAGs on a single page
>>>>>  exceeded the HTTP request line limit.
>>>>>
>>>>>  i.e. this could break if the DAG ids are large, or if there is a large
>>>>>  number of dags to be fetched.
>>>>>
>>>>>    GET
>>>>>  /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
>>>>>
>>>>>  How about this as an additional endpoint:
>>>>>
>>>>>    POST /dags/-/dagRuns/2020-01-01/taskInstances/list
>>>>>
>>>>>    dag_ids=...&dag_ids=...&dag_ids...
>>>>>
>>>>
>>>>  Very sensible proposal. The URL max length very much depends on the
>>>>  client (especially when it is run from a browser), where POST is
>>>>  virtually unlimited. It's a common thing to use POST in this case.
>>>>
>>>
>>> I'm not sure ... I think the following conventions are most commonly used:
>>> GET /items/ - fetch list of items
>>> POST /items/ - create item
>>>
>>> Google recommends using the batch API endpoint in these cases. This
>>> involves sending a single HTTP request using the POST method, where the
>>> request body contains many HTTP requests for multiple collections.
>>>
>>> POST /batch/v1 HTTP/1.1
>>> Authorization: Bearer your_auth_token
>>> Content-Type: multipart/mixed; boundary=batch_foobarbaz
>>> Content-Length: total_content_length
>>>
>>> --batch_foobarbaz
>>> Content-Type: application/http
>>> Content-ID: <it...@barnyard.example.com>
>>>
>>> GET /dag/DAG_A/taskInstance/
>>>
>>> --batch_foobarbaz
>>> Content-Type: application/http
>>>
>>> GET /dag/DAG_B/taskInstance/
>>>
>>> --batch_foobarbaz
>>> Content-Type: application/http
>>>
>>> GET /dag/DAG_C/taskInstance/
>>>
>>> --batch_foobarbaz--
>>>
>>> Here is more information:
>>> https://developers.google.com/gmail/api/guides/batch
>>>
>>> We can, of course, come up with our own approach and not rely on the idea
>>> of Google's solutions. I refer to Google because I have experience with
>>> their API design principles. I would like this API to be permanent, with
>>> no need to introduce breaking changes later. It is always possible
>>> to add filtering and endpoints later, but removing them is a breaking
>>> change.
>>>
>>> Kubernetes doesn't provide a batch API, and users build their
>>> solutions without it.
>>>
>>> I also looked at how Microsoft implements this.
>>> https://docs.microsoft.com/en-us/graph/json-batching
>>>
>>> I will quote only a fragment that can be valuable to us. It
>>> convinces me that if we don't have a batch API, then we should
>>> carefully add the filtering option in the query string.
>>>
>>>>  Bypassing URL length limitations with batching
>>>>
>>>>  An additional use case for JSON batching is to bypass URL length
>limitations. In cases
>>>>  where the filter clause is complex, the URL length might surpass
>limitations built into
>>>>  browsers or other HTTP clients. You can use
>>>>  JSON batching as a workaround for running these requests because
>the
>>>>  lengthy URL simply becomes part of the request payload.
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>  (The /list suffix isn't needed here as there is no POST for
>>>>>  taskInstances, but there is in other places, so making it
>explicit means
>>>>>  we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
>>>>>
>>>>>
>>>>
>>>>>
>>>>>  If it is valid to pass just one of ExecutionDateStartRange and
>>>>>  ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
>>>>>  Greater-than-or-equal) so it makes more sense when a single
>parameter is
>>>>>  provided.
>>>>>
>>>>>  +1
>>>>
>>>>
>>>>
>>>>>  A minor style point: The filter/query string fields are all
>defined as
>>>>>  BumpyCase, but I'd prefer us to use snake_case
>(`execution_date_gte`
>>>>>  etc) - the only sort of applicable proposal I can find is
>>>>>  https://aip.dev/140
>>>>>
>>>>>
>>>>  Yeah. In this case I also like snakes better than Bumpys or
>Camels.
>>>>  Especially python ;).
>>>>
>>>>
>>>>>  (As in I can't find anything in Google recs that talks about the
>>>>>  casing of query string/filter params. I guess they suggest having
>>>>>  it all as a single `?filter=` param :- https://aip.dev/160 )
>>>>>
>>>>
>>>>  Yeah. I think I'd also want to understand better how filter
>examples
>>>>  would look. It's not obvious from the spec (no time to dig
>deeply).
>>>>
>>>>  J.

Re: API spec questions

Posted by Kamil Breguła <ka...@polidea.com>.
Hello,

It depends on the specific implementation of the Batch API. The
Microsoft API can be used with requests.
https://docs.microsoft.com/en-us/graph/json-batching
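For illustration, here is a minimal sketch of such a JSON batch call
with the "requests" library. The /batch endpoint and base URL are
hypothetical - Airflow does not expose them; the sketch only shows that
a JSON-batching client needs nothing beyond an ordinary HTTP library:

    import requests

    # Hypothetical batch endpoint, JSON-batching style (as in MS Graph).
    batch_body = {
        "requests": [
            {"id": "1", "method": "GET", "url": "/dags/DAG_A/taskInstances"},
            {"id": "2", "method": "GET", "url": "/dags/DAG_B/taskInstances"},
        ]
    }
    response = requests.post(
        "https://airflow.example.com/api/v1/batch",  # hypothetical
        json=batch_body,
        headers={"Authorization": "Bearer your_auth_token"},
    )
    # MS Graph-style responses come back as one array of sub-responses.
    for sub in response.json()["responses"]:
        print(sub["id"], sub["status"])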

I would not like to add new endpoints to the API as premature
optimization. Adding a new endpoint - /dags/{dag_id}/dagRuns/list/ -
opens up many questions.
1. We currently have an endpoint with a similar URL -
/dags/{dag_id}/dagRuns/{execution_date}/ - so we have a collision.
/dags/{dag_id}/dagRuns/list/ can be interpreted in two ways:
1. /dags/{dag_id}/dagRuns/list/
2. /dags/{dag_id}/dagRuns/{execution_date}/ with execution_date = "list"
First, the endpoint URL pattern is matched, and only then are the
parameters validated.

2. This will also make the API harder to use, since the DAG ID will
appear in several places, which can cause confusion: first in the URL
with the wildcard - "~" (dag_id) - and a second time in the request
body (dag_ids).
POST /dags/{dag_id}/dagRuns/list/

dag_ids=DAG_A&dag_ids=DAG_B&dag_ids=DAG_B

3. We will have two endpoints with similar behavior:
GET /dags/{dag_id}/dagRuns/
POST /dags/{dag_id}/dagRuns/list/
This will cause confusion among users.

Are there any reasons other than performance to add the dag_ids filter?
Currently, the user can accomplish their use case, and this is probably
the most important thing.

Best regards

On Tue, May 12, 2020 at 6:34 PM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> Such a batch endpoint is much much harder for API clients to build requests for, and consume (you can no longer just use cURL/requests/any http client), so I'm not a fan of that
>
> On 12 May 2020 17:07:12 BST, "Kamil Breguła" <ka...@polidea.com> wrote:
>>
>> On Tue, May 12, 2020 at 3:49 PM Jarek Potiuk <Ja...@polidea.com> wrote:
>>>
>>>
>>>  My 3 cents:
>>>
>>>
>>>>  But on reading Google's https://aip.dev/159 that now makes more sense,
>>>>  and that isn't what you were suggesting, but instead a single, literal
>>>>  `-` to mean "any dag id". Is this correct?
>>>>
>>>
>>>  That's also my understanding.
>>>
>>>
>>>>  So I think then my only ask is that we have a `dag_ids` parameter that
>>>>  we can use to filter on :)
>>>>
>>>
>>>  Yep. This is definitely needed.
>>>
>>>  Additionally/related when selecting across multiple DAGs we would very
>>>>
>>>>  quickly run into what was fixed in github.com/apache/airflow/pull/7364
>>>>  -- where asking for all the task stats for the DAGs on a single page
>>>>  exceeded the HTTP request line limit.
>>>>
>>>>  i.e. this could break if the DAG ids are large, or if there is a large
>>>>  number of dags to be fetched.
>>>>
>>>>    GET
>>>>  /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
>>>>
>>>>  How about this as an additional endpoint:
>>>>
>>>>    POST /dags/-/dagRuns/2020-01-01/taskInstances/list
>>>>
>>>>    dag_ids=...&dag_ids=...&dag_ids...
>>>>
>>>
>>>  Very sensible proposal. The URL max length very much depends on the client
>>>  (especially when it is run from a browser), whereas POST is virtually
>>>  unlimited. It's a common thing to use POST in this case.
>>>
>>
>> I'm not sure ... I think the following conventions are most commonly used
>> GET /items/ - fetch list of items
>> POST /items/ - create item
>>
>> Google recommends using the batch API endpoint in these cases. This
>> involves sending a single HTTP request using the POST method, but the
>> request body has many HTTP requests for multiple collections.
>>
>> POST /batch/v1 HTTP/1.1
>> Authorization: Bearer your_auth_token
>> Content-Type: multipart/mixed; boundary=batch_foobarbaz
>> Content-Length: total_content_length
>>
>> --batch_foobarbaz
>> Content-Type: application/http
>> Content-ID: <it...@barnyard.example.com>
>>
>> GET /dag/DAG_A/taskInstance/
>>
>> --batch_foobarbaz
>> Content-Type: application/http
>>
>> GET /dag/DAG_B/taskInstance/
>>
>> --batch_foobarbaz
>> Content-Type: application/http
>>
>> GET /dag/DAG_B/taskInstance/
>>
>> --batch_foobarbaz--
>>
>> Here is more information: https://developers.google.com/gmail/api/guides/batch
>>
>> We can, of course, come up with our own approach and not rely on
>> Google's solutions. I refer to Google because I have experience with
>> their API design principles. I would like this API to be permanent,
>> with no need to introduce breaking changes. It is always possible
>> to add filtering and endpoints later, but removing them is a breaking
>> change.
>>
>> Kubernetes doesn't provide a batch API, and users build their
>> solutions without it.
>>
>> I also looked at how Microsoft implements this.
>> https://docs.microsoft.com/en-us/graph/json-batching
>>
>> I will quote only a fragment that can be valuable to us. It
>> convinces me that if we don't have a batch API, then we should
>> carefully add the filtering option in the query string.
>>
>>>  Bypassing URL length limitations with batching
>>>
>>>  An additional use case for JSON batching is to bypass URL length limitations. In cases
>>>  where the filter clause is complex, the URL length might surpass limitations built into
>>>  browsers or other HTTP clients. You can use
>>>  JSON batching as a workaround for running these requests because the
>>>  lengthy URL simply becomes part of the request payload.
>>
>>
>>
>>
>>
>>>
>>>>
>>>>  (The /list suffix isn't needed here as there is no POST for
>>>>  taskInstances, but there is in other places, so making it explicit means
>>>>  we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
>>>>
>>>>
>>>
>>>>
>>>>  If it is valid to pass just one of ExecutionDateStartRange and
>>>>  ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
>>>>  Greater-than-or-equal) so it makes more sense when a single parameter is
>>>>  provided.
>>>>
>>>>  +1
>>>
>>>
>>>
>>>>  A minor style point: The filter/query string fields are all defined as
>>>>  BumpyCase, but I'd prefer us to use snake_case (`execution_date_gte`
>>>>  etc) - the only sort of applicable proposal I can find is
>>>>  https://aip.dev/140
>>>>
>>>>
>>>  Yeah. In this case I also like snakes better than Bumpys or Camels.
>>>  Especially python ;).
>>>
>>>
>>>>  (As in I can't find anything in Google recs that talks about the casing
>>>>  of query string/filter params. I guess they suggest having it all
>>>>  as a single `?filter=` param :- https://aip.dev/160 )
>>>>
>>>
>>>  Yeah. I think I'd also want to understand better how filter examples
>>>  would look. It's not obvious from the spec (no time to dig deeply).
>>>
>>>  J.

Re: API spec questions

Posted by Ash Berlin-Taylor <as...@apache.org>.
Such a batch endpoint is much, much harder for API clients to build requests for and consume (you can no longer just use cURL/requests/any HTTP client), so I'm not a fan of that.
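As a rough illustration of the gap in client effort (the base URL and
batch endpoint are hypothetical; the multipart body must be assembled
by hand because common clients such as "requests" do not emit
application/http parts):

    import requests

    BASE = "https://airflow.example.com/api/v1"  # hypothetical

    # Plain REST: a one-liner with any HTTP client.
    task_instances = requests.get(f"{BASE}/dags/DAG_A/taskInstances").json()

    # multipart/mixed batch: the body has to be built manually.
    boundary = "batch_foobarbaz"
    body = (
        f"--{boundary}\r\n"
        "Content-Type: application/http\r\n\r\n"
        "GET /api/v1/dags/DAG_A/taskInstances HTTP/1.1\r\n\r\n"
        f"--{boundary}--\r\n"
    )
    requests.post(
        f"{BASE}/batch",  # hypothetical
        data=body,
        headers={"Content-Type": f"multipart/mixed; boundary={boundary}"},
    )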

On 12 May 2020 17:07:12 BST, "Kamil Breguła" <ka...@polidea.com> wrote:
>On Tue, May 12, 2020 at 3:49 PM Jarek Potiuk <Ja...@polidea.com>
>wrote:
>>
>> My 3 cents:
>>
>>
>> > But on reading Google's https://aip.dev/159 that now makes more
>sense,
>> > and that isn't what you were suggesting, but instead a single, literal
>> > `-` to mean "any dag id". Is this correct?
>> >
>>
>> That's also my understanding.
>>
>>
>> > So I think then my only ask is that we have a `dag_ids` parameter
>that
>> > we can use to filter on :)
>> >
>>
>> Yep. This is definitely needed.
>>
>> Additionally/related when selecting across multiple DAGs we would
>very
>> > quickly run into what was fixed in
>github.com/apache/airflow/pull/7364
>> > -- where asking for all the task stats for the DAGs on a single
>page
>> > exceeded the HTTP request line limit.
>> >
>> > i.e. this could break if the DAG ids are large, or if there is a
>large
>> > number of dags to be fetched.
>> >
>> >   GET
>> >
>/dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
>> >
>> > How about this as an additional endpoint:
>> >
>> >   POST /dags/-/dagRuns/2020-01-01/taskInstances/list
>> >
>> >   dag_ids=...&dag_ids=...&dag_ids...
>> >
>>
>> Very sensible proposal. The URL max length very much depends on the
>client
>> (especially when it is run from a browser), whereas POST is virtually
>> unlimited. It's a common thing to use POST in this case.
>>
>
>I'm not sure ... I think the following conventions are most commonly
>used
>GET /items/ - fetch list of items
>POST /items/ - create item
>
>Google recommends using the batch API endpoint in these cases. This
>involves sending a single HTTP request using the POST method, but the
>request body has many HTTP requests for multiple collections.
>
>POST /batch/v1 HTTP/1.1
>Authorization: Bearer your_auth_token
>Content-Type: multipart/mixed; boundary=batch_foobarbaz
>Content-Length: total_content_length
>
>--batch_foobarbaz
>Content-Type: application/http
>Content-ID: <it...@barnyard.example.com>
>
>GET /dag/DAG_A/taskInstance/
>
>--batch_foobarbaz
>Content-Type: application/http
>
>GET /dag/DAG_B/taskInstance/
>
>--batch_foobarbaz
>Content-Type: application/http
>
>GET /dag/DAG_B/taskInstance/
>
>--batch_foobarbaz--
>
>Here is more information:
>https://developers.google.com/gmail/api/guides/batch
>
>We can, of course, come up with our own approach and not rely on
>Google's solutions. I refer to Google because I have experience with
>their API design principles. I would like this API to be permanent,
>with no need to introduce breaking changes. It is always possible
>to add filtering and endpoints later, but removing them is a breaking
>change.
>
>Kubernetes doesn't provide a batch API, and users build their
>solutions without it.
>
>I also looked at how Microsoft implements this.
>https://docs.microsoft.com/en-us/graph/json-batching
>
>I will quote only a fragment that can be valuable to us. It
>convinces me that if we don't have a batch API, then we should
>carefully add the filtering option in the query string.
>
>> Bypassing URL length limitations with batching
>>
>> An additional use case for JSON batching is to bypass URL length
>limitations. In cases
>> where the filter clause is complex, the URL length might surpass
>limitations built into
>> browsers or other HTTP clients. You can use
>> JSON batching as a workaround for running these requests because the
>> lengthy URL simply becomes part of the request payload.
>
>
>
>
>>
>> >
>> > (The /list suffix isn't needed here as there is no POST for
>> > taskInstances, but there is in other places, so making it explicit
>means
>> > we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
>> >
>> >
>>
>> >
>> > If it is valid to pass just one of ExecutionDateStartRange and
>> > ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
>> > Greater-than-or-equal) so it makes more sense when a single
>parameter is
>> > provided.
>> >
>> > +1
>>
>>
>> > A minor style point: The filter/query string fields are all defined
>as
>> > BumpyCase, but I'd prefer us to use snake_case
>(`execution_date_gte`
>> > etc) - the only sort of applicable proposal I can find is
>> > https://aip.dev/140
>> >
>> >
>> Yeah. In this case I also like snakes better than Bumpys or Camels.
>> Especially python ;).
>>
>>
>> > (As in I can't find anything in Google recs that talks about the
>> > casing of query string/filter params. I guess they suggest having it all
>> > as a single `?filter=` param :- https://aip.dev/160 )
>> >
>>
>> Yeah. I think I'd also want to understand better how filter examples
>> would look. It's not obvious from the spec (no time to dig
>deeply).
>>
>> J.

Re: API spec questions

Posted by Kamil Breguła <ka...@polidea.com>.
On Tue, May 12, 2020 at 3:49 PM Jarek Potiuk <Ja...@polidea.com> wrote:
>
> My 3 cents:
>
>
> > But on reading Google's https://aip.dev/159 that now makes more sense,
> > and that isn't what you were suggesting, but instead a single, literal
> > `-` to mean "any dag id". Is this correct?
> >
>
> That's also my understanding.
>
>
> > So I think then my only ask is that we have a `dag_ids` parameter that
> > we can use to filter on :)
> >
>
> Yep. This is definitely needed.
>
> Additionally/related when selecting across multiple DAGs we would very
> > quickly run into what was fixed in github.com/apache/airflow/pull/7364
> > -- where asking for all the task stats for the DAGs on a single page
> > exceeded the HTTP request line limit.
> >
> > i.e. this could break if the DAG ids are large, or if there is a large
> > number of dags to be fetched.
> >
> >   GET
> > /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
> >
> > How about this as an additional endpoint:
> >
> >   POST /dags/-/dagRuns/2020-01-01/taskInstances/list
> >
> >   dag_ids=...&dag_ids=...&dag_ids...
> >
>
> Very sensible proposal. The URL max length very much depends on the client
> (especially when it is run from a browser), whereas POST is virtually
> unlimited. It's a common thing to use POST in this case.
>

I'm not sure ... I think the following conventions are most commonly used
GET /items/ - fetch list of items
POST /items/ - create item

Google recommends using the batch API endpoint in these cases. This
involves sending a single HTTP request using the POST method, but the
request body has many HTTP requests for multiple collections.

POST /batch/v1 HTTP/1.1
Authorization: Bearer your_auth_token
Content-Type: multipart/mixed; boundary=batch_foobarbaz
Content-Length: total_content_length

--batch_foobarbaz
Content-Type: application/http
Content-ID: <it...@barnyard.example.com>

GET /dag/DAG_A/taskInstance/

--batch_foobarbaz
Content-Type: application/http

GET /dag/DAG_B/taskInstance/

--batch_foobarbaz
Content-Type: application/http

GET /dag/DAG_B/taskInstance/

--batch_foobarbaz--

Here is more information: https://developers.google.com/gmail/api/guides/batch

We can, of course, come up with our own approach and not rely on
Google's solutions. I refer to Google because I have experience with
their API design principles. I would like this API to be permanent,
with no need to introduce breaking changes. It is always possible
to add filtering and endpoints later, but removing them is a breaking
change.

Kubernetes doesn't provide a batch API, and users build their
solutions without it.

I also looked at how Microsoft implements this.
https://docs.microsoft.com/en-us/graph/json-batching

I will quote only a fragment that can be valuable to us. It
convinces me that if we don't have a batch API, then we should
carefully add the filtering option in the query string.

> Bypassing URL length limitations with batching
>
> An additional use case for JSON batching is to bypass URL length limitations. In cases
> where the filter clause is complex, the URL length might surpass limitations built into
> browsers or other HTTP clients. You can use
> JSON batching as a workaround for running these requests because the
> lengthy URL simply becomes part of the request payload.




>
> >
> > (The /list suffix isn't needed here as there is no POST for
> > taskInstances, but there is in other places, so making it explicit means
> > we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
> >
> >
>
> >
> > If it is valid to pass just one of ExecutionDateStartRange and
> > ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
> > Greater-than-or-equal) so it makes more sense when a single parameter is
> > provided.
> >
> > +1
>
>
> > A minor style point: The filter/query string fields are all defined as
> > BumpyCase, but I'd prefer us to use snake_case (`execution_date_gte`
> > etc) - the only sort of applicable proposal I can find is
> > https://aip.dev/140
> >
> >
> Yeah. In this case I also like snakes better than Bumpys or Camels.
> Especially python ;).
>
>
> > (As in I can't find anything in Google recs that talks about the casing
> > of query string/filter params. I guess they suggest having it all
> > as a single `?filter=` param :- https://aip.dev/160 )
> >
>
> Yeah. I think I'd also want to understand better how filter examples
> would look. It's not obvious from the spec (no time to dig deeply).
>
> J.

Re: API spec questions

Posted by Jarek Potiuk <Ja...@polidea.com>.
My 3 cents:


> But on reading Google's https://aip.dev/159 that now makes more sense,
> and that isn't what you were suggesting, but instead a single, literal
> `-` to mean "any dag id". Is this correct?
>

That's also my understanding.


> So I think then my only ask is that we have a `dag_ids` parameter that
> we can use to filter on :)
>

Yep. This is definitely needed.

Additionally/related when selecting across multiple DAGs we would very
> quickly run into what was fixed in github.com/apache/airflow/pull/7364
> -- where asking for all the task stats for the DAGs on a single page
> exceeded the HTTP request line limit.
>
> i.e. this could break if the DAG ids are large, or if there is a large
> number of dags to be fetched.
>
>   GET
> /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...
>
> How about this as an additional endpoint:
>
>   POST /dags/-/dagRuns/2020-01-01/taskInstances/list
>
>   dag_ids=...&dag_ids=...&dag_ids...
>

Very sensible proposal. The URL max length very much depends on the client
(especially when it is run from a browser), whereas POST is virtually
unlimited. It's a common thing to use POST in this case.
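A quick sketch of how fast such a URL grows (the host and DAG ids are
placeholders):

    import requests

    dag_ids = [f"very_long_dag_name_{i}" for i in range(500)]
    prepared = requests.Request(
        "GET",
        "https://airflow.example.com/api/v1/dags/-/dagRuns/2020-01-01T00/taskInstances",
        params={"dag_ids": dag_ids},
    ).prepare()
    # Repeated dag_ids parameters push this well past the ~8 KB
    # request-line limit configured in many servers.
    print(len(prepared.url))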


>
> (The /list suffix isn't needed here as there is no POST for
> taskInstances, but there is in other places, so making it explicit means
> we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
>
>

>
> If it is valid to pass just one of ExecutionDateStartRange and
> ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
> Greater-than-or-equal) so it makes more sense when a single parameter is
> provided.
>
> +1


> A minor style point: The filter/query string fields are all defined as
> BumpyCase, but I'd prefer us to use snake_case (`execution_date_gte`
> etc) - the only sort of applicable proposal I can find is
> https://aip.dev/140
>
>
Yeah. In this case I also like snakes better than Bumpys or Camels.
Especially python ;).


> (As in I can't find anything in Google recs that talks about the casing
> of query string/filter params. I guess they suggest having it all
> as a single `?filter=` param :- https://aip.dev/160 )
>

Yeah. I think I'd also want to understand better how filter examples
would look. It's not obvious from the spec (no time to dig deeply).

J.

Re: API spec questions

Posted by Ash Berlin-Taylor <as...@apache.org>.
Hi Kamil,

Thanks for re-opening this discussion!

My initial complaint about "putting multiple values in a single path
element" is that it feels messy, and confusing, and very unlike any REST
API I've consumed or written before.

However I've just realised that I may have _wildly_ misunderstood the proposal.

I (somehow!) thought you were suggesting this:

  GET /dags/dag_id1-dag_id2-dag_id3/dagRuns/2020-01-01T00/taskInstances

But on reading Google's https://aip.dev/159 that now makes more sense,
and that isn't what you were suggesting, but instead a single, literal
`-` to mean "any dag id". Is this correct?

So I think then my only ask is that we have a `dag_ids` parameter that
we can use to filter on :)




Additionally/related when selecting across multiple DAGs we would very
quickly run into what was fixed in github.com/apache/airflow/pull/7364
-- where asking for all the task stats for the DAGs on a single page
exceeded the HTTP request line limit.

i.e. this could break if the DAG ids are large, or if there is a large
number of dags to be fetched.

  GET /dags/-/dagRuns/2020-01-01T00/taskInstances?dag_ids=...&dag_ids=...&dag_ids=...

How about this as an additional endpoint:

  POST /dags/-/dagRuns/2020-01-01/taskInstances/list

  dag_ids=...&dag_ids=...&dag_ids...


(The /list suffix isn't needed here as there is no POST for
taskInstances, but there is in other places, so making it explicit means
we can use the same pattern for, say `POST /dags/-/dagRuns/list`)
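For illustration, a sketch of how a client would call such a list
action (the URL follows the proposal above, not a released API);
form-encoding a Python list under one key yields exactly the repeated
dag_ids fields shown earlier:

    import requests

    requests.post(
        "https://airflow.example.com/api/v1/dags/-/dagRuns/2020-01-01/taskInstances/list",
        data={"dag_ids": ["DAG_A", "DAG_B", "DAG_C"]},
    )  # body: dag_ids=DAG_A&dag_ids=DAG_B&dag_ids=DAG_C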




If it is valid to pass just one of ExecutionDateStartRange and
ExecutionDateEndRange could we call it ExecutionDateGTE etc. (for
Greater-than-or-equal) so it makes more sense when a single parameter is provided.




A minor style point: The filter/query string fields are all defined as
BumpyCase, but I'd prefer us to use snake_case (`execution_date_gte`
etc) - the only sort of applicable proposal I can find is https://aip.dev/140

(As in I can't find anything in Google recs that talks about the casing
of query string/filter params. I guess they suggest having it all
as a single `?filter=` param :- https://aip.dev/160 )
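For illustration, the snake_case style would look like this from a
client (parameter names follow the proposal above, with
execution_date_lte as the assumed counterpart; none of this is a
finalized spec):

    import requests

    requests.get(
        "https://airflow.example.com/api/v1/dags/my_dag/dagRuns",
        params={
            "execution_date_gte": "2020-01-01T00:00:00+00:00",
            "execution_date_lte": "2020-01-31T23:59:59+00:00",
        },
    )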


-ash



On May 12 2020, at 12:12 am, Jarek Potiuk <Ja...@polidea.com> wrote:

> While I understand the benefits of the "clean"  "REST" API where resources
> are identified by a nice hierarchy, I do not think it is necessary to
> follow it for this kind of queries.I think the "resources" based approach
> indeed breaks down when you want to make more complex queries and want to
> do it efficiently, without running many API calls or returning superfluous
> information. I think that's the reason where GraphQL shines compared to
> REST-like APIs. I think in this case "purity" impacts performance a lot.
>  
> I know we did not choose GraphQL (and for good reasons - its
> complexity is
> not needed in our case) - but I think there are cases where we want
> efficiency and we want to limit the number of requests or superfluous
> information retrieved.
>  
> But looking at the balance between purity and efficiency for asking about
> "no matter which dag" or "no matter which executionDate"  - I think the
> AIP-159 proposal strikes a good balance. It's pretty natural, easy to read
> and does not break the structure.
>  
> I would not use "*" because it can be interpreted differently when
> URL-encoded/decoded, because it is a reserved character
> (https://tools.ietf.org/html/rfc3986#section-2.2) - and some languages
> (apparently PHP
> <https://stackoverflow.com/questions/6533561/urlencode-the-asterisk-star-character>
> might encode it accidentally). I'd use one of the
> unreserved ones (https://tools.ietf.org/html/rfc3986#section-2.3).
> If - (dash) is out of the question due to compatibility, I'd use ~ (tilde).
>  
> So if we only want to focus on this kind of query, I think option II. (but
> with ~) is good for me.
>  
> I have only one doubt about it. Do we want to handle other queries to
> taskInstances across dag_id or execution_date? For example
> corresponding to
> DAG_ID IN (DAG_1, DAG_2, DAG_3) or DAG_ID LIKE ("DAG_%") from typical SQL
> queries? I think that might be a useful type of query, even more so for
> executionDate. From what I see we do not have filters for those, but I
> wonder if this is not something we could do with additional filters and
> still keep the nice structure?
>  
> For example:
>  
> /dags/-/dagRuns/{execution_date}/taskInstances?dag_id=DAG_1,DAG_2,DAG_3
>  
> Not sure if this is something we even want to follow?
>  
> Ash - what are your thoughts about it? Why do you have doubts about the
> AIP159 approach? What problems do you foresee with it?
>  
> J.
>  
>  
>  
> On Mon, May 11, 2020 at 7:37 PM QP Hou <qp...@scribd.com> wrote:
>  
>> I would recommend option III.
>>  
>> IMHO, hierarchy-based endpoints are for CRUD operations on a single
>> entity. The flat endpoint, like /taskinstance, is for read-only
>> multi-facet search queries.
>>  
>> For example:
>>  
>> * to create a resource Foo: "POST /api/parent/{parent_id}/foo"
>> * to update a resource Foo: "PUT /api/parent/{parent_id}/foo/{foo_id}"
>> * to get a single Foo entity: "GET /api/parent/{parent_id}/foo/{foo_id}"
>> * to search for multiple Foo entities using multiple filters: "GET
>> /api/foo?parent_id=id1,id2,id3&owner=user1"
>>  
>> On Mon, May 11, 2020 at 5:08 AM Kamil Breguła <ka...@polidea.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > I apologize for my very negative behavior. I misunderstood the rules.
>> > I updated the specification based on the discussion. I hope that
>> now I
>> > have not missed any suggestions that I had to make. I tried to make
>> > every change in a separate commit, so you can review the changes one
>> > by one.
>> >
>> > Here is new PR:  https://github.com/apache/airflow/pull/8721
>> >
>> > Short summary:
>> > * Changed type for DAGID, DAGRunID, TaskID. Now we use string instead
>> > of an integer.
>> > * EventLog collection is read-only
>> > * Used ExecutionDate in DagRuns endpoints
>> > * Used custom action to control task instances - Removed PATCH
>> > /dags/{dag_id}/taskInstances/{task_id}/{execution_date} endpoint and
>> > added PATCH /dags/{dag_id}/clearTaskInstances.
>> > * Added endpoint - POST /dagRuns "Trigger a DAG Run"
>> > * Added filter parameters to GET /dags/{dag_id}/taskInstances.
>> > * Added filter parameters to GET /dags/{dag_id}/dagRuns
>> > * Removed DELETE /taskInstances endpoint
>> > * The connection ID is a unique key that identifies the connection
>> > * Moved TaskInstances resource under DagRuns resource
>> > * Fixed many typos
>> >
>> > There is one more topic for discussion - reading resources across
>> > multiple collections.
>> > In other words - how do I retrieve task instances for multiple DAGs?
>> > We have several solutions.
>> > I) Currently, the endpoint that receives a list of task instances is
>> > available at /dags/{dag_id}/dagRuns/{execution_date}/taskInstances.
>> > This endpoint supports reading resources across multiple DAGs by
>> > specifying a "-" as a dag_id or an execution_date.
>> > This is based on Google recommendations - AIP159 [1]. I relied on
>> > these recommendations because it is the most comprehensive study on
>> > API design principles. Ash Berlin-Taylor rightly pointed out that this
>> > would be a backward-incompatible change.
>> > II) We can use a different character that will have the same role -
>> > '*'. This character cannot be in Dag/Task ID, so it's safe.
>> > III) Ash proposed to add a separate endpoint that will not include the
>> > DAG ID in the address - /taskInstances
>> >
>> > If we want to choose one solution then I think it's worth looking at
>> > what the endpoints for DAGs look like.
>> > /dags
>> > /dags/{dag_id}
>> > /dags/{dag_id}/clearTaskInstances
>> > /dags/{dag_id}/dagRuns
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
>> > /dags/{dag_id}/dagRuns/{execution_date}
>> > /dags/{dag_id}/structure
>> > /dags/{dag_id}/tasks
>> > /dags/{dag_id}/tasks/{task_id}
>> > In my opinion, this is a clear hierarchy of resources that will
>> > facilitate the use of the API. The third solution disturbs this
>> > hierarchy, and the endpoints that are used to retrieve lists of
>> > elements will be at the highest level.
>> > We will have the following endpoints:
>> > /dags
>> > /dags/{dag_id}
>> > /dags/{dag_id}/clearTaskInstances
>> > /dagRuns
>> > /dags/{dag_id}/dagRuns
>> > /taskInstances
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
>> > /xcomEntries
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
>> > /dags/{dag_id}/dagRuns/{execution_date}
>> > /dags/{dag_id}/structure
>> > /dags/{dag_id}/tasks
>> > /dags/{dag_id}/tasks/{task_id}
>> >
>> > IV) Some endpoints will have similar behavior, so we can delete them.
>> > Then we will have the following list of endpoints.
>> > /dags
>> > /dags/{dag_id}
>> > /dags/{dag_id}/clearTaskInstances
>> > /dagRuns
>> > /taskInstances
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
>> > /xcomEntries
>> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
>> > /dags/{dag_id}/dagRuns/{execution_date}
>> > /dags/{dag_id}/structure
>> > /dags/{dag_id}/tasks
>> > /dags/{dag_id}/tasks/{task_id}
>> >
>> > Which solution do you like the most? I, II, III, or IV?
>> >
>> > Best regards,
>> > Kamil Breguła
>> >
>> > [1] https://aip.dev/159
>> >
>> > On Wed, Apr 15, 2020 at 8:32 PM Daniel (Daniel Lamblin ) [Data
>> > Infrastructure] <la...@coupang.com> wrote:
>> > >
>> > > I'm glad the general mood is that connection_id should be unique.
>> > > FWIW when I had multiple connections in v1.8.2 with mysql I didn't
>> seem to be getting any randomized load balancing anyway. Then again, maybe
>> random was just 100 selections of 1 over 2.
>> > > There are many other ways to load balance connections, each specific
>> to the service type, so I don't see it as Airflow's place to provide a
>> semi-generic option to do it.
>> > >
>> > > +1 for connection ID being unique.
>> > >
>> > > Pardon Outlook for mangling the links about the ConnectionID, DagID and
>> PoolID being integers in a version of the API.
>> > > Are we past that decision already? I'd expect to use string names.
>> > >
>> > > I'd also asked about DAG run ID, task ID, and finally whether there'd
>> be an endpoint with which to clear tasks, because CRUD operations don't
>> model the interplay of task instance, jobs, and dag run state involved.
>> > > -Daniel
>> > >
>> > > On 4/14/20, 8:49 AM, "Xinbin Huang" <bi...@gmail.com> wrote:
>> > >
>> > >
>> > >
>> > >     +1 on making connection IDs unique. It's confusing to have Airflow
>> handle
>> > >     load balancing here.
>> > >
>> > >     On Tue, Apr 14, 2020, 4:59 AM Kaxil Naik <ka...@gmail.com>
>> wrote:
>> > >
>> > >     > +1 to make connection ids unique
>> > >     >
>> > >     > On Tue, Apr 14, 2020 at 11:59 AM Jarek Potiuk <
>> Jarek.Potiuk@polidea.com>
>> > >     > wrote:
>> > >     >
>> > >     > > I am also supporting converting the connection to be unique.
>> > >     > >
>> > >     > > I've worked with a similar approach a long time ago (10s of years)
>> and it was
>> > >     > > fine then, when we had not yet figured out how to scale
>> client/server
>> > >     > > architecture and we did not have all the nice infrastructure
>> like
>> > >     > > load balancing, cloud services etc. I believe in most of the
>> current
>> > >     > > services/systems load-balancing should be handled by the
>> service itself
>> > >     > or
>> > >     > > some kind of proxy between - not by the client side - in
>> production
>> > >     > > environments.
>> > >     > >
>> > >     > > It's far more robust and might provide much better control
>> (you can
>> > >     > control
>> > >     > > multiple independent clients' connections and do not rely on
>> random
>> > >     > > distribution of connections). There are scenarios that would
>> not work
>> > >     > well
>> > >     > > in this case - for example physical proximity of the workers,
>> current
>> > >     > load
>> > >     > > on different services etc. etc. It is very limiting to
>> rely on
>> this
>> > >     > > feature.
>> > >     > >
>> > >     > > On Tue, Apr 14, 2020 at 11:36 AM Kamil Breguła <
>> > >     > kamil.bregula@polidea.com>
>> > >     > > wrote:
>> > >     > >
>> > >     > > > Hello,
>> > >     > > >
>> > >     > > > This can cause big problems with idempotence. According to
>> RFC-7231,
>> > >     > > > the DELETE method should be idempotent.
>> > >     > > >
>> > >     > > > For example:
>> > >     > > > If you want to delete items with index from 1 to 4, you
>> should send the
>> > >     > > > following requests
>> > >     > > > DELETE /connections/hdfs_default/4
>> > >     > > > DELETE /connections/hdfs_default/3
>> > >     > > > DELETE /connections/hdfs_default/2
>> > >     > > > DELETE /connections/hdfs_default/1
>> > >     > > >
>> > >     > > > However, if the user sends these requests in a different
>> order, they
>> > >     > > > will only delete the first and third items.
>> > >     > > > DELETE /connections/hdfs_default/1
>> > >     > > > DELETE /connections/hdfs_default/2
>> > >     > > > DELETE /connections/hdfs_default/3
>> > >     > > > DELETE /connections/hdfs_default/4
>> > >     > > >
>> > >     > > > If you use asynchronous HTTP clients (popular in Node),
>> the order of
>> > >     > > > requests will not be kept. It will also be a big
>> problem with
>> > >     > > > simultaneous modifications.
>> > >     > > >
>> > >     > > > I am also in favor of abandoning support for many
>> connections. This
>> > >     > > > can be solved at a different layer.
>> > >     > > >
>> > >     > > > Best regards,
>> > >     > > > Kamil
>> > >     > > >
>> > >     > > >
>> > >     > > > On Thu, Apr 9, 2020 at 6:47 PM Ash Berlin-Taylor <
>> ash@apache.org>
>> > >     > wrote:
>> > >     > > > >
>> > >     > > > > They are, but we _can_ make a choice to remove that
>> feature -- it is
>> > >     > > not
>> > >     > > > > widely used and is confusing to many when they
>> stumble on
>> it.
>> > >     > > > >
>> > >     > > > > It's not something we should do lightly, but it is a
>> possibility.
>> > >     > > > >
>> > >     > > > > I think I'm probably leaning towards the "ordinal" concept:
>> > >     > > > >
>> > >     > > > > /connections/hdfs_default -> list of connections with that
>> ID
>> > >     > > > > /connections/hdfs_default/0 first connection of that type
>> > >     > > > >
>> > >     > > > > Something like that.
>> > >     > > > >
>> > >     > > > > On Apr 9 2020, at 2:31 pm, Shaw, Damian P.
>> > >     > > > > <da...@credit-suisse.com> wrote:
>> > >     > > > >
>> > >     > > > > > FYI if you look back at the thread "Re: [2.0 spring
>> cleaning]
>> > >     > Require
>> > >     > > > > > unique conn_id" on 2019-04-14 you can see a message from
>> Kevin Yang
>> > >     > > > > > stating that this random choice of connections is a
>> "feature" used
>> > >     > to
>> > >     > > > > > load balance connections in AirBnB. So users are relying
>> on this
>> > >     > > > behavior.
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > > -----Original Message-----
>> > >     > > > > > From: Daniel (Daniel Lamblin ) [Data Infrastructure]
>> > >     > > > > > <la...@coupang.com>
>> > >     > > > > > Sent: Wednesday, April 8, 2020 20:01
>> > >     > > > > > To: dev@airflow.apache.org
>> > >     > > > > > Subject: Re: API spec questions
>> > >     > > > > >
>> > >     > > > > > Having been bit by accidentally having two connections
>> by the same
>> > >     > > > > > name or conn_id, I'd prefer if were made unique. In my
>> experience
>> > >     > > > > > there's little utility in having multiple
>> connections by
>> the same
>> > >     > > > > > name. Tasks that use a connection do to fairly randomly
>> choose one,
>> > >     > > > > > rather they seem pretty consistent in using the one
>> created
>> > >     > earliest,
>> > >     > > > > > which often has the lower id integer.
>> > >     > > > > >
>> > >     > > > > > Circling back to how this is used by the API, from
>> a user
>> > >     > > perspective,
>> > >     > > > > > the following in path integer fields were ones I'd have
>> expected to
>> > >     > > be
>> > >     > > > > > strings instead:
>> > >     > > > > > ConnectionID
>> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
>> > >     > > > > > DAGID
>> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
>> > >     > > > > > PoolID
>> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901
>> > >     > > > > >
>> > >     > > > > > Though it’s a url-encoding hassle, I also expected that
>> DAGRunID
>> > >     > > would
>> > >     > > > > > be more like the Run ID E.G.
>> > >     > > > > > "scheduled__2020-04-08T23:10:00+00:00",
>> > >     > > > > > "manual__2020-04-08T23:00:56+00:00",
>> > >     > > > > > "manual__2020-04-08T16:24:56.692868+00:00" ,
>> > >     > > > > > "backfill_2020-04-08T22:05:00+00:00" etc
>> > >     > > > > >
>> > >     > > > > > Then TaskID is confusing to me; AFAIK the PK to task
>> instances are
>> > >     > > > > > task_id, dag_id, and execution_date and the api call
>> appears to
>> > >     > align
>> > >     > > > > > with that having the pattern:
>> > >     > > > > > /dags/{dag_id}/taskInstances/{task_id}/{execution_date}:
>> > >     > > > > > But if task_id is a numbered id, then… execution_date
>> isn't even
>> > >     > > > > > needed… I'm thinking it should have been a string.
>> > >     > > > > >
>> > >     > > > > > An aside to this, I've always wondered what happens
>> if an
>> > >     > externally
>> > >     > > > > > triggered DAG Run has the same execution date as a
>> pre-existing
>> > >     > > > > > scheduled DAG Run. They'd have different run_ids, EG
>> > >     > > > > > "scheduled__2020-04-08T23:10:00+00:00" vs
>> > >     > > > > > "manual__2020-04-08T23:10:00+00:00" but then task
>> instances for
>> > >     > those
>> > >     > > > > > runs might not be unique.
>> > >     > > > > >
>> > >     > > > > > Lastly, the UI and CLI operation of clearing tasks seems
>> analogous
>> > >     > to
>> > >     > > > > > the delete task instance API end point. But probably
>> it's not, and
>> > >     > > > > > this could become confusing to users.
>> > >     > > > > > There should be a non-db-model api call for clearing
>> tasks like you
>> > >     > > > > > can from the UI and the CLI. If I read it right,
>> clearing does the
>> > >     > > > > > following: it sets the state of the task instance to
>> None unless
>> > >     > the
>> > >     > > > > > state was Running then instead it sets it and related
>> job ids to
>> > >     > > > > > Shutdown. It deletes reschedules of the TI and it sets
>> the dag runs
>> > >     > > > > > for those task instances back to Running.
>> > >     > > > > > This would be a lot to do for a user using PATCH calls
>> to change
>> > >     > Task
>> > >     > > > > > Instances and Dag Runs together (and there's no API for
>> Jobs).
>> > >     > > > > >
>> > >     > > > > > -Daniel L.
>> > >     > > > > > P.S. as far as renaming all parameters on operators and
>> hooks with
>> > >     > > > > > *_conn_id, I do not want to see that breaking change
>> happen. But IF
>> > >     > > it
>> > >     > > > > > HAS TO, I'm still of the opinion that the default_args
>> use for
>> > >     > > > > > X_conn_id is not preferable to having all operators take
>> simpler
>> > >     > > > > > source_conn_id and optional target_conn_id parameter
>> names that are
>> > >     > > > > > consistent across all operators.
>> > >     > > > > >
>> > >     > > > > > On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <as...@apache.org>
>> wrote:
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >    To expand on the "so I think we need to do one of":
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >    - we need to update the name of "conn_id"
>> somehow. I
>> can't think
>> > >     > > of
>> > >     > > > a
>> > >     > > > > >    better option, and given all the operators have
>> "x_conn_id" I
>> > >     > > don't
>> > >     > > > > >    think that change should be made lightly.
>> > >     > > > > >
>> > >     > > > > >    - make conn_id unique (this "poor" HA has been a
>> source of
>> > >     > > > > > confusion in
>> > >     > > > > >    the past) and the ID we use in the API
>> > >     > > > > >
>> > >     > > > > >    Or a third option:
>> > >     > > > > >
>> > >     > > > > >    - Have the API take conn_id and either return all
>> conns for that
>> > >     > > > > >    conn_id, or conn_id plus an ordinal (?idx=0 type
>> thing) to
>> > >     > return
>> > >     > > a
>> > >     > > > > >    single value.
>> > >     > > > > >
>> > >     > > > > >    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <
>> ash@apache.org>
>> > >     > > > wrote:
>> > >     > > > > >
>> > >     > > > > >    > Hi everyone,
>> > >     > > > > >    >
>> > >     > > > > >    > So as I mentioned in the AIP voting thread, I think
>> we need to
>> > >     > > > give
>> > >     > > > > >    > some more thought to how we are exposing connection
>> ids in the
>> > >     > > > API.
>> > >     > > > > >    >
>> > >     > > > > >    > Right now as proposed (and merged without approval,
>> not cool.
>> > >     > > The
>> > >     > > > AIP
>> > >     > > > > >    > we voted on did not contain a PR against
>> apache/airflow.) it
>> > >     > has
>> > >     > > > an
>> > >     > > > > >    > end point of `/connections/{connection_id} `
>> > >     > > > > >    >
>> > >     > > > > >    > My issue here is as I said in the previous thread:
>> that is
>> > >     > going
>> > >     > > > > > to be
>> > >     > > > > >    > mightly confusing to our users because there
>> is a
>> "conn_id"
>> > >     > > > concept
>> > >     > > > > >    > that is exposed, so people are going to try putting
>> > >     > > "aws_default"
>> > >     > > > etc
>> > >     > > > > >    > in there. I don't care what the API spec says, I
>> care what our
>> > >     > > > users
>> > >     > > > > >    > expect, and having a connection_id/id and a conn_id
>> fields is
>> > >     > > > > > just confusing
>> > >     > > > > >    >
>> > >     > > > > >    > So I think we need to do one of:
>> > >     > > > > >    > - we need to update the name of "conn_id" somehow,
>> make
>> > >     > conn_id
>> > >     > > > unique
>> > >     > > > > >    > (this "poor" HA has been a source of confusion in
>> the past)
>> > >     > > > > >    >
>> > >     > > > > >    > There are similar problems for the DAG run -- the
>> spec has the
>> > >     > > > > > type as
>> > >     > > > > >    > an integer, but everything else about the Airflow
>> system uses
>> > >     > > the
>> > >     > > > > >    > (unique) run_id parameter, and I would expect the
>> API to use
>> > >     > > > > > that. The
>> > >     > > > > >    > autoinc. id on the run column is literally never
>> used in the
>> > >     > > code
>> > >     > > > > >    > base, so exposing that via the API seems odd.
>> > >     > > > > >    >
>> > >     > > > > >    > A few other smaller points:
>> > >     > > > > >    >
>> > >     > > > > >    > EventLog: Those are "audit"/action logs, so we
>> probably
>> > >     > > shouldn't
>> > >     > > > let
>> > >     > > > > >    > people delete them via the API!
>> > >     > > > > >    >
>> > >     > > > > >    > pool_id: still an integer. It should be the "name".
>> > >     > > > > >    >
>> > >     > > > > >    > How should add/delete variables and connections
>> work in the
>> > >     > API
>> > >     > > > with
>> > >     > > > > >    > the addition of the new "Secrets Backends"?
>> > >     > > > > >    >
>> > >     > > > > >    > xcomValues: task_id is listed as an integer.
>> > >     > > > > >    >
>> > >     > > > > >    >
>> > >     > > > > >    > -ash
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > > >
>> > >     > > > >
>> > >     > > >
>> > >     > >
>> > >     >
>> > >     > > > > >
>> > >     > > >
>> > >     > >
>> > >     > >
>> > >     > > --
>> > >     > >
>> > >     > > Jarek Potiuk
>> > >     > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> > >     > >
>> > >     > > M: +48 660 796 129 <+48660796129>
>> > >     > > [image: Polidea] <https://www.polidea.com/>
>> > >     > >
>> > >     >
>> > >
>> > >
>>  
>  
>  
> --  
>  
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>  
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
> 

Re: API spec questions

Posted by Jarek Potiuk <Ja...@polidea.com>.
While I understand the benefits of the "clean"  "REST" API where resources
are identified by a nice hierarchy, I do not think it is necessary to
follow it for this kind of query. I think the "resources" based approach
indeed breaks down when you want to make more complex queries and want to
do it efficiently, without running many API calls or returning superfluous
information. I think that's the reason where GraphQL shines compared to
REST-like APIs. I think in this case "purity" impacts performance a lot.

I know we did not choose GraphQL (and for good reasons - its complexity is
not needed in our case) - but I think there are cases where we want
efficiency and we want to limit the number of requests or superfluous
information retrieved.

But looking at the balance between purity and efficiency for asking about
"no matter which dag" or "no matter which executionDate"  - I think the
AIP-159 proposal strikes a good balance. It's pretty natural, easy to read
and does not break the structure.

I would not use "*" because it can be interpreted differently when
URL-encoded/decoded, because it is a reserved character
(https://tools.ietf.org/html/rfc3986#section-2.2) - and some languages
(apparently PHP
<https://stackoverflow.com/questions/6533561/urlencode-the-asterisk-star-character>
might encode it accidentally). I'd use one of the
unreserved ones (https://tools.ietf.org/html/rfc3986#section-2.3). If -
(dash) is out of the question due to compatibility, I'd use ~ (tilde).

So if we only want to focus on this kind of query, I think option II. (but
with ~) is good for me.
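Python's standard library illustrates the difference between the
candidate characters (behavior shown is for Python 3.7+):

    from urllib.parse import quote

    print(quote("~"))  # '~'   - unreserved, passes through unchanged
    print(quote("-"))  # '-'   - unreserved, passes through unchanged
    print(quote("*"))  # '%2A' - reserved, so clients encode it
                       # inconsistently (e.g. JavaScript's
                       # encodeURIComponent leaves '*' as-is)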

I have only one doubt about it. Do we want to handle other queries to
taskInstances across dag_id or execution_date? For example corresponding to
DAG_ID IN (DAG_1, DAG_2, DAG_3) or DAG_ID LIKE ("DAG_%") from typical SQL
queries? I think that might be a useful type of query, even more so for
executionDate. From what I see we do not have filters for those, but I
wonder if this is not something we could do with additional filters and
still keep the nice structure?

For example:

/dags/-/dagRuns/{execution_date}/taskInstances?dag_id=DAG_1,DAG_2,DAG_3

Not sure if this is something we even want to follow?

Ash - what are your thoughts about it? Why do you have doubts about the
AIP159 approach? What problems do you foresee with it?

J.



On Mon, May 11, 2020 at 7:37 PM QP Hou <qp...@scribd.com> wrote:

> I would recommend option III.
>
> IMHO, hierarchy-based endpoints are for CRUD operations on a single
> entity. The flat endpoint, like /taskinstance, is for read-only
> multi-facet search queries.
>
> For example:
>
> * to create a resource Foo: "POST /api/parent/{parent_id}/foo"
> * to update a resource Foo: "PUT /api/parent/{parent_id}/foo/{foo_id}"
> * to get a single Foo entity: "GET /api/parent/{parent_id}/foo/{foo_id}"
> * to search for multiple Foo entities using multiple filters: "GET
> /api/foo?parent_id=id1,id2,id3&owner=user1"
>
> On Mon, May 11, 2020 at 5:08 AM Kamil Breguła <ka...@polidea.com>
> wrote:
> >
> > Hello,
> >
> > I apologize for my very negative behavior. I misunderstood the rules.
> > I updated the specification based on the discussion. I hope that now I
> > have not missed any suggestions that I had to make. I tried to make
> > every change in a separate commit, so you can review the changes one
> > by one.
> >
> > Here is new PR:  https://github.com/apache/airflow/pull/8721
> >
> > Short summary:
> > * Changed type for DAGID, DAGRunID, TaskID. Now we use string instead
> > of an integer.
> > * EventLog collection is read-only
> > * Used ExecutionDate in DagRuns endpoints
> > * Used custom action to control task instances - Removed PATCH
> > /dags/{dag_id}/taskInstances/{task_id}/{execution_date} endpoint and
> > added PATCH /dags/{dag_id}/clearTaskInstances.
> > * Added endpoint - POST /dagRuns "Trigger a DAG Run"
> > * Added filter parameters to GET /dags/{dag_id}/taskInstances.
> > * Added filter parameters to GET /dags/{dag_id}/dagRuns
> > * Removed DELETE /taskInstances endpoint
> > * The connection ID is a unique key that identifies the connection
> > * Moved TaskInstances resource under DagRuns resource
> > * Fixed many typos
> >
> > There is one more topic for discussion - reading resources across
> > multiple collections.
> > In other words - how do I retrieve task instances for multiple DAGs?
> > We have several solutions.
> > I) Currently, the endpoint that receives a list of task instances is
> > available at /dags/{dag_id}/dagRuns/{execution_date}/taskInstances.
> > This endpoint supports reading resources across multiple DAGs by
> > specifying a "-" as a dag_id or an execution_date.
> > This is based on Google recommendations - AIP159 [1]. I relied on
> > these recommendations because it is the most comprehensive study on
> > API design principles. Ash Berlin-Taylor rightly pointed out that this
> > would be a backward-incompatible change.
> > II) We can use a different character that will have the same role -
> > '*'. This character cannot be in Dag/Task ID, so it's safe.
> > III) Ash proposed to add a separate endpoint that will not include the
> > DAG ID in the address - /taskInstances
> >
> > If we want to choose one solution then I think it's worth looking at
> > what the endpoints for DAGs look like.
> > /dags
> > /dags/{dag_id}
> > /dags/{dag_id}/clearTaskInstances
> > /dags/{dag_id}/dagRuns
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
> > /dags/{dag_id}/dagRuns/{execution_date}
> > /dags/{dag_id}/structure
> > /dags/{dag_id}/tasks
> > /dags/{dag_id}/tasks/{task_id}
> > In my opinion, this is a clear hierarchy of resources that will
> > facilitate the use of the API. The third solution disturbs this
> > hierarchy, and the endpoints that are used to retrieve lists of
> > elements will be at the highest level.
> > We will have the following endpoints:
> > /dags
> > /dags/{dag_id}
> > /dags/{dag_id}/clearTaskInstances
> > /dagRuns
> > /dags/{dag_id}/dagRuns
> > /taskInstances
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
> > /xcomEntries
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
> > /dags/{dag_id}/dagRuns/{execution_date}
> > /dags/{dag_id}/structure
> > /dags/{dag_id}/tasks
> > /dags/{dag_id}/tasks/{task_id}
> >
> > IV) Some endpoints will have similar behavior, so we can delete them.
> > Then we will have the following list of endpoints.
> > /dags
> > /dags/{dag_id}
> > /dags/{dag_id}/clearTaskInstances
> > /dagRuns
> > /taskInstances
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/links
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/logs/{task_try_number}
> > /xcomEntries
> > /dags/{dag_id}/dagRuns/{execution_date}/taskInstances/{task_id}/xcomEntries/{key}
> > /dags/{dag_id}/dagRuns/{execution_date}
> > /dags/{dag_id}/structure
> > /dags/{dag_id}/tasks
> > /dags/{dag_id}/tasks/{task_id}
> >
> > Which solution do you like the most? I, II, III, or IV?
> >
> > Best regards,
> > Kamil Breguła
> >
> > [1] https://aip.dev/159
> >
> > On Wed, Apr 15, 2020 at 8:32 PM Daniel (Daniel Lamblin ) [Data
> > Infrastructure] <la...@coupang.com> wrote:
> > >
> > > I'm glad the general mood is that connection_id should be unique.
> > > FWIW, when I had multiple connections in v1.8.2 with MySQL, I didn't
> > > seem to be getting any randomized load balancing anyway. Then again,
> > > maybe random was just 100 selections of 1 over 2.
> > > There are many other ways to load balance connections, each specific
> > > to the service type, so I don't see it as Airflow's place to provide
> > > a semi-generic option to do it.
> > >
> > > +1 for connection ID being unique.
> > >
> > > Pardon Outlook for mangling the links about the ConnectionID, DagID,
> > > and PoolID being integers in a version of the API.
> > > Are we past that decision already? I'd expect to use string names.
> > >
> > > I'd also asked about the DAG run ID and task ID, and finally whether
> > > there'd be an endpoint with which to clear tasks, because CRUD
> > > operations don't model the interplay of task instance, job, and DAG
> > > run state involved.
> > > -Daniel
> > >
> > > On 4/14/20, 8:49 AM, "Xinbin Huang" <bi...@gmail.com> wrote:
> > >
> > >     +1 on making connection IDs unique. It's confusing to have
> > >     Airflow handle load balancing here.
> > >
> > >     On Tue, Apr 14, 2020, 4:59 AM Kaxil Naik <ka...@gmail.com> wrote:
> > >
> > >     > +1 to make connection ids unique
> > >     >
> > >     > On Tue, Apr 14, 2020 at 11:59 AM Jarek Potiuk <Jarek.Potiuk@polidea.com> wrote:
> > >     >
> > >     > > I also support making the connection ID unique.
> > >     > >
> > >     > > I've worked with a similar approach a long time ago (tens of
> > >     > > years), and it was fine then, when we had not yet figured out
> > >     > > how to scale client/server architectures and did not have all
> > >     > > the nice infrastructure like load balancing, cloud services,
> > >     > > etc. I believe that in most current services/systems,
> > >     > > load balancing should be handled by the service itself or by
> > >     > > some kind of proxy in between - not by the client side - in
> > >     > > production environments.
> > >     > >
> > >     > > It's far more robust and might provide much better control
> > >     > > (you can control multiple independent clients' connections and
> > >     > > do not rely on a random distribution of connections). There are
> > >     > > scenarios that would not work well in this case - for example,
> > >     > > physical proximity of the workers, current load on different
> > >     > > services, etc. It is very limiting to rely on this feature.
> > >     > >
> > >     > > On Tue, Apr 14, 2020 at 11:36 AM Kamil Breguła <kamil.bregula@polidea.com> wrote:
> > >     > >
> > >     > > > Hello,
> > >     > > >
> > >     > > > This can cause big problems with idempotence. According to
> > >     > > > RFC 7231, the DELETE method should be idempotent.
> > >     > > >
> > >     > > > For example:
> > >     > > > If you want to delete the items with indexes 1 to 4, you
> > >     > > > should send the following requests:
> > >     > > > DELETE /connections/hdfs_default/4
> > >     > > > DELETE /connections/hdfs_default/3
> > >     > > > DELETE /connections/hdfs_default/2
> > >     > > > DELETE /connections/hdfs_default/1
> > >     > > >
> > >     > > > However, if the user sends these requests in a different
> > >     > > > order, they will only delete the first and third items:
> > >     > > > DELETE /connections/hdfs_default/1
> > >     > > > DELETE /connections/hdfs_default/2
> > >     > > > DELETE /connections/hdfs_default/3
> > >     > > > DELETE /connections/hdfs_default/4
> > >     > > >
> > >     > > > If you use asynchronous HTTP clients (popular in Node), the
> > >     > > > order of requests will not be preserved. Simultaneous
> > >     > > > modifications will also be a big problem.
> > >     > > >
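To illustrate the difference, a minimal, self-contained sketch in plain
Python (no HTTP; the connection values and helper names are made up for
illustration):

# Deleting by ordinal: each DELETE shifts the remaining items, so the
# outcome depends on the order in which the requests arrive.
connections = ["conn_a", "conn_b", "conn_c", "conn_d"]  # ordinals 0..3

def delete_by_ordinal(items, idx):
    # models DELETE /connections/hdfs_default/{idx}
    if 0 <= idx < len(items):
        del items[idx]

by_ordinal = list(connections)
for idx in (0, 1, 2, 3):  # ascending, as an async client might send them
    delete_by_ordinal(by_ordinal, idx)
print(by_ordinal)  # ['conn_b', 'conn_d'] - only the first and third deleted

# Deleting by unique key: safe to repeat and to reorder (idempotent).
def delete_by_key(items, key):
    # models DELETE /connections/{conn_id} with a unique conn_id
    return [item for item in items if item != key]

by_key = list(connections)
for key in ("conn_a", "conn_c", "conn_c"):  # a duplicated request is harmless
    by_key = delete_by_key(by_key, key)
print(by_key)  # ['conn_b', 'conn_d'] in any request order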
> > >     > > > I am also in favor of abandoning support for multiple
> > >     > > > connections per conn_id. This can be solved at a different
> > >     > > > layer.
> > >     > > >
> > >     > > > Best regards,
> > >     > > > Kamil
> > >     > > >
> > >     > > >
> > >     > > > On Thu, Apr 9, 2020 at 6:47 PM Ash Berlin-Taylor <ash@apache.org> wrote:
> > >     > > > >
> > >     > > > > They are, but we _can_ make a choice to remove that
> > >     > > > > feature -- it is not widely used and is confusing to many
> > >     > > > > when they stumble on it.
> > >     > > > >
> > >     > > > > It's not something we should do lightly, but it is a
> > >     > > > > possibility.
> > >     > > > >
> > >     > > > > I think I'm probably leaning towards the "ordinal" concept:
> > >     > > > >
> > >     > > > > /connections/hdfs_default -> list of connections with that ID
> > >     > > > > /connections/hdfs_default/0 -> the first connection with that ID
> > >     > > > >
> > >     > > > > Something like that.
> > >     > > > >
> > >     > > > > On Apr 9 2020, at 2:31 pm, Shaw, Damian P.
> > >     > > > > <da...@credit-suisse.com> wrote:
> > >     > > > >
> > >     > > > > > FYI, if you look back at the thread "Re: [2.0 spring
> > >     > > > > > cleaning] Require unique conn_id" on 2019-04-14, you can
> > >     > > > > > see a message from Kevin Yang stating that this random
> > >     > > > > > choice of connections is a "feature" used to load balance
> > >     > > > > > connections at Airbnb. So users are relying on this
> > >     > > > > > behavior.
> > >     > > > > >
> > >     > > > > >
> > >     > > > > > -----Original Message-----
> > >     > > > > > From: Daniel (Daniel Lamblin ) [Data Infrastructure]
> > >     > > > > > <la...@coupang.com>
> > >     > > > > > Sent: Wednesday, April 8, 2020 20:01
> > >     > > > > > To: dev@airflow.apache.org
> > >     > > > > > Subject: Re: API spec questions
> > >     > > > > >
> > >     > > > > > Having been bitten by accidentally having two connections
> > >     > > > > > with the same name or conn_id, I'd prefer it were made
> > >     > > > > > unique. In my experience there's little utility in having
> > >     > > > > > multiple connections with the same name. Tasks that use a
> > >     > > > > > connection don't seem to fairly randomly choose one;
> > >     > > > > > rather, they seem pretty consistent in using the one
> > >     > > > > > created earliest, which often has the lower integer id.
> > >     > > > > >
> > >     > > > > > Circling back to how this is used by the API: from a user
> > >     > > > > > perspective, the following in-path integer fields were
> > >     > > > > > ones I'd have expected to be strings instead:
> > >     > > > > > ConnectionID
> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1845
> > >     > > > > > DAGID
> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1853
> > >     > > > > > PoolID
> > >     > > > > > https://github.com/apache/airflow/blob/11f1e0cad996d5596e3e4fb440eb4ec52c024f70/openapi.yaml#L1901
> > >     > > > > >
> > >     > > > > > Though it's a URL-encoding hassle, I also expected that
> > >     > > > > > DAGRunID would be more like the run ID, e.g.
> > >     > > > > > "scheduled__2020-04-08T23:10:00+00:00",
> > >     > > > > > "manual__2020-04-08T23:00:56+00:00",
> > >     > > > > > "manual__2020-04-08T16:24:56.692868+00:00",
> > >     > > > > > "backfill_2020-04-08T22:05:00+00:00", etc.
> > >     > > > > >
> > >     > > > > > Then TaskID is confusing to me; AFAIK the PK for task
> > >     > > > > > instances is (task_id, dag_id, execution_date), and the
> > >     > > > > > API call appears to align with that, having the pattern:
> > >     > > > > > /dags/{dag_id}/taskInstances/{task_id}/{execution_date}
> > >     > > > > > But if task_id is a numbered id, then… execution_date
> > >     > > > > > isn't even needed… I'm thinking it should have been a
> > >     > > > > > string.
> > >     > > > > >
> > >     > > > > > As an aside, I've always wondered what happens if an
> > >     > > > > > externally triggered DAG run has the same execution date
> > >     > > > > > as a pre-existing scheduled DAG run. They'd have different
> > >     > > > > > run_ids, e.g. "scheduled__2020-04-08T23:10:00+00:00" vs
> > >     > > > > > "manual__2020-04-08T23:10:00+00:00", but then the task
> > >     > > > > > instances for those runs might not be unique.
> > >     > > > > >
> > >     > > > > > Lastly, the UI and CLI operation of clearing tasks seems
> > >     > > > > > analogous to the delete task instance API endpoint. But it
> > >     > > > > > probably isn't, and this could become confusing to users.
> > >     > > > > > There should be a non-db-model API call for clearing tasks
> > >     > > > > > like you can from the UI and the CLI. If I read it right,
> > >     > > > > > clearing does the following: it sets the state of the task
> > >     > > > > > instance to None, unless the state was Running, in which
> > >     > > > > > case it sets it and the related job ids to Shutdown. It
> > >     > > > > > deletes reschedules of the TI, and it sets the DAG runs
> > >     > > > > > for those task instances back to Running.
> > >     > > > > > This would be a lot to do for a user using PATCH calls to
> > >     > > > > > change task instances and DAG runs together (and there's
> > >     > > > > > no API for jobs).
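To make that description concrete, here is a minimal, self-contained sketch
of the clearing behavior as described above (the class shapes and state
strings are illustrative assumptions, not Airflow's actual models):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskInstance:
    task_id: str
    state: Optional[str]
    reschedules: List[str] = field(default_factory=list)

@dataclass
class DagRun:
    run_id: str
    state: str

def clear(task_instances: List[TaskInstance], dag_runs: List[DagRun]) -> None:
    for ti in task_instances:
        if ti.state == "running":
            ti.state = "shutdown"  # running tasks (and their jobs) get shut down
        else:
            ti.state = None        # anything else is reset so it can re-run
        ti.reschedules.clear()     # pending reschedules are deleted
    for run in dag_runs:
        run.state = "running"      # the parent DAG runs go back to running

tis = [TaskInstance("extract", "failed"), TaskInstance("load", "running")]
runs = [DagRun("manual__2020-04-08T23:00:56+00:00", "failed")]
clear(tis, runs)
print([ti.state for ti in tis], runs[0].state)  # [None, 'shutdown'] running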
> > >     > > > > >
> > >     > > > > > -Daniel L.
> > >     > > > > > P.S. As far as renaming all parameters on operators and
> > >     > > > > > hooks with *_conn_id goes, I do not want to see that
> > >     > > > > > breaking change happen. But IF it HAS TO, I'm still of
> > >     > > > > > the opinion that the default_args use for X_conn_id is
> > >     > > > > > not preferable to having all operators take simpler
> > >     > > > > > source_conn_id and optional target_conn_id parameter
> > >     > > > > > names that are consistent across all operators.
> > >     > > > > >
> > >     > > > > > On 4/8/20, 9:47 AM, "Ash Berlin-Taylor" <as...@apache.org> wrote:
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >    To expand on the "so I think we need to do one of":
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >    - we need to update the name of "conn_id" somehow. I
> > >     > > > > >    can't think of a better option, and given all the
> > >     > > > > >    operators have "x_conn_id" I don't think that change
> > >     > > > > >    should be made lightly.
> > >     > > > > >
> > >     > > > > >    - make conn_id unique (this "poor" HA has been a
> > >     > > > > >    source of confusion in the past) and the ID we use in
> > >     > > > > >    the API
> > >     > > > > >
> > >     > > > > >    Or a third option:
> > >     > > > > >
> > >     > > > > >    - Have the API take conn_id and either return all
> > >     > > > > >    conns for that conn_id, or conn_id plus an ordinal
> > >     > > > > >    (?idx=0 type thing) to return a single value.
> > >     > > > > >
> > >     > > > > >    On Apr 8 2020, at 5:42 pm, Ash Berlin-Taylor <ash@apache.org> wrote:
> > >     > > > > >
> > >     > > > > >    > Hi everyone,
> > >     > > > > >    >
> > >     > > > > >    > So as I mentioned in the AIP voting thread, I think
> > >     > > > > >    > we need to give some more thought to how we are
> > >     > > > > >    > exposing connection ids in the API.
> > >     > > > > >    >
> > >     > > > > >    > Right now, as proposed (and merged without approval
> > >     > > > > >    > - not cool; the AIP we voted on did not contain a PR
> > >     > > > > >    > against apache/airflow), it has an endpoint of
> > >     > > > > >    > `/connections/{connection_id}`
> > >     > > > > >    >
> > >     > > > > >    > My issue here is, as I said in the previous thread:
> > >     > > > > >    > that is going to be mightily confusing to our users,
> > >     > > > > >    > because there is a "conn_id" concept that is exposed,
> > >     > > > > >    > so people are going to try putting "aws_default" etc.
> > >     > > > > >    > in there. I don't care what the API spec says, I care
> > >     > > > > >    > what our users expect, and having connection_id/id
> > >     > > > > >    > and conn_id fields is just confusing.
> > >     > > > > >    >
> > >     > > > > >    > So I think we need to do one of:
> > >     > > > > >    > - update the name of "conn_id" somehow
> > >     > > > > >    > - make conn_id unique (this "poor" HA has been a
> > >     > > > > >    > source of confusion in the past)
> > >     > > > > >    >
> > >     > > > > >    > There are similar problems for the DAG run -- the
> > >     > > > > >    > spec has the type as an integer, but everything else
> > >     > > > > >    > about the Airflow system uses the (unique) run_id
> > >     > > > > >    > parameter, and I would expect the API to use that.
> > >     > > > > >    > The autoinc. id on the run column is literally never
> > >     > > > > >    > used in the code base, so exposing that via the API
> > >     > > > > >    > seems odd.
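For illustration, a run_id-keyed path could look like this (an assumption
about the shape, not the actual spec):

GET /dags/example_dag/dagRuns/manual__2020-04-08T23:00:56+00:00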
> > >     > > > > >    >
> > >     > > > > >    > A few other smaller points:
> > >     > > > > >    >
> > >     > > > > >    > EventLog: Those are "audit"/action logs, so we
> > >     > > > > >    > probably shouldn't let people delete them via the API!
> > >     > > > > >    >
> > >     > > > > >    > pool_id: still an integer. It should be the "name".
> > >     > > > > >    >
> > >     > > > > >    > How should adding/deleting variables and connections
> > >     > > > > >    > work in the API with the addition of the new
> > >     > > > > >    > "Secrets Backends"?
> > >     > > > > >    >
> > >     > > > > >    > xcomValues: task_id is listed as an integer.
> > >     > > > > >    >
> > >     > > > > >    >
> > >     > > > > >    > -ash
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >
> > >     > > > > >
> > >     > > > >
> > >     > > >
> > >     > >
> > >     >
> > >     > > > > >
> > >     > > >
> > >     > >
> > >     > >
> > >     > > --
> > >     > >
> > >     > > Jarek Potiuk
> > >     > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >     > >
> > >     > > M: +48 660 796 129 <+48660796129>
> > >     > > [image: Polidea] <https://www.polidea.com/>
> > >     > >
> > >     >
> > >
> > >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: API spec questions

Posted by QP Hou <qp...@scribd.com>.
I would recommend option III.

IMHO, hierarchy-based endpoints are for CRUD operations on a single
entity. A flat endpoint, like /taskInstances, is for read-only,
multi-facet search queries.

For example:

* to create a resource Foo: "POST /api/parent/{parent_id}/foo"
* to update a resource Foo: "PUT /api/parent/{parent_id}/foo/{foo_id}"
* to get a single Foo entity: "GET /api/parent/{parent_id}/foo/{foo_id}"
* to search for multiple Foo entities using multiple filters: "GET
/api/foo?parent_id=id1,id2,id3&owner=user1"
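To make the split concrete, a short sketch of how a client might call the
two styles (the base URL, the date encoding, and the filter parameter names
are assumptions for the sketch, not the actual spec):

import requests
from urllib.parse import quote

BASE = "http://localhost:8080/api/v1"  # assumed base URL

# Hierarchical endpoint: operate on a single entity under its parents.
execution_date = quote("2020-05-11T00:00:00+00:00", safe="")
single = requests.get(
    f"{BASE}/dags/example_dag/dagRuns/{execution_date}/taskInstances/extract"
)

# Flat endpoint: read-only search across many parents at once.
matches = requests.get(
    f"{BASE}/taskInstances",
    params={"dag_ids": "dag_a,dag_b,dag_c", "state": "failed", "limit": 100},
)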
