Posted to dev@airflow.apache.org by Manu Zhang <ow...@gmail.com> on 2018/08/29 02:22:44 UTC

WebHdfsSensor doesn't support HDFS HA

Hi all,

We've been happily using WebHdfsSensor to monitor the state of upstream
tasks that write to HDFS, except when there is a namenode switch. I've
opened https://issues.apache.org/jira/browse/AIRFLOW-2901 to discuss
HDFS HA support.

I can see two possible solutions:

1. use pyarrow.hdfs, which has HA support
2. allow the user to configure a list of namenodes

WDYT?

Thanks,
Manu Zhang

Re: WebHdfsSensor doesn't support HDFS HA

Posted by Ben Laird <br...@gmail.com>.
Manu -

This is the relevant code I was referencing before:
https://github.com/apache/incubator-airflow/blob/master/airflow/hooks/webhdfs_hook.py#L54-L71

So support for multiple connections under a given conn_id is already built
into some hooks, but we need a way to set this from the CLI. I'll be creating
a JIRA issue shortly and pushing an update to the CLI for this.

On Thu, Aug 30, 2018 at 2:03 AM Manu Zhang <ow...@gmail.com> wrote:

> Thanks Xiaodong, that works like a charm.
>
> Manu

Re: WebHdfsSensor doesn't support HDFS HA

Posted by Manu Zhang <ow...@gmail.com>.
Thanks Xiaodong, that works like a charm.

Manu

On Thu, Aug 30, 2018 at 11:34 AM Deng Xiaodong <xd...@gmail.com> wrote:

> Hi Manu,
>
> You can set up multiple connections with the same conn_id and different
> hosts, rather than putting all the hosts in one single connection.
>
>
> XD

Re: WebHdfsSensor doesn't support HDFS HA

Posted by Deng Xiaodong <xd...@gmail.com>.
Hi Manu,

You can set up multiple connections with the same conn_id and different
hosts, rather than putting all the hosts in one single connection.
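
To illustrate the difference, here is a minimal sketch that does not use
Airflow itself. `Conn`, `candidates`, the host names, and port 50070 are all
hypothetical stand-ins: the sketch only assumes that a lookup by conn_id
returns every row registered under that id.

```python
from collections import namedtuple

# Hypothetical stand-in for a row in the connections table.
Conn = namedtuple("Conn", ["conn_id", "host", "port"])


def candidates(conn_id, table):
    """Return every row registered under conn_id (assumed lookup behaviour)."""
    return [row for row in table if row.conn_id == conn_id]


# One row per namenode, all sharing the same conn_id.
separate_rows = [
    Conn("webhdfs_default", "nn1.example.com", 50070),
    Conn("webhdfs_default", "nn2.example.com", 50070),
]

# A single row whose host field holds a comma-separated list.
single_row = [
    Conn("webhdfs_default", "nn1.example.com,nn2.example.com", 50070),
]

hosts_separate = [c.host for c in candidates("webhdfs_default", separate_rows)]
hosts_single = [c.host for c in candidates("webhdfs_default", single_row)]

print(hosts_separate)  # two candidates, one per namenode
print(hosts_single)    # one candidate whose "host" is not a usable hostname
```

With separate rows there are two candidates to try in turn; with the
comma-separated form there is only one, and its host string cannot be
resolved as is.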


XD

On Thu, Aug 30, 2018 at 11:17 Manu Zhang <ow...@gmail.com> wrote:

> Hi Ben,
>
> How do you set multiple connections through the Web UI (from the Connections
> item of the Admin pull-down list)? I tried setting a comma-separated list on
> a conn_id, but that doesn't work.
>
> Thanks,
> Manu

Re: WebHdfsSensor doesn't support HDFS HA

Posted by Manu Zhang <ow...@gmail.com>.
Hi Ben,

How do you set multiple connections through the Web UI (from the Connections
item of the Admin pull-down list)? I tried setting a comma-separated list on
a conn_id, but that doesn't work.

Thanks,
Manu


On Wed, Aug 29, 2018 at 11:31 PM Ben Laird <br...@gmail.com> wrote:

> Hi Manu,
>
> We have the same use case as you: a primary and a backup namenode. If I
> understand your issue correctly, the WebHdfsSensor code checks an iterable
> of Airflow connections to the namenodes to find one that is active.
>
> However, my issue (which I've emailed this list about) was that you cannot
> set multiple connections with the same name (e.g. webhdfs_default) through
> the CLI, only in the Web interface. I'm planning on submitting a PR soon to
> remedy this.
>
> Ben

Re: WebHdfsSensor doesn't support HDFS HA

Posted by Ben Laird <br...@gmail.com>.
Hi Manu,

We have the same use case as you: a primary and a backup namenode. If I
understand your issue correctly, the WebHdfsSensor code checks an iterable
of Airflow connections to the namenodes to find one that is active.
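
As a rough illustration of that lookup (not the actual hook code), the
failover idea might be sketched as below. `Conn`, `StandbyError`,
`first_active`, `fake_probe`, and the host names are hypothetical; a real
probe would issue a cheap WebHDFS request against each namenode.

```python
from collections import namedtuple

# Hypothetical stand-in for an Airflow connection; only the fields used here.
Conn = namedtuple("Conn", ["conn_id", "host", "port"])


class StandbyError(Exception):
    """Raised by a probe when a namenode is in standby or unreachable."""


def first_active(connections, probe):
    """Return the first connection whose namenode passes the probe.

    `probe` is a caller-supplied callable (an assumption of this sketch)
    that raises StandbyError for a standby or unreachable namenode.
    """
    for conn in connections:
        try:
            probe(conn)
            return conn
        except StandbyError:
            continue  # try the next candidate
    raise RuntimeError("no active namenode among %d connections" % len(connections))


def fake_probe(conn):
    # Pretend nn1 is in standby and nn2 is active.
    if conn.host == "nn1.example.com":
        raise StandbyError(conn.host)


candidates = [
    Conn("webhdfs_default", "nn1.example.com", 50070),
    Conn("webhdfs_default", "nn2.example.com", 50070),
]
active = first_active(candidates, fake_probe)
print(active.host)
```

The loop only helps if both namenodes are registered as separate
connections under the same conn_id, which is where the CLI limitation
below comes in.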

However, my issue (which I've emailed this list about) was that you cannot
set multiple connections with the same name (e.g. webhdfs_default) through
the CLI, only in the Web interface. I'm planning on submitting a PR soon to
remedy this.

Ben

On Wed, Aug 29, 2018 at 2:57 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Hi Manu,
>
> Thanks for raising this question. There is a PR
> <https://github.com/apache/incubator-airflow/pull/3560> for moving to hdfs3.
> There is code in the existing codebase that supports HA
> <https://github.com/apache/incubator-airflow/blob/53b89b98371c7bb993b242c341d3941e9ce09f9a/airflow/hooks/hdfs_hook.py#L92-L96>,
> but this might not be for the sensor.
>
> Personally I'm not familiar with pyarrow.hdfs, so I'm not the one to judge
> how mature it is. We need to replace Snakebite for sure since it is only
> compatible with Python 2.7.
>
> Cheers, Fokko

Re: WebHdfsSensor doesn't support HDFS HA

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi Manu,

Thanks for raising this question. There is a PR
<https://github.com/apache/incubator-airflow/pull/3560> for moving to hdfs3.
There is code in the existing codebase that supports HA
<https://github.com/apache/incubator-airflow/blob/53b89b98371c7bb993b242c341d3941e9ce09f9a/airflow/hooks/hdfs_hook.py#L92-L96>,
but this might not be for the sensor.

Personally I'm not familiar with pyarrow.hdfs, so I'm not the one to judge
how mature it is. We need to replace Snakebite for sure since it is only
compatible with Python 2.7.

Cheers, Fokko


On Wed, Aug 29, 2018 at 04:29, Manu Zhang <ow...@gmail.com> wrote:

> Hi all,
>
> We've been happily using WebHdfsSensor to monitor the state of upstream
> tasks that write to HDFS, except when there is a namenode switch. I've
> opened https://issues.apache.org/jira/browse/AIRFLOW-2901 to discuss
> HDFS HA support.
>
> I can see two possible solutions:
>
> 1. use pyarrow.hdfs, which has HA support
> 2. allow the user to configure a list of namenodes
>
> WDYT?
>
> Thanks,
> Manu Zhang
>