Posted to user@flink.apache.org by Vijay Srinivasaraghavan <vi...@yahoo.com> on 2017/08/23 18:00:33 UTC

Support for multiple HDFS

Hello,
Is it possible for a Flink cluster to use multiple HDFS repositories (HDFS-1 for managing the Flink state backend, HDFS-2 for syncing results from user jobs)?
The scenario can be viewed in the context of running some jobs that are meant to push the results to an archive repository (cold storage).
Since the Hadoop configuration is static, I think this is hard to achieve, but I could be wrong.
Please share any thoughts.
Regards,
Vijay

Re: Support for multiple HDFS

Posted by Haohui Mai <ri...@gmail.com>.

You can definitely use absolute URIs to access two clusters. The
configuration just has to be the union of the configurations of the
HDFS clusters involved (e.g., the NameNode lists).
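
To make the "union" idea concrete, here is a minimal Python sketch. It is
not Flink or Hadoop code; it only merges two sets of client-side
properties into one. It assumes both clusters are addressed through HA
nameservices. The nameservice IDs (clusterOne, clusterTwo), the host
names, and the helper merge_client_configs are hypothetical. The property
keys follow the usual HDFS HA client settings.

```python
# Hypothetical client properties for two independent HDFS clusters.
CLUSTER_ONE = {
    "dfs.nameservices": "clusterOne",
    "dfs.ha.namenodes.clusterOne": "nn1,nn2",
    "dfs.namenode.rpc-address.clusterOne.nn1": "one-nn1.example.com:8020",
    "dfs.namenode.rpc-address.clusterOne.nn2": "one-nn2.example.com:8020",
}
CLUSTER_TWO = {
    "dfs.nameservices": "clusterTwo",
    "dfs.ha.namenodes.clusterTwo": "nn1,nn2",
    "dfs.namenode.rpc-address.clusterTwo.nn1": "two-nn1.example.com:8020",
    "dfs.namenode.rpc-address.clusterTwo.nn2": "two-nn2.example.com:8020",
}


def merge_client_configs(*cluster_props):
    """Union several clusters' client properties into one dict.

    dfs.nameservices is a comma-separated list, so its values are
    concatenated. Any other key must not carry two different values.
    """
    merged = {}
    nameservices = []
    for props in cluster_props:
        for key, value in props.items():
            if key == "dfs.nameservices":
                for ns in value.split(","):
                    ns = ns.strip()
                    if ns and ns not in nameservices:
                        nameservices.append(ns)
            elif key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for {key!r}")
            else:
                merged[key] = value
    if nameservices:
        merged["dfs.nameservices"] = ",".join(nameservices)
    return merged


if __name__ == "__main__":
    for key, value in sorted(merge_client_configs(CLUSTER_ONE, CLUSTER_TWO).items()):
        print(f"{key} = {value}")
```

The conflict check is the interesting part: keys scoped by nameservice ID
merge cleanly, while cluster-wide settings that differ (security settings,
for instance) cannot both live in one static configuration.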

Accessing both secure and non-secure clusters is fairly tricky, but it
can be done.

AFAIK, isolating the Hadoop configuration would require a lot of changes
in Flink itself today.

~Haohui

On Thu, Aug 24, 2017 at 5:30 AM, Vijay Srinivasaraghavan
<vi...@yahoo.com.invalid> wrote:
> I think it may not work in a scenario where Hadoop security is enabled and each HCFS setup is configured differently, unless there is a way to isolate the Hadoop configurations used in this case?
>
> Regards,
> Vijay

Re: Support for multiple HDFS

Posted by Vijay Srinivasaraghavan <vi...@yahoo.com.INVALID>.

I think it may not work in a scenario where Hadoop security is enabled and each HCFS setup is configured differently, unless there is a way to isolate the Hadoop configurations used in this case?

Regards,
Vijay

> On Aug 24, 2017, at 2:51 AM, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi!
> 
> I think it can work if you fully qualify the URIs.
> 
> For the checkpoint configuration, specify one namenode (in
> flink-conf.yaml or in the constructor of the state backend).
> Example:   state.backend.fs.checkpointdir:
> hdfs://dfsOneNamenode:port/flink/checkpoints
> 
> For the result (for example, a rolling sink), configure it with
> hdfs://dfsTwoNamenode:otherport/flink/result
> 
> Is that what you are looking for?
> 
> Stephan

Re: Support for multiple HDFS

Posted by Stephan Ewen <se...@apache.org>.

Hi!

I think it can work if you fully qualify the URIs.

For the checkpoint configuration, specify one namenode (in
flink-conf.yaml or in the constructor of the state backend).
Example:   state.backend.fs.checkpointdir:
hdfs://dfsOneNamenode:port/flink/checkpoints

For the result (for example, a rolling sink), configure it with
hdfs://dfsTwoNamenode:otherport/flink/result
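
As a rough illustration of why full qualification matters, here is a
small Python sketch. It uses only the standard library and does not touch
Flink or Hadoop. It shows that a fully qualified URI carries its own
scheme and authority, so the two paths identify different file systems
regardless of any default. The port numbers 8020 and 9000 and the helper
cluster_of are placeholders chosen for this example.

```python
from urllib.parse import urlsplit

# Placeholder locations; host names and ports are illustrative only.
CHECKPOINT_DIR = "hdfs://dfsOneNamenode:8020/flink/checkpoints"
RESULT_DIR = "hdfs://dfsTwoNamenode:9000/flink/result"


def cluster_of(uri):
    """Return scheme://authority, the part that identifies the file system.

    Raises ValueError when the URI has no scheme or no authority, because
    such a path would fall back to whatever default file system is set.
    """
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not fully qualified: {uri!r}")
    return f"{parts.scheme}://{parts.netloc}"


if __name__ == "__main__":
    print(cluster_of(CHECKPOINT_DIR))
    print(cluster_of(RESULT_DIR))
```

A path such as /flink/result or hdfs:///flink/result is rejected by this
check, since neither names a namenode.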

Is that what you are looking for?

Stephan


On Thu, Aug 24, 2017 at 11:47 AM, Stefan Richter <
s.richter@data-artisans.com> wrote:

> Hi,
>
> I don’t think that this is currently supported. If you see a use case for
> this (over creating different root directories for checkpoint data and
> result data) then I suggest that you open a JIRA issue with a new feature
> request.
>
> Best,
> Stefan

Re: Support for multiple HDFS

Posted by Stefan Richter <s....@data-artisans.com>.

Hi,

I don’t think that this is currently supported. If you see a use case for this (over creating different root directories for checkpoint data and result data) then I suggest that you open a JIRA issue with a new feature request.

Best,
Stefan

> Am 23.08.2017 um 20:17 schrieb Vijay Srinivasaraghavan <vi...@yahoo.com>:
> 
> Hi Ted,
> 
> I believe HDFS-6584 is more of an HDFS feature supporting the archive use case through some policy configurations.
> 
> My case is that I have two distinct, independent HCFS file systems. The Flink job decides which one to use for the sink, while the Flink infrastructure is by default configured with one of them as the state backend store.
> 
> Hope this helps.
> 
> Regards
> Vijay

Re: Support for multiple HDFS

Posted by Vijay Srinivasaraghavan <vi...@yahoo.com>.

Hi Ted,
I believe HDFS-6584 is more of an HDFS feature supporting the archive use case through some policy configurations.
My case is that I have two distinct, independent HCFS file systems. The Flink job decides which one to use for the sink, while the Flink infrastructure is by default configured with one of them as the state backend store.
Hope this helps.
Regards,
Vijay

On Wednesday, August 23, 2017 11:06 AM, Ted Yu <yu...@gmail.com> wrote:

> Would HDFS-6584 help with your use case?

Re: Support for multiple HDFS

Posted by Ted Yu <yu...@gmail.com>.

Would HDFS-6584 help with your use case?

On Wed, Aug 23, 2017 at 11:00 AM, Vijay Srinivasaraghavan <
vijikarthi@yahoo.com.invalid> wrote:

> Hello,
> Is it possible for a Flink cluster to use multiple HDFS repositories (HDFS-1
> for managing the Flink state backend, HDFS-2 for syncing results from user
> jobs)?
> The scenario can be viewed in the context of running some jobs that are
> meant to push the results to an archive repository (cold storage).
> Since the Hadoop configuration is static, I think this is hard to
> achieve, but I could be wrong.
> Please share any thoughts.
> Regards,
> Vijay
