Posted to dev@reef.apache.org by Rogan Carr <ro...@gmail.com> on 2017/09/24 05:57:32 UTC

[REEF-1892] HDFS File Copy only uses local HDFS

Hi All,

I have opened an issue, REEF-1892 [1], because file IO to WASB is broken in
REEF 0.17.x.

In REEF-1827 [2], the URIs used to specify remote and local files were
changed to use `AbsolutePath`. [3]

This means that a file specified as "hdfs://namenode/my/file" becomes
"/my/file", and the hdfs:// scheme and host are assumed by the `dfs` command.

This is fine if you are using vanilla HDFS, but for cases like Blob Storage
in Azure, there is a special prefix, `wasb://`, that is used instead of
`hdfs://`. This means that AbsolutePath trims off the `wasb://` prefix, and
the Copy() function attempts to download the file from the local HDFS
instead of WASB.

Best,
Rogan

[1] https://issues.apache.org/jira/browse/REEF-1892

[2] https://issues.apache.org/jira/browse/REEF-1827

[3] The code in question:

public void Copy(Uri sourceUri, Uri destinationUri)
{
-    _commandRunner.Run("dfs -cp " + sourceUri + " " + destinationUri);
+    _commandRunner.Run("dfs -cp " + sourceUri.AbsolutePath + " " + destinationUri.AbsolutePath);
}
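
To make the effect concrete, here is a minimal sketch of what `AbsolutePath` keeps and what it discards. It is only an illustration, not REEF code: `UriPathDemo` and `ToDfsArgument` are hypothetical names, and the host, container and account names in the URIs are placeholders.

```csharp
using System;

public static class UriPathDemo
{
    // Mirrors what the changed Copy() hands to "dfs -cp": only the path part.
    public static string ToDfsArgument(Uri uri)
    {
        return uri.AbsolutePath;
    }

    public static void Main()
    {
        // Placeholder URIs, assumed for illustration only.
        var hdfs = new Uri("hdfs://namenode/my/file");
        var wasb = new Uri("wasb://container@account.blob.core.windows.net/my/file");

        // Both calls yield the same bare path, so "dfs" can no longer tell
        // the two file systems apart and falls back to the cluster default.
        Console.WriteLine(ToDfsArgument(hdfs));
        Console.WriteLine(ToDfsArgument(wasb));

        // The scheme is still available on the Uri object itself.
        Console.WriteLine(wasb.Scheme);
    }
}
```

Because both arguments collapse to the same string, nothing downstream of `Copy()` can recover which file system the caller meant.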

RE: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.
Hi Rogan and Markus,

Here is the link to the PR: #1383 (https://github.com/apache/reef/pull/1383) to fix REEF-1827. Let me know what you think. Thanks.

Best,
Shouheng

-----Original Message-----
From: Rogan Carr [mailto:rogan.carr@gmail.com] 
Sent: Tuesday, September 26, 2017 5:11 PM
To: dev@reef.apache.org
Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS

Hi Markus,

I'd say an exception handler so that we don't have to bubble up the option to the client and change the IFileSystem interface.

Best,
Rogan


RE: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.
I'm working on this this afternoon.

-----Original Message-----
From: Markus Weimer [mailto:markus@weimo.de] 
Sent: Tuesday, September 26, 2017 5:17 PM
To: REEF Developers Mailinglist <de...@reef.apache.org>
Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS

We seem to have converged. Who wants to take this on, coding wise?

Thanks,

Markus


Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Markus Weimer <ma...@weimo.de>.
We seem to have converged. Who wants to take this on, coding wise?

Thanks,

Markus

On Tue, Sep 26, 2017 at 5:11 PM, Rogan Carr <ro...@gmail.com> wrote:

> Hi Markus,
>
> I'd say an exception handler so that we don't have to bubble up the option
> to the client and change the IFileSystem interface.
>
> Best,
> Rogan

Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Rogan Carr <ro...@gmail.com>.
Hi Markus,

I'd say an exception handler so that we don't have to bubble up the option
to the client and change the IFileSystem interface.

Best,
Rogan

On Mon, Sep 25, 2017 at 4:28 PM, Shouheng Yi <sh...@microsoft.com.invalid>
wrote:

> +1

RE: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.
+1

-----Original Message-----
From: Markus Weimer [mailto:markus@weimo.de] 
Sent: Monday, September 25, 2017 12:57 PM
To: REEF Developers Mailinglist <de...@reef.apache.org>
Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS

Hi,

maybe we can code around both requirements. Have a look at this code:

```
string host = "hOst";
string protocol = "http";
string path = "path/to/fIle.txt";
Uri foo = new Uri($"{protocol}://{host}/{path}");

Console.WriteLine($"Uri.ToString: {foo.ToString()}");
Console.WriteLine($"Uri.OriginalString: {foo.OriginalString}");
```

It prints:

```
Uri.ToString: http://host/path/to/fIle.txt
Uri.OriginalString: http://hOst/path/to/fIle.txt
```

Hence, we can use the `OriginalString` property to fix this. However, we would lose the benefit of the `Uri` class normalizing that string. We can either add a configuration parameter for this or make its handling part of an exception handler when files can't be found.

WDYT?

Markus




Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Markus Weimer <ma...@weimo.de>.
Hi,

maybe we can code around both requirements. Have a look at this code:

```
string host = "hOst";
string protocol = "http";
string path = "path/to/fIle.txt";
Uri foo = new Uri($"{protocol}://{host}/{path}");

Console.WriteLine($"Uri.ToString: {foo.ToString()}");
Console.WriteLine($"Uri.OriginalString: {foo.OriginalString}");
```

It prints:

```
Uri.ToString: http://host/path/to/fIle.txt
Uri.OriginalString: http://hOst/path/to/fIle.txt
```

Hence, we can use the `OriginalString` property to fix this. However, we
would lose the benefit of the `Uri` class normalizing that string. We can
either add a configuration parameter for this or make its handling part of
an exception handler when files can't be found.
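
One way to read the exception-handler option, as a minimal sketch only: try the path-only form first and retry with the string exactly as the user typed it if that fails. The `runDfs` delegate is a hypothetical stand-in for the command runner, the order of the two attempts is an assumption, and the WASB URIs are placeholders.

```csharp
using System;

public static class CopyFallbackDemo
{
    // runDfs is a hypothetical stand-in for the command runner: it receives
    // the "dfs" arguments and is assumed to throw when the command fails.
    public static void Copy(Uri source, Uri destination, Action<string> runDfs)
    {
        try
        {
            // First attempt: path only, as introduced by REEF-1827.
            runDfs("-cp " + source.AbsolutePath + " " + destination.AbsolutePath);
        }
        catch (Exception)
        {
            // Fallback: the exact strings the user typed, scheme and host
            // included. Real code would catch a narrower exception type.
            runDfs("-cp " + source.OriginalString + " " + destination.OriginalString);
        }
    }

    public static void Main()
    {
        // Placeholder URIs, assumed for illustration only.
        var source = new Uri("wasb://container@account.blob.core.windows.net/in.txt");
        var destination = new Uri("wasb://container@account.blob.core.windows.net/out.txt");

        // Fake runner: pretends that bare paths cannot be found on this cluster.
        Copy(source, destination, args =>
        {
            Console.WriteLine("dfs " + args);
            if (!args.Contains("://"))
            {
                throw new InvalidOperationException("file not found");
            }
        });
    }
}
```

A caveat with this ordering: if a file happens to exist at the same path on the default file system, the first attempt succeeds and the fallback never runs.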

WDYT?

Markus



On Mon, Sep 25, 2017 at 10:29 AM, Shouheng Yi <sh...@microsoft.com.invalid>
wrote:

> Hi Markus and Rogan,
>
> I proposed REEF-1827 because some clusters have specific rules on their
> hostname configuration: each letter in the hostname must be correctly
> capitalized. Both keeping and undoing REEF-1827 have pros and cons. Here
> is a non-exhaustive list:
>
> Keeping REEF-1827
> Pros:
> 1. It caters to clusters with strange DNS configs, and the majority of
> distributed file systems are based on HDFS.
> 2. It makes clear that the user is responsible for specifying the exact
> file path on the distributed file system, with no room for interpretation.
> Cons:
> 1. It is not compatible with WASB or other distributed file systems.
>
> Undoing REEF-1827
> Pros:
> 1. The file system can be inferred from the URI when running "dfs".
> 2. It forces users to adopt the correct naming convention for hostnames.
> However, hostnames most likely predate the applications, so it's difficult
> for applications to change cluster setups.
> Cons:
> 1. We need to make forks for those strange clusters and provide support
> for those forks.
>
> Fix:
> I think it's the cluster users' responsibility to point correctly to where
> the file is. I believe that if we do exactly what they typed in their
> program, it will be easier for the users to debug. I think we can keep
> REEF-1827 but also let the user specify which file system is being used.
> We can then construct a file path and check, before "dfs" is called, that
> the file path is valid.
>
> Best,
> Shouheng
>
> -----Original Message-----
> From: Rogan Carr [mailto:rogan.carr@gmail.com]
> Sent: Sunday, September 24, 2017 8:53 PM
> To: dev@reef.apache.org
> Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS
>
> Hi Markus,
>
> >> There is no pretty solution that comes to mind. From a principled
> >> standpoint, we should undo REEF-1827. Hostnames are supposed to be
> >> case insensitive. However, clusters which don't adhere to that
> >> standard exist.
> >> Hence, we might need some work-around for them.
>
> I think the best path forward is for us to put together a fix that
> provides the former functionality along with a workaround for the
> capitalization issue addressed in REEF-1827. I'd rather not roll back
> REEF-1827 unless this turns out to be a difficult undertaking.
>
> Best,
> Rogan
>
> On Sun, Sep 24, 2017 at 10:10 AM, Markus Weimer <ma...@weimo.de> wrote:
>
> > This looks like a really nasty interaction between the cluster
> > infrastructure and our code:
> >
> > REEF-1827 became necessary because some clusters have odd DNS setups
> > where the capitalization of hostnames mattered.
> > `hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the same
> > file as `hdfs://myfancynamenode/some/path.txt`. Stripping the protocol
> > and host from the URL fixes that.
> >
> > However, that assumes that the relative path given then is evaluated
> > with respect to the right host and protocol. This assumption is true,
> > if it references a file on the *default* protocol and host of the
> cluster.
> > However, that default filesystem on HDI seems to be the local HDFS of
> > the cluster, not the WASB filesystem.
> >
> > There is no pretty solution that comes to mind. From a principled
> > standpoint, we should undo REEF-1827. Hostnames are supposed to be
> > case insensitive. However, clusters which don't adhere to that
> > standard exist. Hence, we might need some work-around for them.
> >
> > Markus
> >
>

RE: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.
Hi Markus and Rogan,

I proposed REEF-1827 because some clusters have specific rules for their hostname configuration: each letter in the hostname must be correctly capitalized. Keeping and undoing REEF-1827 each have their pros and cons. Here is a non-exhaustive list:

Keeping REEF-1827
Pros:
1. It caters to clusters with strange DNS configs, and the majority of distributed file systems are based on HDFS.
2. It makes clear that the user is responsible for specifying the exact file path on the distributed file system, with no room for interpretation.
Cons:
1. It is not compatible with WASB or other distributed file systems.

Undoing REEF-1827
Pros:
1. The file system can be inferred from the URI when "dfs" is called.
2. It forces users to adopt the correct naming convention for hostnames. However, hostnames most likely predate the applications, so it is difficult for applications to change cluster setups.
Cons:
1. We would need to create forks for those strange clusters and provide support for those forks.

Fix:
I think it is the cluster users' responsibility to point correctly to where the file is. I believe that if we do exactly what they typed in their program, it will be easier for users to debug. I think we can keep REEF-1827 but also let the user specify which file system is being used. We can then construct a file path and check that it is valid before "dfs" is called.
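
A rough C# sketch of that idea follows. It is not the actual REEF
implementation: DfsPathBuilder, Build and the fileSystemPrefix parameter are
hypothetical names, and the host, container and account names are made up.
The prefix is kept as a plain string, so its capitalization is passed on
exactly as the user typed it.

```csharp
using System;

public static class DfsPathBuilder
{
    // Hypothetical helper: fileSystemPrefix is the file system chosen by the
    // user, e.g. "wasb://container@account.blob.core.windows.net".
    public static string Build(string fileSystemPrefix, Uri fileUri)
    {
        if (string.IsNullOrWhiteSpace(fileSystemPrefix))
        {
            // No prefix given: keep the REEF-1827 behavior and pass only the path.
            return fileUri.AbsolutePath;
        }

        // Join prefix and path without producing a double slash.
        string candidate = fileSystemPrefix.TrimEnd('/') + fileUri.AbsolutePath;

        // Check the constructed location before it is handed to "dfs".
        Uri parsed;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out parsed))
        {
            throw new ArgumentException("Not a valid absolute URI: " + candidate);
        }

        return candidate;
    }

    public static void Main()
    {
        var file = new Uri("hdfs://MyFaNcyNaMeNode/some/path.txt");

        // Default file system of the cluster: only the path is passed on.
        Console.WriteLine(Build(null, file));

        // Explicit file system: the user-specified prefix is prepended.
        Console.WriteLine(Build("wasb://container@account.blob.core.windows.net/", file));
    }
}
```

The check here only verifies that the string parses as an absolute URI. It
does not verify that the file exists on the target file system.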

Best,
Shouheng

-----Original Message-----
From: Rogan Carr [mailto:rogan.carr@gmail.com] 
Sent: Sunday, September 24, 2017 8:53 PM
To: dev@reef.apache.org
Subject: Re: [REEF-1892] HDFS File Copy only uses local HDFS

Hi Markus,

>> There is no pretty solution that comes to mind. From a principled 
>> standpoint, we should undo REEF-1827. Hostnames are supposed to be 
>> case insensitive. However, clusters which don't adhere to that standard exist.
>> Hence, we might need some work-around for them.

I think the best path forward is for us to put together a fix that provides the former functionality along with a workaround for the capitalization issue addressed in REEF-1827. I'd rather not roll back REEF-1827 unless this turns out to be a difficult undertaking.

Best,
Rogan

On Sun, Sep 24, 2017 at 10:10 AM, Markus Weimer <ma...@weimo.de> wrote:

> This looks like a really nasty interaction between the cluster 
> infrastructure and our code:
>
> REEF-1827 became necessary because some clusters have odd DNS setups 
> where the capitalization of hostnames mattered.
> `hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the same 
> file as `hdfs://myfancynamenode/some/path.txt`. Stripping the protocol 
> and host from the URL fixes that.
>
> However, that assumes that the relative path given then is evaluated 
> with respect to the right host and protocol. This assumption is true, 
> if it references a file on the *default* protocol and host of the cluster.
> However, that default filesystem on HDI seems to be the local HDFS of 
> the cluster, not the WASB filesystem.
>
> There is no pretty solution that comes to mind. From a principled 
> standpoint, we should undo REEF-1827. Hostnames are supposed to be 
> case insensitive. However, clusters which don't adhere to that standard exist.
> Hence, we might need some work-around for them.
>
> Markus
>

Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Rogan Carr <ro...@gmail.com>.
Hi Markus,

>> There is no pretty solution that comes to mind. From a principled
>> standpoint, we should undo REEF-1827. Hostnames are supposed to be case
>> insensitive. However, clusters which don't adhere to that standard exist.
>> Hence, we might need some work-around for them.

I think the best path forward is for us to put together a fix that provides
the former functionality along with a workaround for the capitalization
issue addressed in REEF-1827. I'd rather not roll back REEF-1827 unless
this turns out to be a difficult undertaking.
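
One possible shape for such a fix is to try the full URI first and fall back
to the REEF-1827 behavior if that fails. The C# sketch below is only an
illustration: CopyWithFallback and the runDfs delegate are hypothetical
stand-ins for the real command runner, which is assumed here to throw when
the command fails. The order of the two attempts is also an assumption.

```csharp
using System;

public static class CopyWithFallback
{
    // runDfs stands in for the command runner (assumed to throw on failure).
    public static void Copy(Action<string> runDfs, Uri sourceUri, Uri destinationUri)
    {
        try
        {
            // First attempt: the URIs exactly as the caller wrote them,
            // including scheme and host.
            runDfs("dfs -cp " + sourceUri.OriginalString + " " + destinationUri.OriginalString);
        }
        catch (Exception)
        {
            // Fallback: path only, resolved against the default file system.
            // A real implementation would likely catch a narrower exception type.
            runDfs("dfs -cp " + sourceUri.AbsolutePath + " " + destinationUri.AbsolutePath);
        }
    }

    public static void Main()
    {
        // Fake runner that simulates a cluster which cannot resolve the
        // mixed-case host name used below.
        Action<string> fakeRunner = command =>
        {
            if (command.Contains("MyFaNcyNaMeNode"))
            {
                throw new InvalidOperationException("simulated failure: " + command);
            }
            Console.WriteLine("ran: " + command);
        };

        // Falls back to the path-only form.
        Copy(fakeRunner,
            new Uri("hdfs://MyFaNcyNaMeNode/in.txt"),
            new Uri("hdfs://MyFaNcyNaMeNode/out.txt"));

        // Succeeds on the first attempt with the wasb:// prefix intact.
        Copy(fakeRunner,
            new Uri("wasb://container@account.blob.core.windows.net/in.txt"),
            new Uri("wasb://container@account.blob.core.windows.net/out.txt"));
    }
}
```

This keeps the choice out of the client-facing interface, at the cost of a
second "dfs" invocation on clusters that need the fallback.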

Best,
Rogan

On Sun, Sep 24, 2017 at 10:10 AM, Markus Weimer <ma...@weimo.de> wrote:

> This looks like a really nasty interaction between the cluster
> infrastructure and our code:
>
> REEF-1827 became necessary because some clusters have odd DNS setups where
> the capitalization of hostnames mattered.
> `hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the same file
> as `hdfs://myfancynamenode/some/path.txt`. Stripping the protocol and host
> from the URL fixes that.
>
> However, that assumes that the relative path given then is evaluated with
> respect to the right host and protocol. This assumption is true, if it
> references a file on the *default* protocol and host of the cluster.
> However, that default filesystem on HDI seems to be the local HDFS of the
> cluster, not the WASB filesystem.
>
> There is no pretty solution that comes to mind. From a principled
> standpoint, we should undo REEF-1827. Hostnames are supposed to be case
> insensitive. However, clusters which don't adhere to that standard exist.
> Hence, we might need some work-around for them.
>
> Markus
>

Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Markus Weimer <ma...@weimo.de>.
This looks like a really nasty interaction between the cluster
infrastructure and our code:

REEF-1827 became necessary because some clusters have odd DNS setups where
the capitalization of hostnames mattered.
`hdfs://MyFaNcyNaMeNode/some/path.txt` would not evaluate to the same file
as `hdfs://myfancynamenode/some/path.txt`. Stripping the protocol and host
from the URL fixes that.

However, that assumes that the remaining path is then evaluated with
respect to the right host and protocol. This assumption holds only if the
path references a file on the *default* protocol and host of the cluster.
On HDI, however, the default filesystem seems to be the local HDFS of the
cluster, not the WASB filesystem.
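
To make the failure mode concrete, here is a minimal C# sketch that uses
only System.Uri. The class name, the WASB container and the account name are
made up for illustration. It shows that once only AbsolutePath is kept, an
hdfs:// location and a wasb:// location yield the same argument for "dfs".

```csharp
using System;

public static class UriPartsDemo
{
    // The argument that the REEF-1827 version of Copy() passes to "dfs -cp".
    public static string PathOnly(Uri uri)
    {
        return uri.AbsolutePath;
    }

    public static void Main()
    {
        // Hypothetical example locations.
        var onHdfs = new Uri("hdfs://MyFaNcyNaMeNode/some/path.txt");
        var onWasb = new Uri("wasb://container@account.blob.core.windows.net/some/path.txt");

        foreach (var uri in new[] { onHdfs, onWasb })
        {
            Console.WriteLine("original : " + uri.OriginalString);
            Console.WriteLine("scheme   : " + uri.Scheme);
            Console.WriteLine("host     : " + uri.Host);
            Console.WriteLine("dfs arg  : " + PathOnly(uri));
        }

        // Both locations reduce to the same argument, so "dfs" cannot tell
        // them apart and resolves the path against the default file system.
        Console.WriteLine(PathOnly(onHdfs) == PathOnly(onWasb));
    }
}
```

Uri.OriginalString returns the string exactly as it was passed to the
constructor, which makes it a candidate for preserving both the scheme and
the capitalization of the host.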

There is no pretty solution that comes to mind. From a principled
standpoint, we should undo REEF-1827. Hostnames are supposed to be case
insensitive. However, clusters which don't adhere to that standard exist.
Hence, we might need some work-around for them.

Markus

Re: [REEF-1892] HDFS File Copy only uses local HDFS

Posted by Rogan Carr <ro...@gmail.com>.
And here is the PR for REEF-1827: https://github.com/apache/reef/pull/1331

On Sat, Sep 23, 2017 at 10:57 PM, Rogan Carr <ro...@gmail.com> wrote:

> Hi All,
>
> I have opened an issue, REEF-1892 because file IO to WASB for REEF 0.17.x
> is broken.
>
> In REEF-1827 [2], the URI used to specify remote and local files were
> changed to use the "AbsolutePath". [3]
>
> This means that a file specified as "hdfs://my/file" becomes "/my/file"
> and the hdfs:// is assumed by the `dfs` command.
>
> This is fine if you are using vanilla HDFS, but for cases like Blob
> Storage in Azure, there is a special prefix, `wasb://` that is used instead
> of `hdfs://`. This means that the AbsolutePath method trims off the "wasb",
> and this Copy() function instead attempts to download the file from the
> local HDFS instead of WASB.
>
> Best,
> Rogan
>
> [1] https://issues.apache.org/jira/browse/REEF-1892
>
> [2] https://issues.apache.org/jira/browse/REEF-1827
>
> [3] The code in question
> public void Copy(Uri sourceUri, Uri destinationUri)
> {
>
> - _commandRunner.Run("dfs -cp " + sourceUri + " " + destinationUri);
>
> + _commandRunner.Run("dfs -cp " + sourceUri.AbsolutePath + " " +
> destinationUri.AbsolutePath);
>
> }
>