Posted to user@spark.apache.org by Enno Shioji <es...@gmail.com> on 2014/12/23 13:06:36 UTC

ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.


================
I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL
("s3://fake-test/1234").

The code does write to S3, but with double forward slashes (e.g.
"s3://fake-test//1234/-1419334280000/").

I did some debugging, and it seems the culprit is
Jets3tFileSystemStore#pathToKey(path), which returns "/fake-test/1234/..."
for the input "s3://fake-test/1234/..." when it should strip off the leading
forward slash. However, I couldn't find any JetS3t bug report for this.

Am I missing something, or is this likely a JetS3t bug?
================
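For context, the double slash can be reproduced without Spark or Hadoop at all: java.net.URI#getPath() keeps the leading slash, so using that value directly as an S3 key (as Jets3tFileSystemStore#pathToKey effectively does) yields keys starting with "/". A minimal, self-contained sketch (the class and helper names here are mine, purely illustrative):

```java
import java.net.URI;

public class DoubleSlashRepro {
    // Build the final object URL the way the block store effectively does:
    // "s3://" + bucket + "/" + key, where the key came from URI#getPath().
    static String objectUrl(String bucket, String key) {
        return "s3://" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3://fake-test/1234");

        // URI#getPath() keeps the leading slash ("/1234"), so a key taken
        // verbatim from it starts with "/".
        String blockStoreKey = uri.getPath();
        // Stripping the leading slash gives the key most tools expect.
        String nativeStoreKey = uri.getPath().substring(1);

        System.out.println(objectUrl(uri.getHost(), blockStoreKey));  // s3://fake-test//1234
        System.out.println(objectUrl(uri.getHost(), nativeStoreKey)); // s3://fake-test/1234
    }
}
```

Prepending the un-stripped key to "s3://bucket/" is exactly where the extra slash appears.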



Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

Posted by Jon Chase <jo...@gmail.com>.
I've had a lot of difficulty using the s3:// prefix; s3n:// seems
to work much better. I can't find the link at the moment, but I seem to
recall that s3:// (Hadoop's original block format for S3) is no longer
recommended for use. Amazon's EMR goes so far as to remap s3:// to s3n://
behind the scenes.


Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

Posted by Enno Shioji <es...@gmail.com>.
I filed a new issue, HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
The comment section also notes that Amazon has implemented an EmrFileSystem
for S3, built on the AWS SDK rather than JetS3t.





Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

Posted by Enno Shioji <es...@gmail.com>.
Hey Jay :)

I tried "s3n" which uses the Jets3tNativeFileSystemStore, and the double
slash went away.
As far as I can see, it does look like a bug in hadoop-common; I'll file a
ticket for it.

Hope you are doing well, by the way!

PS:
 Jets3tNativeFileSystemStore's implementation of pathToKey is:
======
  private static String pathToKey(Path path) {
    if (path.toUri().getScheme() != null &&
        path.toUri().getPath().isEmpty()) {
      // allow uris without trailing slash after bucket to refer to root,
      // like s3n://mybucket
      return "";
    }
    if (!path.isAbsolute()) {
      throw new IllegalArgumentException("Path must be absolute: " + path);
    }
    String ret = path.toUri().getPath().substring(1); // remove initial slash
    if (ret.endsWith("/") && (ret.indexOf("/") != ret.length() - 1)) {
      ret = ret.substring(0, ret.length() - 1);
    }
    return ret;
  }
======

whereas Jets3tFileSystemStore uses:
======
  private String pathToKey(Path path) {
    if (!path.isAbsolute()) {
      throw new IllegalArgumentException("Path must be absolute: " + path);
    }
    return path.toUri().getPath();
  }
======
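For a runnable side-by-side comparison of the two behaviors, here's a sketch with both pathToKey variants re-expressed against java.net.URI instead of Hadoop's Path (the isAbsolute check is dropped and the method names are mine, so treat this as an illustration, not the actual Hadoop code):

```java
import java.net.URI;

public class PathToKeyComparison {
    // Re-expression of Jets3tNativeFileSystemStore#pathToKey (the s3n store):
    // strip the leading slash, and trim a trailing slash unless the key is
    // just "dir/".
    static String nativePathToKey(URI uri) {
        if (uri.getScheme() != null && uri.getPath().isEmpty()) {
            return ""; // s3n://mybucket with no trailing slash means root
        }
        String ret = uri.getPath().substring(1); // drop the leading slash
        if (ret.endsWith("/") && ret.indexOf("/") != ret.length() - 1) {
            ret = ret.substring(0, ret.length() - 1); // drop trailing slash
        }
        return ret;
    }

    // Re-expression of Jets3tFileSystemStore#pathToKey (the s3 block store):
    // the URI path is returned verbatim, leading slash and all.
    static String blockPathToKey(URI uri) {
        return uri.getPath();
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3://fake-test/1234/-1419334280000/");
        System.out.println(blockPathToKey(uri));  // "/1234/-1419334280000/"
        System.out.println(nativePathToKey(uri)); // "1234/-1419334280000"
    }
}
```

The only difference that matters here is the substring(1): without it, the key keeps its leading "/" and shows up as a double slash once prefixed with the bucket.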







Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

Posted by Jay Vyas <ja...@gmail.com>.
Hi Enno. It might be worthwhile to cross-post this on dev@hadoop... An easy way to test this from Spark would be to change the URI to write to hdfs:// or file://, and confirm that the extra slash goes away.

- If it's indeed a JetS3t issue, we should add a new unit test for it, since the HCFS tests are passing for the JetS3t filesystem yet this error still exists.

- To learn how to run the HCFS tests against any FileSystem, see the wiki page https://wiki.apache.org/hadoop/HCFS/Progress (see the July 14th entry on that page).

- Is there another S3 FileSystem implementation for AbstractFileSystem, or is JetS3t the only one? That would be an easy way to test this, and also a good workaround.

I'm also wondering why the JetS3t filesystem is the AbstractFileSystem implementation used by so many - is it the standard implementation of the AbstractFileSystem interface?
