You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Mike Smith (JIRA)" <ji...@apache.org> on 2007/03/04 22:47:50 UTC

[jira] Created: (HADOOP-1061) S3 listSubPaths bug

S3 listSubPaths bug
-------------------

Key: HADOOP-1061
URL: https://issues.apache.org/jira/browse/HADOOP-1061
Project: Hadoop
Issue Type: Bug
Components: fs
Affects Versions: 0.12.0, 0.11.2
Reporter: Mike Smith
Priority: Critical

I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders.

I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);

instead of

S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);

in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater!

But, the recursive problem is because the GET request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter.

Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?

example:

Original path is /user/root/folder and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

unsubscribe

Posted by Aled Jones <Al...@comtec-europe.co.uk>.

 

> -----Original Message-----
> From: Tom White (JIRA) [mailto:jira@apache.org] 
> Sent: 05 March 2007 22:01
> To: hadoop-dev@lucene.apache.org
> Subject: [jira] Commented: (HADOOP-1061) S3 listSubPaths bug
> 
> 
>     [ 
> https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atl
assian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_1247817
4 ] 
> 
> Tom White commented on HADOOP-1061:
> -----------------------------------
> 
> S3 files need to be written with a version number of the 
> client library that wrote them (as suggested in HADOOP-930). 
> If we did this now, then we can detect when there is a 
> mismatch and fail fast (and informatively). While it would be 
> possible (but tricky) to support both versions, I don't feel 
> that we should do that since there is a workaround for data 
> migration: copy your S3 data from the old file system to a 
> local file or HDFS (on EC2 preferably, but this isn't 
> necessary) using the old version of Hadoop, then copy it back 
> to a new S3 file system using a new version of Hadoop. I'd be 
> happy to write this.
> 
> (I'm not saying that version-aware code will never be needed, 
> just that it isn't yet since not that many people are using 
> this feature yet.)
> 
> Thoughts?
> 
> > S3 listSubPaths bug
> > -------------------
> >
> >                 Key: HADOOP-1061
> >                 URL: 
> https://issues.apache.org/jira/browse/HADOOP-1061
> >             Project: Hadoop
> >          Issue Type: Bug
> >          Components: fs
> >    Affects Versions: 0.11.2, 0.12.0
> >            Reporter: Mike Smith
> >            Priority: Critical
> >         Attachments: 1061-hadoop.patch
> >
> >
> > I had problem with the -ls command in s3 file system. It 
> was returning inconsistence number of "Found Items" if you 
> rerun it different times and more importantly it returns 
> recursive results (depth 1) for some folders. 
> > I looked into the code, the problem is caused by jets3t 
> library. The inconsistency problem will be solved if we use :
> > S3Object[] objects = s3Service.listObjects(bucket, prefix, 
> > PATH_DELIMITER); instead of S3Object[] objects = 
> > s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0); in 
> > listSubPaths of Jets3tFileSystemStore class (line 227)! 
> This change will let GET REST request to have a "max-key" 
> paramter with default value of 1000! It seems s3 GET request 
> is sensetive to this paramater!
> > But, the recursive problem is because the GET  request 
> doesn't execute the delimiter constraint correctly. The 
> response contains all the keys with the given prefix but they 
> don't stop at the path_delimiter. You can simply test this by 
> making couple folder on hadoop s3 filesystem and run -ls. I 
> followed the generated GET request and it looks all fine but 
> it is not executed correctly at the s3 server side.I still 
> don't know why the response doesn't stop at the path_delimiter. 
> > Possible casue: Jets3t library does URL encoding, why do we 
> need to do URL encoding in Jets3tFileSystemStore class!?
> > example:
> > Original path is   /user/root/folder  and it will be 
> encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore 
> class. Then, Jets3t will reencode this to make the REST 
> request. And it will be rewritten as 
> %252Fuser%252Froot%252Ffolder, so the the generated folder on 
> the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the 
> amazon side. Wouldn't be better to skip the encoding part on 
> Hadoop. This strange structure might be the reason that the 
> s3 doesn't stop at the path_delimiter. 
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
###########################################

This message has been scanned by F-Secure Anti-Virus for Microsoft Exchange.
For more information, connect to http://www.f-secure.com/

************************************************************************
This e-mail and any attachments are strictly confidential and intended solely for the addressee. They may contain information which is covered by legal, professional or other privilege. If you are not the intended addressee, you must not copy the e-mail or the attachments, or use them for any purpose or disclose their contents to any other person. To do so may be unlawful. If you have received this transmission in error, please notify us as soon as possible and delete the message and attachments from all places in your computer where they are stored. 

Although we have scanned this e-mail and any attachments for viruses, it is your responsibility to ensure that they are actually virus free.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I've just committed this.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Mike Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478625 ] 

Mike Smith commented on HADOOP-1061:
------------------------------------

Tom, 

I checked the second patch and it works fine and detects the version mismatch. Also, it seems "not stopping at the path_delimiter" problem is fixed after removing double encoding. 

 

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Mike Smith (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Smith updated HADOOP-1061:
-------------------------------

    Attachment: 1061-hadoop.patch

Tom, 

I made the patch that fixes the listSubPaths. Also, I removed the enconding from Jets3tFileSystemStore and there was no problem and it worked fine and there was no problem with root '/'. But, I didn't add that in the patch because that change is not backward compatible. If someone removes the encoding, then the previous built files are not readable anymore. It would be nice to make a tool to migrate the double encoded files to single '/' files when we make a patch to remove encoding from Jets3tFileSystemStore .


> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Attachment: hadoop-1061-v2.patch

Patch for review.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Mike Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477874 ] 

Mike Smith commented on HADOOP-1061:
------------------------------------

As a temporary solution I replaced the listSubPath method of Jets3tFileSystemStore class with this new mathod, which simply check if there is any tail after delimiter and also does not pass zero to s3Service.listObjects:

public Set<Path> listSubPaths(Path path) throws IOException {
    try {
      String prefix = pathToKey(path);
      if (!prefix.endsWith(PATH_DELIMITER)) {
        prefix += PATH_DELIMITER;
      }
      S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
      Set<Path> prefixes = new TreeSet<Path>();
      for (int i = 0; i < objects.length; i++) {
    	if(objects[i].getKey().replaceAll(prefix,"").indexOf(PATH_DELIMITER)<0)    	
        prefixes.add(keyToPath(objects[i].getKey()));
      }
      prefixes.remove(path);
      return prefixes;
    } catch (S3ServiceException e) {
      if (e.getCause() instanceof IOException) {
        throw (IOException) e.getCause();
      }
      throw new S3Exception(e);
    }
  }

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481537 ] 

Tom White commented on HADOOP-1061:
-----------------------------------

I haven't written a migration script yet, but I'll try to do so before 0.13.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Attachment: hadoop-1061-v3.patch

> It's not actually that hard to make it back-compatible

It's a bit harder than I thought when I wrote this, since we have changed the key format. So, you would have to check for both forms of the key before before saying the file or directory doesn't exist. I feel this would complicate the code somewhat.

I would suggest going with this new patch (v3) which additionally checks name and type. People will have to migrate old data as I originally described above. If this proved problematic, or there was demand, we could write a migration script. (the benefit of this is that it keeps the core S3FileSystem code unencumbered by version migration logic.)

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478330 ] 

Tom White commented on HADOOP-1061:
-----------------------------------

Yes, the patch should include

 * Fixing listObjects call
 * Removing URL encoding.
 * Adding version numbering metadata to files. This should use the S3 metadata feature so we can detect mismatches (and throw an exception) before reading the data.
 * Adding filesystem type metadata to files. Again using S3 metadata. This is easy to do at the same time as version and is a pre-requisite for HADOOP-930.

I'll have a go at this.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492843 ] 

Hadoop QA commented on HADOOP-1061:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12356557/HADOOP-1061-v4.patch applied and successfully tested against trunk revision r533233.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/97/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/97/console

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478174 ] 

Tom White commented on HADOOP-1061:
-----------------------------------

S3 files need to be written with a version number of the client library that wrote them (as suggested in HADOOP-930). If we did this now, then we can detect when there is a mismatch and fail fast (and informatively). While it would be possible (but tricky) to support both versions, I don't feel that we should do that since there is a workaround for data migration: copy your S3 data from the old file system to a local file or HDFS (on EC2 preferably, but this isn't necessary) using the old version of Hadoop, then copy it back to a new S3 file system using a new version of Hadoop. I'd be happy to write this.

(I'm not saying that version-aware code will never be needed, just that it isn't yet since not that many people are using this feature yet.)

Thoughts?

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492904 ] 

Doug Cutting commented on HADOOP-1061:
--------------------------------------

+1 This looks good to me.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Mike Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481359 ] 

Mike Smith commented on HADOOP-1061:
------------------------------------

Thanks Tom, 

I wondering if you've got a chance to write any migration script. I do have some data on the old scheme that I cannot afford to lose.


> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493344 ] 

Hadoop QA commented on HADOOP-1061:
-----------------------------------

Integrated in Hadoop-Nightly #77 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/77/)

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478590 ] 

Tom White commented on HADOOP-1061:
-----------------------------------

No this is not back-compatible. It's not actually that hard to make it back-compatible - so I'll produce another patch. I'll check for name and type too.

(I was thinking a similar thing about the long-term support for HTTP.)

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Attachment: HADOOP-1061-v4.patch

This patch (HADOOP-1061-v4.patch) includes MigrationTool to migrate the data written using an older version of S3FileSystem to the latest version. I have successfully tested it on an old filesystem.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478586 ] 

Doug Cutting commented on HADOOP-1061:
--------------------------------------

Should we check not just the FS version, but also the name and type?

Also, to be clear, this is not back-compatible, right?  Is that what we want?  Perhaps instead, if something has none of the metadata properties set, we could interpret it as the prior version, for back-compatibility, so folks can still read their S3 data.  We might later drop that, as long-term, we might make it so that objects with none of these properties set are treated as simple HTTP resources.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Fix Version/s: 0.13.0

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478181 ] 

Doug Cutting commented on HADOOP-1061:
--------------------------------------

We could start adding the version number property with this patch, so that one might write an automatic migration tool later.  Adding the version number ASAP is a good idea anyway.

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>         Attachments: 1061-hadoop.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1061:
------------------------------

    Status: Patch Available  (was: Open)

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.12.0, 0.11.2
>            Reporter: Mike Smith
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: 1061-hadoop.patch, hadoop-1061-v2.patch, hadoop-1061-v3.patch, HADOOP-1061-v4.patch
>
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1061) S3 listSubPaths bug

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478031 ] 

Tom White commented on HADOOP-1061:
-----------------------------------

Mike,

I agree we should use the listObjects call that doesn't take a maxListingLength. JetS3t retrieves them in batches of 1000 behind the scenes for us. (Note that listDeepSubPaths uses the correct form of the method.)

Does dropping the URL encoding in Jets3tFileSystemStore cause any problems with root ("/") - I seem to remember it may have. However, it's wrong to double encode, so I would suggest removing the encoding from Jets3tFileSystemStore and seeing if the tests pass. If they don't we should go from there (rather than using double encoding because it happens to work).

Would you be able to create a patch for this? Thanks!

> S3 listSubPaths bug
> -------------------
>
>                 Key: HADOOP-1061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1061
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.11.2, 0.12.0
>            Reporter: Mike Smith
>            Priority: Critical
>
> I had problem with the -ls command in s3 file system. It was returning inconsistence number of "Found Items" if you rerun it different times and more importantly it returns recursive results (depth 1) for some folders. 
> I looked into the code, the problem is caused by jets3t library. The inconsistency problem will be solved if we use :
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER);
> instead of 
> S3Object[] objects = s3Service.listObjects(bucket, prefix, PATH_DELIMITER , 0);
> in listSubPaths of Jets3tFileSystemStore class (line 227)! This change will let GET REST request to have a "max-key" paramter with default value of 1000! It seems s3 GET request is sensetive to this paramater! 
> But, the recursive problem is because the GET  request doesn't execute the delimiter constraint correctly. The response contains all the keys with the given prefix but they don't stop at the path_delimiter. You can simply test this by making couple folder on hadoop s3 filesystem and run -ls. I followed the generated GET request and it looks all fine but it is not executed correctly at the s3 server side.I still don't know why the response doesn't stop at the path_delimiter. 
> Possible casue: Jets3t library does URL encoding, why do we need to do URL encoding in Jets3tFileSystemStore class!?
> example:
> Original path is   /user/root/folder  and it will be encoded to %2Fuser%2Froot%2Ffolder is Jets3tFileSystemStore class. Then, Jets3t will reencode this to make the REST request. And it will be rewritten as %252Fuser%252Froot%252Ffolder, so the the generated folder on the S3 will be %2Fuser%2Froot%2Ffolder after decoding at the amazon side. Wouldn't be better to skip the encoding part on Hadoop. This strange structure might be the reason that the s3 doesn't stop at the path_delimiter. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.