Posted to common-dev@hadoop.apache.org by Paul Sheer <pa...@gmail.com> on 2009/03/05 11:58:28 UTC

Hadoop with case-preservation and case-insensitivity

Hi there,

I have the requirement to use Hadoop with case-insensitivity and
case-preservation ala Windows.

Hadoop has such a clean class hierarchy that it seems the only change
needed is in INode.java (snippet below).

Can anyone help with the following question:

If I change only the methods below (so that they all work
case-insensitively), is this sufficient?

I.e. can I trust that all file/dir name comparisons go through these
four methods?

Or will I get bitten by code elsewhere that does name comparisons or otherwise
requires case-sensitive behavior?

Many thanks for any comments.

-paul

=================

public abstract class INode implements Comparable<byte[]> {

    .............
    .............

  //
  // Comparable interface
  //
  public int compareTo(byte[] o) {
    return compareBytes(name, o);
  }

  public boolean equals(Object o) {
    if (!(o instanceof INode)) {
      return false;
    }
    return Arrays.equals(this.name, ((INode)o).name);
  }

  public int hashCode() {
    return Arrays.hashCode(this.name);
  }

  //
  // static methods
  //
  /**
   * Compare two byte arrays.
   *
   * @return a negative integer, zero, or a positive integer
   * as defined by {@link #compareTo(byte[])}.
   */
  static int compareBytes(byte[] a1, byte[] a2) {
    if (a1==a2)
        return 0;
    int len1 = (a1==null ? 0 : a1.length);
    int len2 = (a2==null ? 0 : a2.length);
    int n = Math.min(len1, len2);
    byte b1, b2;
    for (int i=0; i<n; i++) {
      b1 = a1[i];
      b2 = a2[i];
      if (b1 != b2)
        return b1 - b2;
    }
    return len1 - len2;
  }

    .............
    .............

}
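
[Editorial sketch, not part of the original post.] For illustration only,
the kind of change being asked about might look like the following. It is
a minimal sketch under stated assumptions: the class name
CaseInsensitiveNames and the fold() helper are hypothetical and are not
Hadoop code, folding covers ASCII letters only (non-ASCII UTF-8 bytes are
compared as-is), and bytes are treated as unsigned, which differs from
the signed comparison in the snippet above. It does not answer whether
these four methods are the only places that compare names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: ASCII-only case-insensitive
// comparison, equality and hashing for file-name bytes.
final class CaseInsensitiveNames {

  // Fold ASCII 'A'..'Z' to lower case; every other byte is left unchanged.
  static int fold(byte b) {
    int c = b & 0xff; // treat the byte as unsigned
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
  }

  // Same contract as compareBytes above, but on folded bytes.
  static int compareBytes(byte[] a1, byte[] a2) {
    if (a1 == a2) {
      return 0;
    }
    int len1 = (a1 == null ? 0 : a1.length);
    int len2 = (a2 == null ? 0 : a2.length);
    int n = Math.min(len1, len2);
    for (int i = 0; i < n; i++) {
      int c1 = fold(a1[i]);
      int c2 = fold(a2[i]);
      if (c1 != c2) {
        return c1 - c2;
      }
    }
    return len1 - len2;
  }

  static boolean equalsIgnoreCase(byte[] a1, byte[] a2) {
    return compareBytes(a1, a2) == 0;
  }

  // Must hash folded bytes, otherwise names that compare equal could
  // land in different hash buckets. Null hashes like an empty array,
  // matching compareBytes, which treats the two as equal.
  static int hashCode(byte[] a) {
    int h = 1;
    if (a != null) {
      for (byte b : a) {
        h = 31 * h + fold(b);
      }
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] upper = "Foo.TXT".getBytes(StandardCharsets.UTF_8);
    byte[] lower = "foo.txt".getBytes(StandardCharsets.UTF_8);
    System.out.println(equalsIgnoreCase(upper, lower));       // true
    System.out.println(hashCode(upper) == hashCode(lower));   // true
  }
}
```

The point of the sketch is that compareTo, equals and hashCode have to
change together: if only the comparison is made case-insensitive, two
names can compare as equal while hashing differently.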

Re: Hadoop with case-preservation and case-insensitivity

Posted by Steve Loughran <st...@apache.org>.
Doug Cutting wrote:
> Paul Sheer wrote:
>> I have the requirement to use Hadoop with case-insensitivity and
>> case-preservation ala Windows.
> 
> I think you may have difficulty convincing folks that Hadoop should 
> directly support this mode of operation, and it's also a bad idea to run 
> a hacked version of HDFS, since that will be hard to maintain.
> 
> The safest and simplest way to support this might be to layer it on top 
> of the standard API.  You can implement a FilterFileSystem that, when 
> opening files or listing directories, uses case-insensitive comparisons. 
>  So, to open "/foo/bar" you'd first list "/" looking for subdirectories 
> which case-insensitively match "foo", then, if one is found, list it 
> looking for a file which case-insensitively matches "bar".  Could this 
> suffice?
> 
> Doug

Full Windows case logic is pretty bizarre, as you need to ignore case in
all file operations; "mv lower LOWER" would result in a file called
"lower" because of the rule that if there is a destination file whose
case-insensitive name matches that of the target file, it becomes the
destination name.

Other issues:
- It should be impossible to create two files in the same directory with
the same case-insensitive name.
- You need to take locale into account when comparing case. Turkey is
the test case, as "I".toLower() != "i"; it's the place where you get the
bug reports when your logic is broken.
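
[Editorial sketch, not part of the original post.] The two issues above
can be illustrated with a small, self-contained Java example. The class
CasePreservingDirectory and its methods are hypothetical names invented
for this sketch and have nothing to do with Hadoop's own classes. It
assumes names are folded with Locale.ROOT so that the result does not
depend on the JVM's default locale; whether that is the right folding
rule for a given deployment is a separate decision.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Hadoop code: a directory that preserves the
// case a name was created with, but refuses names differing only by case.
final class CasePreservingDirectory {
  // Folded name -> name exactly as it was created.
  private final Map<String, String> entries = new LinkedHashMap<>();

  // Fold with a fixed locale so behavior does not vary per machine.
  static String foldKey(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  // Returns false if a name differing only by case already exists.
  boolean create(String name) {
    return entries.putIfAbsent(foldKey(name), name) == null;
  }

  // Returns the stored, case-preserved name, or null if absent.
  String lookup(String name) {
    return entries.get(foldKey(name));
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr-TR");
    // In the Turkish locale, upper-case I lower-cases to dotless i (U+0131).
    System.out.println("I".toLowerCase(turkish).equals("i"));      // false
    System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));  // true

    CasePreservingDirectory dir = new CasePreservingDirectory();
    System.out.println(dir.create("Lower"));  // true
    System.out.println(dir.create("LOWER"));  // false, collides with "Lower"
    System.out.println(dir.lookup("lower"));  // Lower
  }
}
```

Code that calls toLowerCase() with no argument uses the default locale,
which is how the Turkish case tends to surface only on users' machines.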

I would stay very clear of it.

Re: Hadoop with case-preservation and case-insensitivity

Posted by Doug Cutting <cu...@apache.org>.
Paul Sheer wrote:
> Sorry if I gave the impression that Hadoop ought to support this feature 
> in general.
> No, I was only asking about my own setup and I'm happy to maintain my own
> private branch.

You didn't imply that Hadoop ought to support it.  But maintaining your 
own private branch is a bad idea long-term, and you'll not get a lot of 
help here for doing that, since the goal here is to build a shared 
version that we can all support together.

> Can you help by telling me if changes to INode.java are all the changes
> I need to make?

I don't know.  There's a good chance it's not the only change you'd need 
to make, and there's a good chance that folks might later make other 
changes that break your version in strange and hard-to-detect ways.  So, 
if you do decide to maintain your own branch, I strongly suggest you 
also write a thorough test suite for this feature.

Cheers,

Doug

Re: Hadoop with case-preservation and case-insensitivity

Posted by Paul Sheer <pa...@gmail.com>.
Thanks very much for the reply,

Sorry if I gave the impression that Hadoop ought to support this feature in
general.
No, I was only asking about my own setup and I'm happy to maintain my own
private branch.

Can you help by telling me if changes to INode.java are all the changes
I need to make?

The layer you describe is a great idea, so I will certainly consider this
option.

-paul


On Thu, Mar 5, 2009 at 8:48 PM, Doug Cutting <cu...@apache.org> wrote:

> Paul Sheer wrote:
>
>> I have the requirement to use Hadoop with case-insensitivity and
>> case-preservation ala Windows.
>>
>
> I think you may have difficulty convincing folks that Hadoop should
> directly support this mode of operation, and it's also a bad idea to run a
> hacked version of HDFS, since that will be hard to maintain.
>
> The safest and simplest way to support this might be to layer it on top of
> the standard API.  You can implement a FilterFileSystem that, when opening
> files or listing directories, uses case-insensitive comparisons.  So, to
> open "/foo/bar" you'd first list "/" looking for subdirectories which
> case-insensitively match "foo", then, if one is found, list it looking for a
> file which case-insensitively matches "bar".  Could this suffice?
>
> Doug
>

Re: Hadoop with case-preservation and case-insensitivity

Posted by Doug Cutting <cu...@apache.org>.
Paul Sheer wrote:
> I have the requirement to use Hadoop with case-insensitivity and
> case-preservation ala Windows.

I think you may have difficulty convincing folks that Hadoop should 
directly support this mode of operation, and it's also a bad idea to run 
a hacked version of HDFS, since that will be hard to maintain.

The safest and simplest way to support this might be to layer it on top 
of the standard API.  You can implement a FilterFileSystem that, when 
opening files or listing directories, uses case-insensitive comparisons. 
  So, to open "/foo/bar" you'd first list "/" looking for subdirectories 
which case-insensitively match "foo", then, if one is found, list it 
looking for a file which case-insensitively matches "bar".  Could this 
suffice?

Doug
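
[Editorial sketch, not part of the original post.] The lookup described
above (list each directory, pick the child that matches
case-insensitively, descend) can be sketched without any Hadoop
dependency. Everything named here is an assumption made for
illustration: CaseInsensitiveResolver and its Lister interface are
hypothetical, with Lister standing in for a real directory-listing call
such as FileSystem.listStatus. A real FilterFileSystem subclass would
wrap this logic around the underlying file system's calls.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of case-insensitive path resolution layered on
// top of a case-sensitive listing API.
final class CaseInsensitiveResolver {

  // Stand-in for a directory listing call; returns child names of dir,
  // or an empty list if dir has no children or does not exist.
  interface Lister {
    List<String> list(String dir);
  }

  // Resolves path component by component and returns the path with the
  // case actually stored, or empty if some component has no match.
  static Optional<String> resolve(Lister fs, String path) {
    String current = "/";
    for (String component : path.split("/")) {
      if (component.isEmpty()) {
        continue; // skip the empty piece before the leading slash
      }
      String match = null;
      for (String child : fs.list(current)) {
        if (child.equalsIgnoreCase(component)) {
          match = child; // first match wins if several differ only by case
          break;
        }
      }
      if (match == null) {
        return Optional.empty();
      }
      current = current.equals("/") ? "/" + match : current + "/" + match;
    }
    return Optional.of(current);
  }

  public static void main(String[] args) {
    // A tiny in-memory tree: "/" contains "Foo", "/Foo" contains "Bar".
    Map<String, List<String>> tree = new HashMap<>();
    tree.put("/", Arrays.asList("Foo"));
    tree.put("/Foo", Arrays.asList("Bar"));
    Lister fs = dir -> tree.getOrDefault(dir, Collections.emptyList());

    System.out.println(resolve(fs, "/foo/bar")); // Optional[/Foo/Bar]
    System.out.println(resolve(fs, "/foo/baz")); // Optional.empty
  }
}
```

Two costs of this layering are visible in the sketch: every path
component needs its own directory listing, and nothing stops the
underlying store from holding two names that differ only by case, in
which case the first match is returned.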