Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2006/03/14 08:19:57 UTC

[jira] Created: (HADOOP-80) binary key

binary key
----------

         Key: HADOOP-80
         URL: http://issues.apache.org/jira/browse/HADOOP-80
     Project: Hadoop
        Type: New Feature
  Components: io  
    Versions: 0.1    
    Reporter: Owen O'Malley
 Assigned to: Owen O'Malley 
     Fix For: 0.1
 Attachments: binary-key.patch

I needed a binary key type, so I extended BytesWritable to be comparable also.
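
For readers wondering what "comparable" means for a byte buffer, here is a minimal sketch of lexicographic comparison over the valid prefix of two buffers. It is not taken from binary-key.patch: the class name BinaryKeySketch, the method signature, and the choice to treat each byte as unsigned are all assumptions made for illustration.

```java
/** Hypothetical illustration; not the code from binary-key.patch. */
final class BinaryKeySketch {
  private BinaryKeySketch() {}

  /**
   * Compares the first aLen bytes of a with the first bLen bytes of b
   * in lexicographic order, treating each byte as unsigned (assumption).
   */
  static int compareBytes(byte[] a, int aLen, byte[] b, int bLen) {
    int n = Math.min(aLen, bLen);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    // All shared bytes match, so the shorter key sorts first.
    return aLen - bLen;
  }
}
```

Masking with 0xff matters because Java bytes are signed; without it, 0x80 would sort before 0x01.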

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (HADOOP-80) binary key

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-80?page=comments#action_12370606 ] 

Doug Cutting commented on HADOOP-80:
------------------------------------

If we think that the default Java hash function is bad, then we should probably switch it in all of the places where we hash, e.g., in UTF8, here, etc.  For now I will add a static hashBytes() method to WritableComparable and use it in BytesWritable.  That way, if we someday decide to change the hash function, we'll have a single place to do it.  Allocating a new digester per call to hashCode() seems expensive (although I have not benchmarked this).

MapReduce does use the hash function by default for partitioning, but HashMaps will also use it, so MapReduce is not the only client.
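
A rough sketch of that arrangement: one static helper owns the byte-hashing formula, and anything that needs a hash, such as a partitioner, goes through it. The class name HashSketch and the partitionFor() method are hypothetical and are not Hadoop APIs; only the hashBytes name and the multiply-by-31 loop come from this thread.

```java
/** Hypothetical illustration of a single shared byte-hashing helper. */
final class HashSketch {
  private HashSketch() {}

  /** The one place to change if the hash function is ever replaced. */
  static int hashBytes(byte[] bytes, int length) {
    int hash = 1;
    for (int i = 0; i < length; i++) {
      hash = (31 * hash) + (int) bytes[i];
    }
    return hash;
  }

  /** Hypothetical partitioner: maps a key's hash to [0, numPartitions). */
  static int partitionFor(byte[] key, int length, int numPartitions) {
    // Mask the sign bit so a negative hash cannot yield a negative index.
    return (hashBytes(key, length) & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Because a HashMap would call the same hashCode(), a change made in hashBytes() would reach both kinds of client at once.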




[jira] Closed: (HADOOP-80) binary key

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-80?page=all ]
     
Owen O'Malley closed HADOOP-80:
-------------------------------

    Resolution: Fixed

Doug committed this.




[jira] Commented: (HADOOP-80) binary key

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-80?page=comments#action_12370609 ] 

Doug Cutting commented on HADOOP-80:
------------------------------------

I just committed this as r386219.




[jira] Updated: (HADOOP-80) binary key

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-80?page=all ]

Owen O'Malley updated HADOOP-80:
--------------------------------

    Attachment: binary-key.patch




[jira] Commented: (HADOOP-80) binary key

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-80?page=comments#action_12370406 ] 

Doug Cutting commented on HADOOP-80:
------------------------------------

Overall this looks good.  A couple of questions:

1. Why call setSize(0) in read()?  This looks like a no-op.  Am I missing something?

2. Why bother to use md5 for hashCode()?  That could be expensive.  Why not implement this like java.util.Arrays.hashCode() and UTF8.hashCode():

  public int hashCode() {
    int hash = 1;
    for (int i = 0; i < size; i++)
      hash = (31 * hash) + (int)bytes[i];
    return hash;
  }






[jira] Updated: (HADOOP-80) binary key

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-80?page=all ]

Owen O'Malley updated HADOOP-80:
--------------------------------

    Attachment: binary-key-2.patch

This updated patch handles the case of shrinking the buffer's capacity. It also adds testing code for that situation into the unit test.




Re: [jira] Commented: (HADOOP-80) binary key

Posted by Andrzej Bialecki <ab...@getopt.org>.
Owen O'Malley (JIRA) wrote:
>> 2. Why bother to use md5 for hashCode()?  That could be expensive.  Why not implement this like 
>> java.util.Arrays.hashCode() and UTF8.hashCode():
>>     
>
> Yeah, I considered doing something lighter than md5, but using md5 prevents pathological cases from doing bad things. We also use md5 a lot around here, so it is a really useful default for us, but it might make sense to have a lighter hash alternative. However, since in map/reduce the hash function is only used for partitioning the map output, it seemed better to use a known good hash function than taking a chance on a fast but sloppy hash function.
>   

You can find an FNV hash implementation here: http://www.getopt.org 
(Apache license). Computationally it's similar in complexity to the 
above hashing schemes, but gives much better distribution. Perhaps worth 
a try.
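
For comparison, a minimal sketch of the 32-bit FNV-1a variant. It is written from the published FNV constants rather than copied from the getopt.org implementation, which may use a different variant or width; the class name Fnv1aSketch is hypothetical.

```java
/** Hypothetical illustration of 32-bit FNV-1a over a byte range. */
final class Fnv1aSketch {
  private static final int OFFSET_BASIS = 0x811c9dc5;
  private static final int PRIME = 0x01000193;

  private Fnv1aSketch() {}

  static int hash(byte[] bytes, int length) {
    int hash = OFFSET_BASIS;
    for (int i = 0; i < length; i++) {
      hash ^= (bytes[i] & 0xff);  // fold in the next byte, unsigned
      hash *= PRIME;              // int overflow gives the mod 2^32 wrap
    }
    return hash;
  }
}
```

Like the multiply-by-31 loop, this costs one multiply and one other cheap operation per byte.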

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (HADOOP-80) binary key

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-80?page=comments#action_12370477 ] 

Owen O'Malley commented on HADOOP-80:
-------------------------------------

> 1. Why call setSize(0) in read()? This looks like a no-op. Am I missing something?

Yeah, that is a little subtle and I should have commented it better. Basically, the second setSize() may call setCapacity(), which copies the current data. If the current size is 0, there is nothing to copy. So the extra call doesn't change the user-visible behavior, but it saves a useless copy.
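
The effect can be seen in a small growable buffer. This is a sketch, not the patch itself: the class GrowableBuffer, its copiedBytes counter, the readRecord() method, and the exact growth policy are assumptions. Only the relationship between setSize() and setCapacity() is taken from the explanation above.

```java
/** Hypothetical illustration; not the BytesWritable code from the patch. */
final class GrowableBuffer {
  private byte[] bytes = new byte[0];
  private int size = 0;
  long copiedBytes = 0;  // instrumentation for this example only

  int getSize() {
    return size;
  }

  void setSize(int newSize) {
    if (newSize > bytes.length) {
      setCapacity(newSize);
    }
    size = newSize;
  }

  void setCapacity(int newCapacity) {
    if (newCapacity != bytes.length) {
      byte[] resized = new byte[newCapacity];
      int keep = Math.min(size, newCapacity);
      System.arraycopy(bytes, 0, resized, 0, keep);  // preserves current contents
      copiedBytes += keep;
      bytes = resized;
      size = keep;
    }
  }

  /** Simulates reading a record of the given length into the buffer. */
  void readRecord(int length, boolean resetFirst) {
    if (resetFirst) {
      setSize(0);  // old contents are about to be overwritten anyway
    }
    setSize(length);
    // A real read would now fill bytes[0..length) from the input stream.
  }
}
```

Reading a 10-byte record and then a 100-byte record copies 10 stale bytes without the reset and none with it.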

> 2. Why bother to use md5 for hashCode()?  That could be expensive.  Why not implement this like 
> java.util.Arrays.hashCode() and UTF8.hashCode():

Yeah, I considered doing something lighter than md5, but using md5 prevents pathological cases from doing bad things. We also use md5 a lot around here, so it is a really useful default for us, but it might make sense to have a lighter hash alternative. However, since in map/reduce the hash function is only used for partitioning the map output, it seemed better to use a known good hash function than to take a chance on a fast but sloppy hash function.
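
To make the cost trade-off concrete, here is a sketch of what an md5-based hash could look like. It is not the code from the patch: the class name Md5HashSketch and the choice to fold the first four digest bytes into an int are assumptions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of deriving an int hash from an MD5 digest. */
final class Md5HashSketch {
  private Md5HashSketch() {}

  static int hash(byte[] bytes, int length) {
    try {
      // A new digester is allocated on every call in this sketch.
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(bytes, 0, length);
      byte[] d = md5.digest();
      // Fold the first four digest bytes into an int, big-endian (assumption).
      return ((d[0] & 0xff) << 24)
          | ((d[1] & 0xff) << 16)
          | ((d[2] & 0xff) << 8)
          | (d[3] & 0xff);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

Each call does a full digest computation, which is considerably more work per key than a one-multiply-per-byte loop.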


