Posted to issues@spark.apache.org by "Simeon Simeonov (JIRA)" <ji...@apache.org> on 2015/09/14 19:55:46 UTC

[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

    [ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743922#comment-14743922 ] 

Simeon Simeonov edited comment on SPARK-10574 at 9/14/15 5:55 PM:
------------------------------------------------------------------

[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? HashingTF is typically used on strings, but the implementation is currently not restricted to that.
# If we choose to support them, how will hashing work for non-String types? For example, does a Double get converted to its binary representation first (since the toString() form may not be perfectly accurate), or do we call the imprecision a feature (a tiny bit of LSH to make reasoning more robust)? I believe our goal here should be a rock-solid, deterministic definition that works the same everywhere.
# How do we safely open the door for user-provided hashing functions? This could come at no extra cost: define a simple trait, make the parameter that specifies the hashing function the name of a class implementing that trait, and have Spark ship a MurmurHash3 and a "native" implementation (see the sketch after this list). I tend to prefer this simple pattern to hard-coded decision logic that instantiates a finite set of classes. It would allow Spark users to experiment with xxHash, CityHash, minwise hashing and, if they so choose, even forms of locality-sensitive hashing. New hashes could be contributed to the project or become discoverable on spark-packages. It would also allow for analysis patterns down the road, e.g., a decorator class that analyzes the distribution of collisions as a side effect to help practitioners choose the right hashing function.
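
A minimal sketch of the trait idea in the last item. The names ({{TermHasher}}, {{MurmurTermHasher}}, {{NativeTermHasher}}) are purely illustrative, not an existing Spark API, and the Double case shows one deterministic option: hashing the IEEE-754 bits rather than the toString() form.

{code:scala}
import scala.util.hashing.MurmurHash3

// Hypothetical extension point: a hashing strategy that HashingTF could
// instantiate from a class-name parameter.
trait TermHasher extends Serializable {
  def hash(term: Any): Int
}

// Deterministic MurmurHash3-based implementation; a Double is hashed from
// its binary (IEEE-754) representation instead of its toString() form.
class MurmurTermHasher extends TermHasher {
  override def hash(term: Any): Int = term match {
    case s: String => MurmurHash3.stringHash(s)
    case d: Double =>
      MurmurHash3.bytesHash(java.nio.ByteBuffer.allocate(8).putDouble(d).array())
    case other => MurmurHash3.stringHash(other.toString)
  }
}

// Current ##-based behavior, kept for backward compatibility.
class NativeTermHasher extends TermHasher {
  override def hash(term: Any): Int = term.##
}
{code}

The collision-analysis decorator mentioned above would then simply be another {{TermHasher}} that wraps an existing one and records the indices it produces.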

I'd appreciate your thoughts.

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Priority: Critical
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for {{hashCode}}, the implementation is platform-specific. This means that feature vectors created on one platform may be different from vectors created on another platform. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that, following a hashing transform, features lose human-tractable meaning, so a problem such as this may be extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$], which is also included in the standard Scala library and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If Spark users apply {{HashingTF}} only to very short, dictionary-like strings, the hashing function choice will not be a big problem, but why keep an implementation in MLlib with this limitation when a better one is readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts produced by a previous version, e.g., a saved model. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate such changes in release notes, etc.
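
To make the first problem concrete, here is a minimal, self-contained sketch (not part of the original report; the 2^20 bucket count and the non-negative modulo are only illustrative of how a raw hash is typically mapped to a feature index):

{code:scala}
import scala.util.hashing.MurmurHash3

object HashCompare {
  // Map a raw Int hash into [0, numBuckets), handling negative hashes.
  def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

  def main(args: Array[String]): Unit = {
    val numBuckets = 1 << 20
    val term = "counterrevolutionaries"

    // Index produced by Scala's native hash (what HashingTF relies on via ##).
    val nativeIdx = nonNegativeMod(term.##, numBuckets)

    // Index produced by MurmurHash3 from the standard Scala library.
    val murmurIdx = nonNegativeMod(MurmurHash3.stringHash(term), numBuckets)

    println(s"native=$nativeIdx murmur=$murmurIdx")
  }
}
{code}

Per the Scala documentation cited above, the first index depends on the platform's {{##}} implementation in the general case, while MurmurHash3 is a fixed, documented algorithm, so the second index does not vary across environments.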


