Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/09/14 19:07:45 UTC

[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

    [ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743834#comment-14743834 ] 

Joseph K. Bradley commented on SPARK-10574:
-------------------------------------------

I agree that switching to MurmurHash3 is a good idea.  As for backwards compatibility, I think the best thing we can do is provide a new parameter that lets the user choose the hashing method.  I would vote for defaulting to MurmurHash3, with an option to switch back to the old hashing method (with proper warnings).

We have not really made promises about backwards compatibility for HashingTF, but we will need to start making such promises after adding save/load for Pipelines.  We can include a release note about this change.
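The parameter-based approach described above might be sketched as follows. This is a toy illustration, not the actual Spark API: the class name (SimpleHashingTF), the parameter name (hashAlgorithm), and its string values are all hypothetical, and only the term-to-index step of HashingTF is shown.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: a HashingTF-like transformer with a user-selectable
// hashing method. Defaults to MurmurHash3; "native" keeps the old behavior.
class SimpleHashingTF(numFeatures: Int, hashAlgorithm: String = "murmur3") {

  private def hash(term: String): Int = hashAlgorithm match {
    case "murmur3" => MurmurHash3.stringHash(term)
    case "native"  => term.##   // legacy Scala native hashing
    case other     => throw new IllegalArgumentException(s"Unknown hash algorithm: $other")
  }

  // Map a term to a non-negative bucket index in [0, numFeatures).
  def indexOf(term: String): Int = {
    val raw = hash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
}
```

With this shape, existing users could pin `hashAlgorithm = "native"` to reproduce old feature vectors, while new pipelines get MurmurHash3 by default.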

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Priority: Critical
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for {{hashCode}}, the implementation is platform specific. This means that feature vectors created on one platform may differ from vectors created on another. This can cause significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that, after a hashing transform, features lose their human-tractable meaning, so an issue like this may be extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$], which is also included in the standard Scala library and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn, and others. If Spark users apply {{HashingTF}} only to very short, dictionary-like strings, the choice of hashing function will not be a big problem, but why keep an implementation in MLlib with this limitation when a better one is readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc.
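A collision comparison of the kind referenced in the description can be sketched as below. This is a toy harness over a synthetic vocabulary, not the benchmark from the linked gist; only `scala.util.hashing.MurmurHash3.stringHash` and `##` come from the standard library, everything else is illustrative.

```scala
import scala.util.hashing.MurmurHash3

// Count how many terms collide (share a bucket with an earlier term)
// when hashed into numFeatures buckets with the given hash function.
def collisions(terms: Seq[String], numFeatures: Int, hash: String => Int): Int = {
  val buckets = terms.map { t =>
    val raw = hash(t) % numFeatures
    if (raw < 0) raw + numFeatures else raw  // non-negative bucket index
  }
  terms.size - buckets.distinct.size
}

// Synthetic vocabulary of longer strings, where the difference is claimed to show.
val vocab = (0 until 10000).map(i => s"feature_token_with_a_longer_name_$i")
val numFeatures = 1 << 12

val nativeCollisions = collisions(vocab, numFeatures, _.##)
val murmurCollisions = collisions(vocab, numFeatures, MurmurHash3.stringHash)
```

Note that with 10000 terms and 4096 buckets some collisions are unavoidable for any hash function; what matters is how far each count sits above that pigeonhole floor.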



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org