Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2018/02/20 19:19:00 UTC
[jira] [Updated] (SPARK-23469) HashingTF should use corrected MurmurHash3 implementation
[ https://issues.apache.org/jira/browse/SPARK-23469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-23469:
--------------------------------------
Description:
[SPARK-23381] added a corrected MurmurHash3 implementation but left the old implementation alone. In Spark 2.3 and earlier, HashingTF will use the old implementation. (We should not backport a fix for HashingTF since it would be a major change of behavior.) But we should correct HashingTF in Spark 2.4; this JIRA is for tracking this fix.
* Update HashingTF to use new implementation of MurmurHash3
* Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can add a Param to allow this.
Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml: [SPARK-21748]. We can leave spark.mllib alone and just fix MurmurHash3 in spark.ml.
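The compatibility plan above can be illustrated with a minimal sketch. This is not Spark code: `legacy_hash`, `corrected_hash`, `select_hash`, and `hashing_tf` are hypothetical names, and the two hash functions are stdlib stand-ins (Adler-32 and CRC-32) rather than either MurmurHash3 variant. The sketch only shows the idea of picking the hash function from the version that saved the model, so that term-to-index mappings from older models stay stable.

```python
import zlib
from collections import Counter

# Stand-ins only: neither function is MurmurHash3. They exist so the two
# code paths produce different indices, as the old and corrected hashes would.
def legacy_hash(term: str) -> int:
    return zlib.adler32(term.encode("utf-8"))

def corrected_hash(term: str) -> int:
    return zlib.crc32(term.encode("utf-8"))

def select_hash(saved_with_version: str):
    """Pick the hash function from the (hypothetical) saved version string.

    Models saved by 2.3 or earlier keep the legacy hash; 2.4+ gets the
    corrected one.
    """
    major, minor = (int(p) for p in saved_with_version.split(".")[:2])
    return legacy_hash if (major, minor) < (2, 4) else corrected_hash

def hashing_tf(terms, num_features, hash_fn):
    """Hashing trick: map each term to an index in [0, num_features) and count."""
    return dict(Counter(hash_fn(t) % num_features for t in terms))

if __name__ == "__main__":
    doc = ["spark", "ml", "spark"]
    print(hashing_tf(doc, 16, select_hash("2.3.0")))
    print(hashing_tf(doc, 16, select_hash("2.4.0")))
```

In this sketch the choice is made once at load time, which corresponds to the Param proposed in the second bullet: a loaded model carries its hash choice with it instead of silently switching behavior.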
was:
[SPARK-23381] added a corrected MurmurHash3 implementation but left the old implementation alone. In Spark 2.3 and earlier, HashingTF will use the old implementation. (We should not backport a fix for HashingTF since it would be a major change of behavior.) But we should correct HashingTF in Spark 2.4; this JIRA is for tracking this fix.
* Update HashingTF to use new implementation of MurmurHash3
* Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can add a Param to allow this.
Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml. We can leave spark.mllib alone and just fix MurmurHash3 in spark.ml. I will link a JIRA for this migration.
> HashingTF should use corrected MurmurHash3 implementation
> ---------------------------------------------------------
>
> Key: SPARK-23469
> URL: https://issues.apache.org/jira/browse/SPARK-23469
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Priority: Major