Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2018/09/13 07:59:00 UTC
[jira] [Resolved] (SPARK-25412) FeatureHasher would change the value of output feature
[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Pentreath resolved SPARK-25412.
------------------------------------
Resolution: Not A Bug
> FeatureHasher would change the value of output feature
> ------------------------------------------------------
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.1
> Reporter: Vincent
> Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo of the hashed value determines the vector index, and users are advised to set the 'numFeatures' parameter to a large integer.
> We found several issues with the current implementation:
> # The feature name cannot be recovered from its index after a FeatureHasher transform, for example when reading feature importances from a decision tree trained on FeatureHasher output.
> # When indices collide, which is likely when 'numFeatures' is relatively small, the stored value is replaced by a new value (the sum of the current and old values).
> # To avoid collisions, 'numFeatures' must be set to a large number, but the resulting highly sparse vectors increase the computational cost of model training.
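
The collision behaviour described in the report can be illustrated with a minimal, Spark-free sketch of the hashing trick. Everything in it is an assumption made for illustration: hash_features is a hypothetical helper, zlib.crc32 is a stand-in hash chosen only because it is deterministic, and the index_to_names reverse map is not something FeatureHasher provides. It shows only how a modulo on the hash places several features in one slot and sums their values.

```python
import zlib
from collections import defaultdict


def hash_features(row, num_features):
    """Hashing-trick sketch: index = hash(feature name) % num_features.

    zlib.crc32 is a stand-in hash (an assumption for this example), not the
    hash function used by Spark's FeatureHasher.
    """
    vector = defaultdict(float)         # sparse vector: index -> value
    index_to_names = defaultdict(list)  # hypothetical reverse lookup
    for name, value in row.items():
        idx = zlib.crc32(name.encode("utf-8")) % num_features
        vector[idx] += value            # colliding features are summed
        index_to_names[idx].append(name)
    return dict(vector), dict(index_to_names)


# Three features hashed into two slots: at least two must share an index.
row = {"age": 30.0, "height": 180.0, "weight": 75.0}
vec, names = hash_features(row, num_features=2)
print(vec)    # values of colliding features appear as a single sum
print(names)  # which feature names landed on each index
```

With three features and two slots, at least one slot necessarily holds a sum of several inputs, so the individual values cannot be read back from the output vector. A reverse map like the one above is one way a caller could track which names share an index, at the cost of extra memory.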