Posted to issues@spark.apache.org by "Vincent (JIRA)" <ji...@apache.org> on 2018/09/07 06:24:00 UTC
[jira] [Created] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?
Vincent created SPARK-25365:
-------------------------------
Summary: a better way to handle vector index and sparsity in FeatureHasher implementation ?
Key: SPARK-25365
URL: https://issues.apache.org/jira/browse/SPARK-25365
Project: Spark
Issue Type: Question
Components: ML
Affects Versions: 2.3.1
Reporter: Vincent
In the current implementation of FeatureHasher.transform, a simple modulo on the hashed value is used to determine the vector index, and it is suggested to use a large integer value for the numFeatures parameter.
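The hash-then-modulo index assignment described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Spark's actual Scala implementation (which uses MurmurHash3); CRC32 stands in here as a deterministic hash function, and the function name is hypothetical.

```python
import zlib


def hashed_index(feature_name: str, num_features: int) -> int:
    """Map a feature name to a vector index via hash-then-modulo.

    CRC32 stands in for Spark's MurmurHash3; the point is only that the
    full hash range is folded into [0, num_features) with a modulo, so
    the mapping from name to index is lossy and not invertible.
    """
    return zlib.crc32(feature_name.encode("utf-8")) % num_features
```

Because the modulo discards most of the hash, distinct feature names can land on the same index, which is the root of the collision and invertibility issues discussed below.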
We found several issues with the current implementation:
# The feature name cannot be recovered from its index after the FeatureHasher transform; for example, when extracting feature importances from a decision tree trained on the output of a FeatureHasher.
# When an index collision occurs, which is quite likely when 'numFeatures' is relatively small, the colliding feature's value is replaced by the sum of the new and old values, i.e. the value of that vector entry is silently changed by this module.
# To avoid collisions, 'numFeatures' must be set to a large number, but the resulting highly sparse vectors increase the computational complexity of model training.
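The collision behavior in the list above can be demonstrated with a minimal Python sketch (again using CRC32 as a stand-in hash, and a hypothetical helper name, not Spark's actual code): values whose names hash to the same index are silently summed into one entry.

```python
import zlib


def hash_features(features: dict, num_features: int) -> dict:
    """Accumulate feature values into a sparse vector keyed by hashed index.

    When two names collide on the same index, their values are summed,
    so the original per-feature values can no longer be recovered.
    """
    vec = {}
    for name, value in features.items():
        idx = zlib.crc32(name.encode("utf-8")) % num_features
        # Collision: the existing value is silently added to, not preserved.
        vec[idx] = vec.get(idx, 0.0) + value
    return vec


# With num_features=1 every feature collides on index 0,
# so the two values are merged into a single summed entry.
merged = hash_features({"a": 1.0, "b": 2.0}, num_features=1)
```

Making numFeatures large makes such collisions rare, but at the cost of the vector sparsity problem noted in the third item.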
We are working on fixing these problems for our own business needs. Since this may or may not be an issue for others as well, we would like to hear from the community.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org