Posted to dev@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2015/03/18 15:54:38 UTC
[jira] [Created] (FLINK-1735) Add FeatureHasher to machine learning library
Till Rohrmann created FLINK-1735:
------------------------------------
Summary: Add FeatureHasher to machine learning library
Key: FLINK-1735
URL: https://issues.apache.org/jira/browse/FLINK-1735
Project: Flink
Issue Type: Improvement
Components: Machine Learning Library
Reporter: Till Rohrmann
Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values. The hash of a feature value determines the index of the vector entry it updates. To mitigate possible collisions, a second hash function determines the sign of the update value that is added to that entry. This way, colliding features are likely to cancel each other out instead of accumulating.
A feature hasher would also be helpful for NLP problems, where it could be used to vectorize bag-of-words or n-gram features.
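To make the idea concrete, here is a minimal sketch of the signed hashing trick in Python. It is not a proposal for the Flink API: the names hash_features and _hash, the num_dims parameter, and the use of salted MD5 digests as the two hash functions are all assumptions made for illustration.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    """Deterministic 64-bit hash of a salted string.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used instead. The salt is an illustrative way to
    derive two different hash functions from one.
    """
    digest = hashlib.md5((salt + value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(features, num_dims=16):
    """Map an iterable of string features (e.g. tokens) to a fixed-size vector."""
    vector = [0.0] * num_dims
    for feature in features:
        # First hash: selects the vector entry to update.
        index = _hash(feature, "index:") % num_dims
        # Second hash: selects the sign of the update.
        sign = 1.0 if _hash(feature, "sign:") % 2 == 0 else -1.0
        vector[index] += sign
    return vector


# Bag-of-words usage: every token contributes +1 or -1 to one entry.
tokens = "the quick brown fox jumps over the lazy dog".split()
vector = hash_features(tokens, num_dims=8)
```

The output dimension is fixed up front, so no dictionary of feature names has to be built or shared between workers. Two features that collide on the same index have opposite signs about half the time, in which case their contributions cancel.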
Resources:
[1] https://en.wikipedia.org/wiki/Feature_hashing
[2] http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction