Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/06/29 10:15:04 UTC

[jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605260#comment-14605260 ] 

ASF GitHub Bot commented on FLINK-1735:
---------------------------------------

Github user ChristophAl commented on the pull request:

    https://github.com/apache/flink/pull/665#issuecomment-116517528
  
    Hi,
    
    After rebasing on master and implementing the new pipeline interface, I have some follow-up questions about the types we should accept for feature hashing.
    
    I can think of Iterable[String] and Iterable[(String, Int)] for documents, as well as (Int, Iterable[String]) and (Int, Iterable[(String, Int)]) for documents that carry some kind of index. So I'm not sure whether we need to hash arbitrary types via .hashCode().
    Also note that when the nonNegative parameter is set to true, the output can be used as TF in TF-IDF.
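
To make the four candidate input types above concrete, here is a minimal Scala sketch of how they could all be normalised to (term, count) pairs before hashing. The object and method names (HashingInputs, fromTerms, and so on) are hypothetical and do not come from the pull request; the sketch only assumes that a bare token counts as one occurrence.

```scala
// Hypothetical helper, not part of Flink ML: maps each proposed input
// shape onto a common Iterable[(String, Int)] of (term, count) pairs.
object HashingInputs {
  // Bare tokens: every occurrence contributes a count of 1.
  def fromTerms(doc: Iterable[String]): Iterable[(String, Int)] =
    doc.map(term => (term, 1))

  // Pre-counted terms are already in the common shape.
  def fromCounts(doc: Iterable[(String, Int)]): Iterable[(String, Int)] =
    doc

  // Indexed documents keep their index and normalise the payload.
  def fromIndexedTerms(doc: (Int, Iterable[String])): (Int, Iterable[(String, Int)]) =
    (doc._1, fromTerms(doc._2))

  def fromIndexedCounts(doc: (Int, Iterable[(String, Int)])): (Int, Iterable[(String, Int)]) =
    doc
}
```

Separate method names are used because overloads on Iterable[String] and Iterable[(String, Int)] would have the same erased signature on the JVM.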


> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values. The hash of the feature value is used to calculate its index for a vector entry. In order to mitigate possible collisions, a second hashing function is used to calculate the sign for the update value which is added to the vector entry. This way, it is likely that collisions will simply cancel out.
> A feature hasher would also be helpful for NLP problems, where it could be used to vectorize bag-of-words or n-gram feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
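
The signed hashing trick described in the issue can be sketched in a few lines of Scala. This is an illustration under stated assumptions, not the implementation from the pull request: the names FeatureHashingSketch, hashFeatures, numFeatures, indexSeed and signSeed are hypothetical, MurmurHash3 from the Scala standard library stands in for whichever hash functions the library would use, and nonNegative is assumed to mean "take the absolute value of each entry".

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of the signed hashing trick; not Flink ML code.
object FeatureHashingSketch {
  // Two different seeds give two (practically) independent hash functions.
  private val indexSeed = 0
  private val signSeed = 1

  def hashFeatures(
      doc: Iterable[(String, Int)],
      numFeatures: Int,
      nonNegative: Boolean = false): Array[Double] = {
    require(numFeatures > 0, "numFeatures must be positive")
    val vector = new Array[Double](numFeatures)
    for ((term, count) <- doc) {
      // First hash selects the vector index; floorMod keeps it in [0, numFeatures).
      val index = java.lang.Math.floorMod(MurmurHash3.stringHash(term, indexSeed), numFeatures)
      // Second hash selects the sign, so colliding terms tend to cancel out.
      val sign = if ((MurmurHash3.stringHash(term, signSeed) & 1) == 0) 1.0 else -1.0
      vector(index) += sign * count
    }
    // Assumption: nonNegative takes the absolute value of every entry.
    if (nonNegative) vector.map(v => math.abs(v)) else vector
  }
}
```

For example, FeatureHashingSketch.hashFeatures(Seq(("flink", 3), ("ml", 1)), numFeatures = 16) returns a 16-dimensional vector whose non-zero entries are the signed counts of the terms that hashed to each index.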



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)