Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2016/06/07 13:56:21 UTC
[jira] [Updated] (FLINK-1736) Add CountVectorizer to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-1736:
---------------------------------
Assignee: ROSHANI NAGMOTE (was: Alexander Alexandrov)
> Add CountVectorizer to machine learning library
> -----------------------------------------------
>
> Key: FLINK-1736
> URL: https://issues.apache.org/jira/browse/FLINK-1736
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: ROSHANI NAGMOTE
> Labels: ML, Starter
>
> A {{CountVectorizer}} feature extractor [1] assigns each word occurring in a corpus a unique identifier. With this mapping it can vectorize models such as bag of words or n-grams in an efficient way. The unique identifier assigned to a word acts as its index in a vector. The number of occurrences of the word is stored as the vector value at that index.
> The advantage of the {{CountVectorizer}} over the FeatureHasher is that the mapping of words to indices can be obtained, which makes it easier to understand the resulting feature vectors.
> The {{CountVectorizer}} could be generalized to support arbitrary feature values.
> The {{CountVectorizer}} should be implemented as a {{Transformer}}.
> Resources:
> [1] [http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage]
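
The behaviour described in the issue can be illustrated with a minimal sketch in plain Python. This is not the Flink implementation and does not use the scikit-learn API; the function names fit_vocabulary and transform, the whitespace tokenization, and the sample corpus are assumptions made purely for illustration.

```python
from collections import Counter


def fit_vocabulary(corpus):
    """Assign each distinct word a unique index, in order of first appearance.

    Hypothetical helper: tokenization is a plain whitespace split.
    """
    vocabulary = {}
    for document in corpus:
        for word in document.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def transform(document, vocabulary):
    """Return a dense count vector; words missing from the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.split()).items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector


# Illustrative corpus (an assumption, not taken from the issue).
corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(document, vocabulary) for document in corpus]

# The word-to-index mapping can be inverted, which is what makes the
# resulting vectors interpretable compared to a hashing approach.
index_to_word = {index: word for word, index in vocabulary.items()}
```

Here each vector position corresponds to exactly one word, so index_to_word[0] recovers the word counted at position 0. A hashing-based extractor would not offer this reverse lookup.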
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)