Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2018/10/30 12:51:00 UTC
[jira] [Commented] (SPARK-25441) calculate term frequency in CountVectorizer()
[ https://issues.apache.org/jira/browse/SPARK-25441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668654#comment-16668654 ]
Marco Gaido commented on SPARK-25441:
-------------------------------------
TF has an appropriate transformer. I think this can be closed as Invalid.
> calculate term frequency in CountVectorizer()
> ---------------------------------------------
>
> Key: SPARK-25441
> URL: https://issues.apache.org/jira/browse/SPARK-25441
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.3.1
> Reporter: Xinyong Tian
> Priority: Major
>
> Currently, CountVectorizer() cannot output TF (term frequency). I hope such an option can be added.
> TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf
>
> Example:
> >>> from pyspark.ml.feature import CountVectorizer
> >>> df = spark.createDataFrame(
> ...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
> ...     ["label", "raw"])
> >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> >>> model = cv.fit(df)
> >>> model.transform(df).limit(1).show(truncate=False)
> label  raw        vectors
> 0      [a, b, c]  (3,[0,1,2],[1.0,1.0,1.0])
>
> Instead, I want:
> 0      [a, b, c]  (3,[0,1,2],[0.33,0.33,0.33])
> That is, each vector is divided by its sum (here 3), so that the new vector sums to 1 for every row (document).
>
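
The normalization requested in the description can be sketched in plain Python, without Spark. This is only an illustration of the arithmetic: the `(size, indices, values)` tuple layout and the `term_frequency` helper are hypothetical names chosen to mirror the printed sparse-vector form, not part of any Spark API.

```python
from typing import List, Tuple

# Sparse vector as (size, indices, values), mirroring the printed form
# (3,[0,1,2],[1.0,1.0,1.0]). This tuple layout is an illustrative assumption.
SparseVec = Tuple[int, List[int], List[float]]


def term_frequency(vec: SparseVec) -> SparseVec:
    """Divide each count by the row total so the values sum to 1."""
    size, indices, counts = vec
    total = sum(counts)
    if total == 0:
        # Empty document: return it unchanged to avoid division by zero.
        return vec
    return size, indices, [c / total for c in counts]


# Count vectors for the two documents in the example above.
# The vocabulary index order (a, b, c) is assumed for illustration.
doc0 = (3, [0, 1, 2], [1.0, 1.0, 1.0])  # ["a", "b", "c"]
doc1 = (3, [0, 1, 2], [2.0, 2.0, 1.0])  # ["a", "b", "b", "c", "a"]

tf0 = term_frequency(doc0)
tf1 = term_frequency(doc1)
```

Within Spark ML, applying `Normalizer(p=1.0)` to the `vectors` column is one way to get the same scaling, since count vectors are non-negative and their L1 norm equals their sum. Whether that is the transformer meant in the comment above is an assumption.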