Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/06/24 10:06:42 UTC

[jira] [Resolved] (SPARK-8565) TF-IDF drops records

     [ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8565.
------------------------------
    Resolution: Not A Problem

I can't find that part of the docs, but I assume it refers to something done inside the TF-IDF process. If you count the source (or some other transformation of the source) and only later apply TF-IDF, then even if TF-IDF caches something, it is caching a different view of the data than the one you counted.
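
For illustration only, here is a minimal sketch of one way to avoid relying on zip at all: keep each document id paired with its tokens and transform record by record, so the ids and the vectors always come from the same view of the data. It assumes the input can be expressed as an RDD[(String, Seq[String])] of (document id, tokens); the names TfIdfWithIds and tfidfWithIds are hypothetical and do not come from the report, and this is not necessarily what the reporter's pipeline looked like.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark or of the original report.
object TfIdfWithIds {

  // Assumes docs holds (document id, tokens) pairs.
  def tfidfWithIds(docs: RDD[(String, Seq[String])]): RDD[(String, Vector)] = {
    val hashingTF = new HashingTF()

    // Hash each document's tokens into a term-frequency vector while
    // keeping the id attached. Cache this one view, since it is read
    // twice: once to fit the IDF model and once to rescale the vectors.
    val tf = docs.mapValues(tokens => hashingTF.transform(tokens)).cache()

    // Fit IDF weights on the vectors alone.
    val idfModel = new IDF().fit(tf.values)

    // Rescale each vector individually; the id never leaves its record,
    // so no zip with the original RDD is needed.
    tf.mapValues(v => idfModel.transform(v))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("tfidf-with-ids")
    val sc = new SparkContext(conf)
    try {
      val docs = sc.parallelize(Seq(
        ("doc-1", Seq("spark", "tf", "idf")),
        ("doc-2", Seq("spark", "rdd")),
        ("doc-3", Seq.empty[String])
      ))
      tfidfWithIds(docs).collect().foreach { case (id, vec) =>
        println(s"$id -> $vec")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Because the record count is taken from the same RDD that produces the vectors, a comparison such as docs.count() against tfidfWithIds(docs).count() looks at a single lineage rather than two separately evaluated views.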

> TF-IDF drops records
> --------------------
>
>                 Key: SPARK-8565
>                 URL: https://issues.apache.org/jira/browse/SPARK-8565
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.1
>            Reporter: PJ Van Aeken
>
> When applying TF-IDF on an RDD[Seq[String]] with 1213 records, I get an RDD[Vector] back with only 1204 records. This prevents me from zipping it with the original so I can reattach the document IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
