Posted to issues@spark.apache.org by "Jatin Puri (Jira)" <ji...@apache.org> on 2020/08/19 16:48:00 UTC

[jira] [Created] (SPARK-32662) CountVectorizerModel: Remove requirement for minimum vocabulary size

Jatin Puri created SPARK-32662:
----------------------------------

             Summary: CountVectorizerModel: Remove requirement for minimum vocabulary size
                 Key: SPARK-32662
                 URL: https://issues.apache.org/jira/browse/SPARK-32662
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Jatin Puri


Currently `CountVectorizer.scala` has the following requirement:
{code:java}
require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary."){code}
This constraint is not necessary, however. The model should be able to function even with an empty vocabulary.

Removing the requirement would also make it possible to run the model over empty datasets. HashingTF works fine in such scenarios; CountVectorizer does not.
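
To illustrate the behavior being requested, here is a minimal plain-Scala sketch with no Spark dependency. The names `SparseCounts`, `EmptyVocabSketch`, `fitVocab` and `transform` are hypothetical and are not part of Spark's API; the sketch only assumes that an empty vocabulary should map every document to a zero-length count vector instead of failing.

```scala
// Hypothetical stand-in for a count vector; NOT Spark's Vector type.
// `size` is the vocabulary length, `counts` maps term index -> term count.
final case class SparseCounts(size: Int, counts: Map[Int, Double])

// Hypothetical sketch; NOT Spark's CountVectorizer implementation.
object EmptyVocabSketch {
  // Build a vocabulary from tokenized documents, keeping terms that occur in
  // at least `minDF` documents. No `require(vocab.length > 0, ...)` is applied,
  // so an empty dataset (or a high minDF) simply yields an empty vocabulary.
  // Terms are sorted alphabetically here only to keep the result deterministic.
  def fitVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs
      .flatMap(_.distinct) // each document contributes a term at most once
      .groupBy(identity)
      .collect { case (term, occurrences) if occurrences.size >= minDF => term }
      .toArray
      .sorted

  // Count in-vocabulary tokens; out-of-vocabulary tokens are ignored.
  // With an empty vocabulary the result has size 0 and no entries.
  def transform(vocab: Array[String], doc: Seq[String]): SparseCounts = {
    val index = vocab.zipWithIndex.toMap
    val counts = doc
      .flatMap(token => index.get(token))
      .groupBy(identity)
      .map { case (i, hits) => i -> hits.size.toDouble }
    SparseCounts(vocab.length, counts)
  }
}
```

Under these assumptions, `EmptyVocabSketch.fitVocab(Seq.empty, 1)` returns an empty array, and transforming any document with that vocabulary returns `SparseCounts(0, Map())`, so a downstream pipeline stage would see well-formed (if empty) vectors.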

spark-user discussion reference: [http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)