
Posted to user@spark.apache.org by Jatin Puri <pu...@gmail.com> on 2020/08/19 08:11:16 UTC

Ability to have CountVectorizerModel vocab as empty

Hello,

This is regarding
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244

require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, training `CountVectorizer` on an empty dataset results in the
exception below. But sending it empty data (or having minDF filter out
every term) is a perfectly valid use case.
HashingTF works fine in such scenarios; CountVectorizer doesn't.

Can we remove this constraint? Happy to send a pull request.

java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
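
To make the scenario concrete, here is a minimal sketch in plain Scala of how a minDF filter can leave the vocabulary empty, and what a relaxed check might allow. It is a simplified stand-in and not Spark's implementation: the names `EmptyVocabSketch`, `buildVocab`, `fitStrict` and `transform` are hypothetical, minDF is treated as an absolute document count, and only the `require` line is taken from the source quoted above.

```scala
// Simplified stand-in for vocabulary building; NOT Spark's implementation.
object EmptyVocabSketch {

  // Keep terms that occur in at least minDF documents (sorted for determinism).
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs.flatMap(_.distinct)          // count each term once per document
      .groupBy(identity)
      .collect { case (term, hits) if hits.size >= minDF => term }
      .toArray
      .sorted

  // Mirrors the check quoted above: an empty vocabulary is rejected.
  def fitStrict(docs: Seq[Seq[String]], minDF: Int): Array[String] = {
    val vocab = buildVocab(docs, minDF)
    require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
    vocab
  }

  // Assumed relaxed behaviour: count terms against whatever vocabulary exists,
  // so an empty vocabulary simply yields a zero-length feature vector.
  def transform(vocab: Array[String], doc: Seq[String]): Array[Int] =
    vocab.map(term => doc.count(_ == term))

  def main(args: Array[String]): Unit = {
    // Every term appears in exactly one document, so minDF = 2 removes them all.
    val docs = Seq(Seq("a", "b"), Seq("c"))
    val vocab = buildVocab(docs, minDF = 2)
    println(s"vocab size: ${vocab.length}")
    println(s"strict fit fails: ${scala.util.Try(fitStrict(docs, 2)).isFailure}")
    println(s"features length: ${transform(vocab, Seq("a", "c")).length}")
  }
}
```

With the check removed, downstream stages would just receive zero-length count vectors instead of the fit failing, which is the behaviour being asked for here.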

Re: Ability to have CountVectorizerModel vocab as empty

Posted by Jatin Puri <pu...@gmail.com>.

Thanks Sean for the quick response.

Logged a Jira: https://issues.apache.org/jira/browse/SPARK-32662

Will send a pull request shortly.

Regards,
Jatin

On Wed, Aug 19, 2020 at 6:58 PM Sean Owen <sr...@gmail.com> wrote:

> I think that's true. You're welcome to open a pull request / JIRA to
> remove that requirement.


-- 
Jatin Puri
http://jatinpuri.com

Re: Ability to have CountVectorizerModel vocab as empty

Posted by Sean Owen <sr...@gmail.com>.

I think that's true. You're welcome to open a pull request / JIRA to
remove that requirement.

