Posted to user@spark.apache.org by Shivani Rao <ra...@gmail.com> on 2014/11/26 03:09:42 UTC

IDF model error

Hello Spark fans,

I am trying to use the IDF model available in Spark MLlib to create a
tf-idf representation of an RDD[Vector]. Below I have attached my MWE.

I get the following error:

"java.lang.IndexOutOfBoundsException: 7 not in [-4,4)
at breeze.linalg.DenseVector.apply$mcI$sp(DenseVector.scala:70)
at breeze.linalg.DenseVector.apply(DenseVector.scala:69)
at
org.apache.spark.mllib.feature.IDF$DocumentFrequencyAggregator.add(IDF.scala:81)
"

Any ideas?

Regards,
Shivani

import org.apache.spark.mllib.feature.VectorTransformer
import com.box.analytics.ml.dms.vector.{SparkSparseVector, SparkDenseVector}
import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV}
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.feature._

val doc1s = new IndexedRow(1L, new SSV(4, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
val doc2s = new IndexedRow(2L, new SSV(4, Array(1, 2, 4, 13), Array(0.0, 1.0, 2.0, 0.0)))
val doc3s = new IndexedRow(3L, new SSV(4, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
val doc4s = new IndexedRow(4L, new SSV(4, Array(3, 7, 13, 20), Array(2.0, 0.0, 2.0, 1.0)))

val indata = sc.parallelize(List(doc1s, doc2s, doc3s, doc4s)).map(e => e.vector)

(new IDF()).fit(indata).idf

-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: IDF model error

Posted by Yanbo Liang <ya...@gmail.com>.
Hi Shivani,

In Spark, transformations are lazy operations that define a new RDD, while
actions launch a computation to return a value or write data to external
storage.
So your code only starts executing when it reaches an action, which happens
when IDF.fit invokes DocumentFrequencyAggregator in
org.apache.spark.mllib.feature.IDF. The exception is thrown and the stack
trace printed at that point.
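
As a minimal sketch of this (assuming a spark-shell session with a live
SparkContext named sc; the variable names here are only illustrative):

import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.{SparseVector => SSV, Vector => SparkVector}

// Constructing the vector succeeds: the constructor does not check the
// indices against the declared size, and parallelize() is only a
// transformation, so nothing runs yet.
val bad: SparkVector = new SSV(4, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0))
val rdd = sc.parallelize(Seq(bad))

// fit() aggregates over the RDD, which is an action. Only here does index 7
// hit the size-4 document-frequency buffer and throw the
// java.lang.IndexOutOfBoundsException from your stack trace.
val model = new IDF().fit(rdd)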

2014-11-27 10:30 GMT+08:00 Shivani Rao <ra...@gmail.com>:

> Thanks Yanbo,
>
> I wonder why SSV does not complain when I create it with "new SSV(4,
> Array(1, 3, 5, 7)". Is there no error check for this even in the Breeze
> sparse vector's constructor? That is very strange.
>
> Shivani
>

Re: IDF model error

Posted by Shivani Rao <ra...@gmail.com>.
Thanks Yanbo,

I wonder why SSV does not complain when I create it with "new SSV(4,
Array(1, 3, 5, 7)". Is there no error check for this even in the Breeze
sparse vector's constructor? That is very strange.

Shivani



-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: IDF model error

Posted by Yanbo Liang <ya...@gmail.com>.
Hi Shivani,

You have misunderstood the first parameter of SparseVector.

class SparseVector(
    override val size: Int,
    val indices: Array[Int],
    val values: Array[Double]) extends Vector {
}

The first parameter is the total length of the vector, not the number of
non-zero elements.
So it needs to be greater than the maximum non-zero element index, which is
21 in your case.
The following code works:

val doc1s = new IndexedRow(1L, new SSV(22, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0)))
val doc2s = new IndexedRow(2L, new SSV(22, Array(1, 2, 4, 13), Array(0.0, 1.0, 2.0, 0.0)))
val doc3s = new IndexedRow(3L, new SSV(22, Array(10, 14, 20, 21), Array(2.0, 0.0, 2.0, 1.0)))
val doc4s = new IndexedRow(4L, new SSV(22, Array(3, 7, 13, 20), Array(2.0, 0.0, 2.0, 1.0)))
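
To see what the size parameter means, a quick sketch (same SSV alias as in
your code; the inspection lines are just illustrative):

import org.apache.spark.mllib.linalg.{SparseVector => SSV}

// size = 22 declares a 22-dimensional vector; indices and values only
// list the explicitly stored entries.
val v = new SSV(22, Array(1, 3, 5, 7), Array(1.0, 1.0, 0.0, 5.0))
v.size             // 22: the dimensionality, not the count of stored entries
v.toArray.length   // also 22: unstored positions read back as 0.0

With size covering every index that appears in the data, IDF.fit can
aggregate document frequencies without running past the end of its buffer.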
