Posted to dev@spark.apache.org by "Lizhengbing (bing, BIPA)" <zh...@huawei.com> on 2014/07/08 08:29:55 UTC

Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

1)  I downloaded the imdb data from http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and used it to test LBFGS.
When I ran the example from http://spark.apache.org/docs/latest/mllib-optimization.html, an error occurred:
4/07/07 08:37:27 ERROR Executor: Exception in task ID 2
java.lang.ArrayIndexOutOfBoundsException: -1
         at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231)
         at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216)
         at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
         at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391)
         at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83)
         at breeze.linalg.DenseVector.dot(DenseVector.scala:47)
..................

2)  I found that the imdb data uses zero-based feature indices:
0 0:1 3:1 6208:1 8936:1 8959:1 16434:1 29840:1 29843:1 30274:1 32092:1 63727:1 109302:1 114311:1 114336:1 119637:1 121867:1 143744:1 145106:1 186951:1 216401:1 228548:1 248919:1 251691:1 294713:1 302316:1 307685:1 316421:1 316556:1 317062:1 321771:1 327174:1 364381:1 384514:1 404531:1 414947:1 434235:1 434250:1 462625:1 471013:1 503923:1 511725:1 514582:1 514635:1 519251:1 524274:1 540734:1 556018:1 559036:1 559037:1 559039:1 559341:1 609032:1 644534:1 650763:1 659114:1 666864:1 669778:1 669783:1 669787:1 673083:1

3) If the code "val index = indexAndValue(0).toInt - 1" were changed to "val index = indexAndValue(0).toInt - offset" (where offset is 0 or 1, chosen by the user), then MLUtils.loadLibSVMFile would support both zero-based and one-based index data.
  This would also mean changing the interface of MLUtils.loadLibSVMFile.
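To make the proposal concrete, here is a minimal sketch of per-line parsing with a configurable offset. It is plain Scala with no Spark dependency, and the names LibSVMLine and parse are hypothetical, not part of MLlib. It also assumes space-separated "index:value" items after the label. The require call shows where a zero-based file read with offset 1 would otherwise yield the index -1 seen in the stack trace above.

```scala
// Hypothetical helper, not MLlib API: parses one LIBSVM-style line using a
// caller-chosen offset (1 for standard one-based files, 0 for zero-based files).
object LibSVMLine {
  def parse(line: String, offset: Int): (Double, Array[Int], Array[Double]) = {
    val items = line.trim.split(' ').filter(_.nonEmpty)
    val label = items.head.toDouble
    val (indices, values) = items.tail.map { item =>
      val indexAndValue = item.split(':')
      // Convert the file's index to the zero-based index used internally.
      val index = indexAndValue(0).toInt - offset
      // Fail early instead of producing an out-of-bounds index later.
      require(index >= 0, s"negative index for item '$item'; is the offset wrong?")
      (index, indexAndValue(1).toDouble)
    }.unzip
    (label, indices, values)
  }
}
```

With this sketch, LibSVMLine.parse("0 0:1 3:1 6208:1", 0) keeps the indices as they are, while passing offset 1 for the same line fails the require check rather than silently producing -1.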

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

Posted by "Evan R. Sparks" <ev...@gmail.com>.

As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write
your own loader function, very similar to the existing loadLibSVMFile with
a few characters removed:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L81

You will also likely need to change the logic that determines the
number of features (currently line 95).
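As a rough illustration of the feature-count adjustment, the sketch below derives the number of features from the raw indices in the file. It is plain Scala on local collections rather than an RDD, and FeatureCount and numFeatures are hypothetical names, not the MLlib code at the link above. The idea is that the count is the largest zero-based index plus one, so a one-based file subtracts its offset first.

```scala
// Hypothetical helper, not MLlib API: derives the feature count from the
// indices exactly as they appear in the file.
object FeatureCount {
  // rawIndices: one array of file indices per record.
  // offset: 1 for one-based files, 0 for zero-based files.
  def numFeatures(rawIndices: Seq[Array[Int]], offset: Int): Int = {
    val all = rawIndices.flatMap(_.toSeq)
    // Largest zero-based index plus one; zero features if there are no entries.
    if (all.isEmpty) 0 else all.max - offset + 1
  }
}
```

For a zero-based file whose largest index is 7, this gives 8 features; for a one-based file whose largest index is 8, it also gives 8.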

On Tue, Jul 8, 2014 at 12:22 AM, Sean Owen <so...@cloudera.com> wrote:

> On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
> zhengbing.li@huawei.com> wrote:
>
> >
> > 1)  I download the imdb data from
> > http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> > LBFGS
> > 2)  I find the imdb data are zero-based-index data
> >
>
> Since the method is for parsing the LIBSVM format, and its feature indices
> are always 1-based IIUC, I don't think it would make sense to read 0-based
> indices. It sounds like that input is not properly formatted, unless anyone
> knows to the contrary?
>

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

Posted by Sean Owen <so...@cloudera.com>.

On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
zhengbing.li@huawei.com> wrote:

>
> 1)  I download the imdb data from
> http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> LBFGS
> 2)  I find the imdb data are zero-based-index data
>

Since the method is for parsing the LIBSVM format, and its feature indices
are always 1-based IIUC, I don't think it would make sense to read 0-based
indices. It sounds like that input is not properly formatted, unless anyone
knows to the contrary?
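If converting the input is the preferred route, a small preprocessing step could shift every index by one before the file is loaded. The following is only a sketch in plain Scala. ShiftIndices and toOneBased are hypothetical names, and it assumes non-empty lines made of a label followed by space-separated "index:value" items.

```scala
// Hypothetical helper: rewrites one zero-based line into the one-based form
// expected by a standard LIBSVM reader. The label and values are left untouched.
object ShiftIndices {
  def toOneBased(line: String): String = {
    val items = line.trim.split(' ').filter(_.nonEmpty)
    val shifted = items.tail.map { item =>
      // Split only at the first ':' so the value text is kept verbatim.
      val Array(index, value) = item.split(":", 2)
      s"${index.toInt + 1}:$value"
    }
    (items.head +: shifted).mkString(" ")
  }
}
```

Applied to the start of the sample record from the original post, "0 0:1 3:1 6208:1" becomes "0 1:1 4:1 6209:1".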