Posted to issues@spark.apache.org by "John Canny (JIRA)" <ji...@apache.org> on 2015/04/20 17:57:59 UTC

[jira] [Closed] (SPARK-6864) Spark's Multilabel Classifier runs out of memory on small datasets

     [ https://issues.apache.org/jira/browse/SPARK-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Canny closed SPARK-6864.
-----------------------------
    Resolution: Not A Problem

> Spark's Multilabel Classifier runs out of memory on small datasets
> ------------------------------------------------------------------
>
>                 Key: SPARK-6864
>                 URL: https://issues.apache.org/jira/browse/SPARK-6864
>             Project: Spark
>          Issue Type: Test
>          Components: MLlib
>    Affects Versions: 1.2.1
>         Environment: EC2 with 8-96 instances up to r3.4xlarge
> The test fails on every configuration
>            Reporter: John Canny
>             Fix For: 1.2.1
>
>
> When trying to run Spark's MultiLabel classifier (LogisticRegressionWithLBFGS) on the RCV1 V2 dataset (about 0.5 GB, 100 labels), the classifier runs out of memory. The number of tasks per executor doesn't seem to matter: it happens even with a single task per 120 GB executor. The dataset is the concatenation of the test files from the "rcv1v2 (topics; full sets)" group here:
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
> Here's the code:
> import org.apache.spark.SparkContext
> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> import org.apache.spark.mllib.optimization.L1Updater
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.util.MLUtils
> import scala.compat.Platform._
> val nnodes = 8
> val t0 = currentTime
> // Load training data in LIBSVM format.
> val train = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1train.libsvm", true, 276544, nnodes)
> val test = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1test.libsvm", true, 276544, nnodes)
> val t1 = currentTime
> val lrAlg = new LogisticRegressionWithLBFGS()
> lrAlg.setNumClasses(100).optimizer.
>   setNumIterations(10).
>   setRegParam(1e-10).
>   setUpdater(new L1Updater)
> // Run training algorithm to build the model
> val model = lrAlg.run(train)
> val t2 = currentTime
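
As a rough sanity check on the sizes involved in the report above, the sketch below estimates how large one dense copy of the model weights would be. It is plain Scala and needs no Spark. The feature count (276544) and class count (100) are taken from the code in the report; the layout of one Double weight per feature for each of (numClasses - 1) linear predictors is an assumption about the multinomial implementation, not something the report states, and the helper name denseWeightBytes is made up for illustration.

```scala
// Rough size of one dense multinomial weight vector; plain Scala, no Spark needed.
// Assumption (not stated in the report): the model keeps one Double weight per
// feature for each of (numClasses - 1) linear predictors, with no intercepts.
val numFeatures = 276544L     // from the loadLibSVMFile calls in the report
val numClasses = 100L         // from setNumClasses(100) in the report
val bytesPerDouble = 8L

// Hypothetical helper: bytes needed for one dense copy of the weights.
def denseWeightBytes(features: Long, classes: Long): Long =
  features * (classes - 1) * bytesPerDouble

val weightCount = numFeatures * (numClasses - 1)
val weightBytes = denseWeightBytes(numFeatures, numClasses)
val weightMiB = weightBytes.toDouble / (1024 * 1024)

println(f"weights: $weightCount%d, one dense copy: $weightMiB%.1f MiB")
```

L-BFGS keeps a history of correction vectors in addition to the current weights and gradient, so the real footprint would be some multiple of this figure; how large a multiple depends on settings that the report does not show.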


