Posted to user@spark.apache.org by Kyle Ellrott <ke...@soe.ucsc.edu> on 2014/06/19 20:21:24 UTC

Parallel LogisticRegression?

I'm working on a problem that involves learning several different sets of
responses against the same set of training features. Right now I've written the
program to cycle through all of the different label sets, attach each one to
the training data, and run LogisticRegressionWithSGD on it, i.e.:

foreach curResponseSet in allResponses:
     currentRDD: RDD[LabeledPoint] = curResponseSet joined with trainingData
     LogisticRegressionWithSGD.train(currentRDD)
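
A concrete Scala version of that loop might look like the sketch below. The
data layout (RDDs keyed by a shared sample id) and all names are assumptions
for illustration, not from the original message:

    import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Assumed layout: trainingData is RDD[(Long, Vector)] keyed by sample id,
    // and each response set is RDD[(Long, Double)] with the labels for one task.
    for (curResponseSet <- allResponses) {
      val currentRDD: RDD[LabeledPoint] = curResponseSet.join(trainingData)
        .map { case (_, (label, features)) => LabeledPoint(label, features) }
      val model = LogisticRegressionWithSGD.train(currentRDD, 100 /* iterations */)
    }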


Each of the different training runs is independent, so it seems like I should
be able to parallelize them as well.
Is there a better way to do this?


Kyle

Re: Parallel LogisticRegression?

Posted by Kyle Ellrott <ke...@soe.ucsc.edu>.
It looks like I was running into
https://issues.apache.org/jira/browse/SPARK-2204
The issues went away when I switched to coarse-grained Mesos mode
(spark.mesos.coarse).
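
For reference, a minimal sketch of enabling coarse-grained Mesos mode via
SparkConf; the master URL and app name here are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // Coarse-grained mode holds long-running executors on Mesos instead of
    // launching one Mesos task per Spark task (the fine-grained path).
    val conf = new SparkConf()
      .setMaster("mesos://host:5050")           // hypothetical master URL
      .setAppName("ParallelLogisticRegression") // hypothetical app name
      .set("spark.mesos.coarse", "true")
    val sc = new SparkContext(conf)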

Kyle


On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott <ke...@soe.ucsc.edu>
wrote:

> I've tried to parallelize the separate regressions using
> allResponses.toParArray.map(x => /* run logistic regression against the labels in x */)
> But I start to see messages like
> 14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4)
> 14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
> and finally
> 14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4 times; aborting job
>
> Then
> 14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error null; shutting down SparkContext
> 14/06/20 10:10:26 ERROR actor.OneForOneStrategy: java.lang.UnsupportedOperationException
> at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
> at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
> at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>
>
> This doesn't happen when I don't use toParArray. I read that Spark was
> thread-safe, but I seem to be running into problems. Am I doing something
> wrong?
>
> Kyle
>
>
>
> On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott <ke...@soe.ucsc.edu>
> wrote:
>
>>
>> I'm working on a problem that involves learning several different sets of
>> responses against the same set of training features. Right now I've written
>> the program to cycle through all of the different label sets, attach each
>> one to the training data, and run LogisticRegressionWithSGD on it, i.e.:
>>
>> foreach curResponseSet in allResponses:
>>      currentRDD: RDD[LabeledPoint] = curResponseSet joined with trainingData
>>      LogisticRegressionWithSGD.train(currentRDD)
>>
>>
>> Each of the different training runs is independent, so it seems like I
>> should be able to parallelize them as well.
>> Is there a better way to do this?
>>
>>
>> Kyle
>>
>
>

Re: Parallel LogisticRegression?

Posted by Kyle Ellrott <ke...@soe.ucsc.edu>.
I've tried to parallelize the separate regressions using
allResponses.toParArray.map(x => /* run logistic regression against the labels in x */)
But I start to see messages like
14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4)
14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
and finally
14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4 times; aborting job

Then
14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error null; shutting down SparkContext
14/06/20 10:10:26 ERROR actor.OneForOneStrategy: java.lang.UnsupportedOperationException
at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)


This doesn't happen when I don't use toParArray. I read that Spark was
thread-safe, but I seem to be running into problems. Am I doing something
wrong?
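
For comparison, here is a self-contained sketch of the concurrent-driver
pattern being described, using .par (equivalent to the toParArray call above);
the data layout and names are assumptions, not from this thread:

    import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Assumed layout: trainingData is RDD[(Long, Vector)] keyed by sample id,
    // and each response set is RDD[(Long, Double)]. Each parallel map element
    // submits its own training jobs from a separate driver thread; the Spark
    // scheduler is designed to accept job submissions from multiple threads.
    val models = allResponses.par.map { curResponseSet =>
      val data: RDD[LabeledPoint] = curResponseSet.join(trainingData)
        .map { case (_, (label, features)) => LabeledPoint(label, features) }
      LogisticRegressionWithSGD.train(data, 100 /* numIterations */)
    }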

Kyle



On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott <ke...@soe.ucsc.edu>
wrote:

>
> I'm working on a problem that involves learning several different sets of
> responses against the same set of training features. Right now I've written
> the program to cycle through all of the different label sets, attach each
> one to the training data, and run LogisticRegressionWithSGD on it, i.e.:
>
> foreach curResponseSet in allResponses:
>      currentRDD: RDD[LabeledPoint] = curResponseSet joined with trainingData
>      LogisticRegressionWithSGD.train(currentRDD)
>
>
> Each of the different training runs is independent, so it seems like I
> should be able to parallelize them as well.
> Is there a better way to do this?
>
>
> Kyle
>