Posted to user@spark.apache.org by Sa...@wellsfargo.com on 2015/07/13 23:30:01 UTC

MLLIB RDD segmentation for logistic regression

Hello all,

I have one big RDD with a column of group labels: A1, A2, B1, B2, B3, C1, D1, ..., XY.
From it, I use map() to build an RDD[LabeledPoint] with dense vectors, which I later pass to Logistic Regression, since it takes an RDD[LabeledPoint].
I would like to run a separate logistic regression for each of these N groups (the group label is NOT one of the features used in the model itself), but I could not find a proper way to do it.

1.      Can't programmatically create sub-RDDs with a loop: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations;

2.      Can't create the RDDs manually with split(), since the number of groups is unknown and large

3.      Pair RDDs seemed a tempting choice, with their reduce/combine/values by-key functions, but none of them returns a data type usable as an RDD[LabeledPoint], which is what Logistic Regression later takes as input. Any programmatic way of getting sub-RDDs brings me back to item 1.
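
For reference, the exception in item 1 is raised when an RDD is used inside another transformation. A loop that runs on the driver itself should not hit it. Below is a minimal sketch of that idea in Python, not a tested solution. It assumes a pair RDD of (group, LabeledPoint); the names train_per_group, keyed_points and train_fn are hypothetical, and only the standard RDD methods cache(), keys(), distinct(), collect(), filter() and values() are used.

```python
def train_per_group(keyed_points, train_fn):
    """Fit one model per group key with a loop that runs on the driver.

    keyed_points: pair RDD of (group, LabeledPoint)  (assumed layout)
    train_fn:     callable taking an RDD[LabeledPoint] and returning a model
    """
    keyed_points.cache()  # every group below rescans this RDD
    # Only the distinct keys are brought back to the driver, not the data.
    groups = keyed_points.keys().distinct().collect()
    models = {}
    for g in groups:
        # The default argument pins the current key inside the closure.
        subset = keyed_points.filter(lambda kv, g=g: kv[0] == g).values()
        models[g] = train_fn(subset)
    return models
```

With MLlib, train_fn could presumably be LogisticRegressionWithLBFGS.train. Note that this makes one pass over the data per group, so it may be slow when the number of groups is large.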

The logit is a simple model with a binary dependent variable and n features; I just need to run one logit for each group.
There may be some mathematically equivalent way to run this as one big regression, but so far I'm out of ideas.
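
One candidate for such an equivalent is to interact every feature with the group: each row's features, plus a constant 1 acting as a per-group intercept, go into a block of the vector reserved for that row's group, and all other blocks stay zero. The log-likelihood then splits into one independent sum per group, so, at least without regularization and without a shared intercept, the pooled fit should give the same coefficients as N separate logits. The sketch below only builds the expanded vector; expand_by_group and the integer group_index are hypothetical names, and mapping group labels to indices is assumed to be done elsewhere.

```python
def expand_by_group(group_index, features, num_groups):
    """Place [1.0] + features into the block owned by group_index.

    Returns (size, indices, values), the three pieces a sparse vector needs.
    The leading 1.0 is the per-group intercept, so the pooled model
    should be fit without a shared intercept.
    """
    if not 0 <= group_index < num_groups:
        raise ValueError("group_index out of range")
    block = [1.0] + [float(x) for x in features]
    width = len(block)                 # n features + 1 intercept slot
    offset = group_index * width       # start of this group's block
    indices = list(range(offset, offset + width))
    return num_groups * width, indices, block
```

The returned triple follows the argument order of Vectors.sparse(size, indices, values). The expanded vector has N * (n + 1) entries, which is why a sparse representation is assumed here.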

Saif