
Posted to user@spark.apache.org by Josh Goldsborough <jo...@gmail.com> on 2018/03/26 20:46:07 UTC

[Spark R]: Linear Mixed-Effects Models in Spark R

The company I work for is trying to do some mixed-effects regression
modeling in our new big data platform including SparkR.

We can run it via SparkR's support for native R and use lme4, but it runs
single-threaded.  So we're looking for tricks/techniques to process large
data sets.


This was asked a couple years ago:
https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology

But I wanted to ask again, in case anyone had an answer now.

Thanks,
Josh Goldsborough

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Posted by Felix Cheung <fe...@hotmail.com>.

If your data can be split into groups, you can call into your favorite R package on each group of data (in parallel) using gapply:

https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect
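To make the split-and-apply idea concrete, here is a minimal sketch in plain Python rather than SparkR. Everything in it is an assumption for illustration: the names `fit_line` and `fit_per_group`, the toy data, and the thread pool, which only stands in for Spark executors. It fits an ordinary least-squares line per group. Note that independent per-group fits are not the same thing as one mixed-effects model over all groups; the sketch only shows the mechanics of grouping and applying a function to each group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on one group."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_group(rows, max_workers=4):
    """Split (group, x, y) rows by group and fit each group independently."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    keys = sorted(groups)
    # The thread pool is a stand-in for Spark executors (an assumption of this
    # sketch); each group is handed to the fitting function on its own.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fits = list(pool.map(fit_line, (groups[k] for k in keys)))
    return dict(zip(keys, fits))


# Hypothetical toy data: group "a" follows y = 1 + 2x, group "b" follows y = 5 - x.
rows = [("a", x, 1 + 2 * x) for x in range(5)] + [("b", x, 5 - x) for x in range(5)]
per_group = fit_per_group(rows)  # maps group -> (intercept, slope)
```

In SparkR the grouping and distribution would be handled by gapply, and the per-group function would call the R package of your choice instead of the hand-written fit above.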


Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Posted by Nisha Muktewar <ni...@cloudera.com>.

Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml

One of the caveats is/was that the input data has to be in Avro in a
specific format.


Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Posted by Jörn Franke <jo...@gmail.com>.

SparkR does not mean that all R libraries are magically executed in a distributed fashion that scales with the data. In fact, the same is true of many other analytical software packages: they offer the possibility to run things in parallel, but the libraries themselves do not use it. The reason is that it is very hard to write ML algorithms that scale correctly in a distributed fashion.
What choices do you have now?
1) Work only with a small random sample from the population to train your model. This is recommended for large datasets anyway, because you usually have to evaluate many different algorithms and parameters in parallel. If you do this all the time on the full dataset, you push your platform to its limit. The golden rule is that, at most, you evaluate only good models (tested beforehand on a smaller dataset) on the larger dataset. Note that this approach only works if your use case allows it (some algorithms simply require a lot of data to work).
2) Use the R bindings for Spark MLlib and implement your model yourself by leveraging some of the functionality there:
https://spark.apache.org/docs/latest/ml-guide.html
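
For option 1, one possible way to draw a uniform random sample from data that is too large to hold in memory is reservoir sampling, which needs only a single pass. The sketch below is plain Python and is an assumption for illustration, not something proposed in this thread: the function name `reservoir_sample`, the sample size, and the toy stream are all hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream standing in for a large dataset.
stream = (("row", i) for i in range(100_000))
training_sample = reservoir_sample(stream, k=1_000, seed=42)
```

The small sample can then be passed to a single-threaded fitter such as lme4 to compare candidate models, with only the promising ones re-run on more data. On the platform itself, Spark DataFrames have their own sample operation, which would likely be the more natural choice.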
