Posted to user@spark.apache.org by Robineast <Ro...@xense.co.uk> on 2016/11/03 18:07:04 UTC

Re: mLIb solving linear regression with sparse inputs

Is there any reason why you can’t use the built-in linear regression, e.g. http://spark.apache.org/docs/latest/ml-classification-regression.html#regression or http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression?
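
If the built-in regression is not an option, the iterative route (option 2 in the question quoted below) can be sketched without ever forming or inverting A-transpose * A. The following is a minimal single-machine sketch in plain Java, not Spark code and not anything from MLlib: it applies conjugate gradient to the normal equations (often called CGLS), using only products with A and A-transpose on (row, column, value) triplets. The class name, method names, and parameters are hypothetical, chosen for illustration.

```java
/** Single-machine sketch of CGLS: minimizes ||A x - b|| without forming A-transpose * A. */
final class SparseLeastSquares {

    /** Returns A * x, with A given as (rows[k], cols[k], vals[k]) triplets. */
    static double[] multiply(int[] rows, int[] cols, double[] vals, double[] x, int numRows) {
        double[] y = new double[numRows];
        for (int k = 0; k < vals.length; k++) {
            y[rows[k]] += vals[k] * x[cols[k]];
        }
        return y;
    }

    /** Returns A-transpose * y for the same triplets. */
    static double[] multiplyTranspose(int[] rows, int[] cols, double[] vals, double[] y, int numCols) {
        double[] x = new double[numCols];
        for (int k = 0; k < vals.length; k++) {
            x[cols[k]] += vals[k] * y[rows[k]];
        }
        return x;
    }

    /** Dot product of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Iterates until the norm of A-transpose * (b - A x) drops below tol or maxIter is reached. */
    static double[] solve(int[] rows, int[] cols, double[] vals, double[] b,
                          int numRows, int numCols, int maxIter, double tol) {
        double[] x = new double[numCols];                              // start from x = 0
        double[] r = b.clone();                                        // residual b - A x
        double[] s = multiplyTranspose(rows, cols, vals, r, numCols);  // A-transpose * r
        double[] p = s.clone();                                        // search direction
        double gamma = dot(s, s);
        for (int iter = 0; iter < maxIter && Math.sqrt(gamma) > tol; iter++) {
            double[] q = multiply(rows, cols, vals, p, numRows);
            double qq = dot(q, q);
            if (qq == 0.0) {
                break;                                                 // p lies in the null space of A
            }
            double alpha = gamma / qq;
            for (int j = 0; j < numCols; j++) {
                x[j] += alpha * p[j];
            }
            for (int i = 0; i < numRows; i++) {
                r[i] -= alpha * q[i];
            }
            s = multiplyTranspose(rows, cols, vals, r, numCols);
            double gammaNew = dot(s, s);
            double beta = gammaNew / gamma;
            for (int j = 0; j < numCols; j++) {
                p[j] = s[j] + beta * p[j];
            }
            gamma = gammaNew;
        }
        return x;
    }
}
```

For example, with A = [[1, 0], [0, 2], [1, 1]] and b = (1, 4, 3), calling solve with rows {0, 1, 2, 2}, cols {0, 1, 0, 1} and vals {1, 2, 1, 1} should return a vector close to (1, 2). In a Spark setting, the two matrix-vector products are the parts that would need to be distributed.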

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

> On 3 Nov 2016, at 16:08, im281 [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> I want to solve the linear regression problem using Spark with huge matrices: 
> 
> Ax = b 
> using least squares: 
> x = Inverse(A-transpose * A) * A-transpose * b 
> 
> The A matrix is a large sparse matrix (as is the b vector). 
> 
> I have pondered several solutions to the Ax = b problem including: 
> 
> 1) solving the problem above directly: transpose A, multiply A-transpose by A, take the inverse, multiply by A-transpose, and then multiply by b, which gives the solution vector x 
> 
> 2) using an iterative solver (no need to take the inverse) 
> 
> My question is:
> 
> What is the best way to solve this problem using the MLlib library in Java, with RDDs and Spark? 
> 
> Is there any code as an example? Has anyone done this? 
> 
> 
> 
> 
> 
> The code below reads the data as a coordinate matrix and performs the transposition and multiplication, but with this strategy I would still need to take the inverse: 
> 
> import org.apache.spark.api.java.JavaRDD; 
> import org.apache.spark.api.java.function.Function; 
> import org.apache.spark.mllib.linalg.distributed.BlockMatrix; 
> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix; 
> import org.apache.spark.mllib.linalg.distributed.MatrixEntry; 
> 
> // Read the coordinate matrix from text or database (sc is the JavaSparkContext) 
> JavaRDD<String> fileA = sc.textFile(file); 
> 
> // Map each "row,column,value" line (sparse matrix) to a MatrixEntry 
> JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() { 
>     public MatrixEntry call(String line) { 
>         String[] fields = line.split(","); 
>         long i = Long.parseLong(fields[0].trim()); 
>         long j = Long.parseLong(fields[1].trim()); 
>         double value = Double.parseDouble(fields[2].trim()); 
>         return new MatrixEntry(i, j, value); 
>     } 
> }); 
> 
> // Coordinate matrix from sparse data 
> CoordinateMatrix cooMatrixA = new CoordinateMatrix(matrixA.rdd()); 
> 
> // Create block matrix 
> BlockMatrix matA = cooMatrixA.toBlockMatrix(); 
> 
> // A-transpose * A (square matrix) 
> BlockMatrix ata = matA.transpose().multiply(matA); 
> 
> // The prints below collect the whole matrix to the driver, so they only suit small matrices 
> // Print out the original matrix in dense form 
> System.out.println(matA.toLocalMatrix().toString()); 
> 
> // Print out the transpose 
> System.out.println(matA.transpose().toLocalMatrix().toString()); 
> 
> // Print out the square matrix (after multiplication) 
> System.out.println(ata.toLocalMatrix().toString()); 
> 
> // Entries of A-transpose * A, back in coordinate form 
> JavaRDD<MatrixEntry> entries = ata.toCoordinateMatrix().entries().toJavaRDD(); 
> 
> 
> 

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mLIb-solving-linear-regression-with-sparse-inputs-tp28006p28007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: mLIb solving linear regression with sparse inputs

Posted by Robineast <Ro...@xense.co.uk>.

Well, I did eventually write this code in Java, and it was very long! See
https://github.com/insidedctm/sparse-linear-regression




Re: mLIb solving linear regression with sparse inputs

Posted by im281 <im...@gmail.com>.


Thank you! Would you happen to have this code in Java?

This is extremely helpful!

Iman







Re: mLIb solving linear regression with sparse inputs

Posted by im281 <im...@gmail.com>.

Hi Robin,
It looks like the linear regression model takes in a dataset, not a matrix?
It would be helpful for this example if you could set up the whole problem
end to end, using one of the columns of the matrix as b. So A is a sparse
matrix and b is a sparse vector.
Best regards.
Iman


Re: mLIb solving linear regression with sparse inputs

Posted by im281 <im...@gmail.com>.

Also in Java as well. Thanks again!
Iman


Re: mLIb solving linear regression with sparse inputs

Posted by Robineast <Ro...@xense.co.uk>.

Here’s a way of creating sparse vectors in MLlib:

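import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Parse each "row,col,value" line into a (row, col, value) triple
val rdd = sc.textFile("A.txt").map(line => line.split(",")).
     map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))

// Key each entry by its row index
val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))

// combineByKey functions that collect the column indices and values of each row
val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) => (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)

// One sparse vector per row (3 is the number of columns in the example).
// The (index, value) Seq overload of Vectors.sparse sorts by index, which matters
// because combineByKey does not guarantee the order of columns within a row.
val A = pairRdd.combineByKey(create,combine,merge).map(el => Vectors.sparse(3, el._2._1.zip(el._2._2).toSeq))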

If you have a separate file of b’s then you would need to manipulate this slightly to join the b’s to the A RDD and then create LabeledPoints. I guess there is a way of doing this using the newer ML interfaces but it’s not particularly obvious to me how.
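
The join described above can be illustrated without Spark. This is a minimal plain-Java sketch, assuming both files use the "row,col,value" format from the question and that a row missing from b has label 0 (since b is sparse); the class and method names are hypothetical. Each resulting row carries what a LabeledPoint would need: a label plus sorted column indices and their values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch (no Spark): group "row,col,value" entries by row and attach labels. */
final class SparseRowJoin {

    /** A label b_i together with the sparse contents of row i of A. */
    static final class LabeledRow {
        final int row;
        final double label;
        final int[] indices;    // column indices, strictly increasing
        final double[] values;  // values[k] belongs to column indices[k]

        LabeledRow(int row, double label, int[] indices, double[] values) {
            this.row = row;
            this.label = label;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Parses the lines of A and groups them by row; the inner TreeMap keeps columns sorted. */
    static Map<Integer, TreeMap<Integer, Double>> groupByRow(List<String> lines) {
        Map<Integer, TreeMap<Integer, Double>> rows = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            int row = Integer.parseInt(parts[0].trim());
            int col = Integer.parseInt(parts[1].trim());
            double value = Double.parseDouble(parts[2].trim());
            rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
        }
        return rows;
    }

    /** Parses the lines of b into row -> value; the column field is ignored. */
    static Map<Integer, Double> parseLabels(List<String> lines) {
        Map<Integer, Double> labels = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            labels.put(Integer.parseInt(parts[0].trim()), Double.parseDouble(parts[2].trim()));
        }
        return labels;
    }

    /** Joins on row index; a row of A with no entry in b gets label 0.0 (assumption: b is sparse). */
    static List<LabeledRow> join(Map<Integer, TreeMap<Integer, Double>> rows,
                                 Map<Integer, Double> labels) {
        List<LabeledRow> out = new ArrayList<>();
        for (Map.Entry<Integer, TreeMap<Integer, Double>> entry : rows.entrySet()) {
            TreeMap<Integer, Double> cols = entry.getValue();
            int[] indices = new int[cols.size()];
            double[] values = new double[cols.size()];
            int k = 0;
            for (Map.Entry<Integer, Double> c : cols.entrySet()) {
                indices[k] = c.getKey();
                values[k] = c.getValue();
                k++;
            }
            double label = labels.getOrDefault(entry.getKey(), 0.0);
            out.add(new LabeledRow(entry.getKey(), label, indices, values));
        }
        return out;
    }
}
```

Note that rows of A with no non-zero entries never appear in the input, so they are absent from the result as well.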

One point: in the example you give, the b’s are exactly the same as column 2 of the A matrix. I presume this is just a quickly hacked-together example, because that would give a trivial result.
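
A quick way to see why the result is trivial: with b equal to column 2 of A, the vector x = (0, 0, 1) solves Ax = b exactly. Here is a small plain-Java check on the sample data from the question (the class and method names are hypothetical, and row 2 of b is taken as 0 because it is absent from the sparse listing):

```java
/** Plain-Java check (no Spark) that x = (0, 0, 1) reproduces b for the sample data. */
final class TrivialSolutionCheck {

    // Sample matrix A as (row, col, value) triplets, copied from the question.
    static final int[] ROWS = {0, 0, 0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4};
    static final int[] COLS = {0, 1, 2, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2};
    static final double[] VALS = {.42, .28, .89, .83, .34, .42, .23, .42, .98, .88, .23, .36, .97};

    // Sample b as a dense vector; row 2 is not listed in the question, so it is 0.
    static final double[] B = {.89, .42, 0.0, .88, .97};

    /** Returns A * x for the sample matrix. */
    static double[] multiply(double[] x) {
        double[] y = new double[B.length];
        for (int k = 0; k < VALS.length; k++) {
            y[ROWS[k]] += VALS[k] * x[COLS[k]];
        }
        return y;
    }

    /** Returns the largest absolute entry of A * x - b. */
    static double maxResidual(double[] x) {
        double[] y = multiply(x);
        double max = 0.0;
        for (int i = 0; i < B.length; i++) {
            max = Math.max(max, Math.abs(y[i] - B[i]));
        }
        return max;
    }
}
```

Calling maxResidual with x = (0, 0, 1) picks out column 2 of A and compares it with b, so the residual should be zero; any regression on this data would simply recover that unit vector.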

> On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> I would like to use it. But how do I do the following 
> 1) Read sparse data (from text or database) 
> 2) pass the sparse data to the linearRegression class? 
> 
> For example: 
> 
> Sparse matrix A 
> row, column, value 
> 0,0,.42 
> 0,1,.28 
> 0,2,.89 
> 1,0,.83 
> 1,1,.34 
> 1,2,.42 
> 2,0,.23 
> 3,0,.42 
> 3,1,.98 
> 3,2,.88 
> 4,0,.23 
> 4,1,.36 
> 4,2,.97 
> 
> Sparse vector b 
> row, column, value 
> 0,2,.89 
> 1,2,.42 
> 3,2,.88 
> 4,2,.97 
> 
> Solve Ax = b??? 
> 
> 
> 