Posted to user@spark.apache.org by jshao <ja...@gmail.com> on 2014/01/31 19:49:35 UTC

MLLib Sparse Input

Hi,

Spark is absolutely amazing for machine learning because its iterative
process is super fast. However, one big issue I ran into is that the MLlib
API isn't suitable for sparse inputs at all, because it requires the
feature vector to be a dense array.

For example, I currently want to run a logistic regression on data that is
wide and sparse (each data point might have 3 million fields, with most of
them being 0). It is impractical to represent each data point as a dense
array of length 3 million.
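To make the shape of the data concrete, here is a rough sketch (the names
are made up for illustration, not MLlib API) of storing such a point as
aligned index/value arrays instead of a dense array:

```scala
// Hypothetical sketch: a point with 3 million possible features,
// but only the non-zero (index, value) pairs actually stored.
val numFeatures = 3000000
val indices = Array(7, 421, 1999999)   // sorted positions of non-zero features
val values  = Array(1.0, 0.5, 2.0)     // aligned with `indices`

// Look up a feature by binary search over the sorted indices;
// anything not stored is implicitly 0.0.
def feature(i: Int): Double = {
  val j = java.util.Arrays.binarySearch(indices, i)
  if (j >= 0) values(j) else 0.0
}
```

Storage here is proportional to the number of non-zeros rather than to the
3 million nominal dimensions.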

Can I expect, or contribute to, any changes that would handle sparse inputs?

Thanks,
Jason



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Sparse-Input-tp1085.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: MLLib Sparse Input

Posted by Xiangrui Meng <me...@gmail.com>.
Hi,

I created a JIRA to discuss and track the progress:

https://spark-project.atlassian.net/browse/MLLIB-18

Let us move our discussion there.

Best,
Xiangrui

On Wed, Feb 5, 2014 at 3:35 PM, Debasish Das <de...@gmail.com> wrote:
> [...]

Re: MLLib Sparse Input

Posted by Debasish Das <de...@gmail.com>.
Hi Xiangrui,

We are also adding support for sparse formats in MLlib. If you have a pull
request or JIRA link, could you please point me to it? jblas did not
implement sparse formats the last time I looked at it, but Colt has sparse
formats that could be reused.

Thanks.
Deb
On Jan 31, 2014 11:15 AM, "Xiangrui Meng" <me...@gmail.com> wrote:
> [...]

Re: MLLib Sparse Input

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Imran,

Yes, for better computational performance we should use the CSC or CSR
format, and I believe we are moving in that direction. But let us first
discuss the format for sparse input. Do you mind moving our discussion to
the JIRA I created?
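For readers unfamiliar with the terms, CSR (compressed sparse row) packs a
whole matrix into three aligned arrays; a minimal illustrative sketch (not
the format under discussion, just the general idea):

```scala
// CSR sketch for the 3x4 matrix:
//   [ 1 0 2 0 ]
//   [ 0 0 0 0 ]
//   [ 0 3 0 4 ]
val rowPtr = Array(0, 2, 2, 4)          // row i spans values(rowPtr(i) until rowPtr(i+1))
val colIdx = Array(0, 2, 1, 3)          // column index of each stored value
val csrVal = Array(1.0, 2.0, 3.0, 4.0)  // the non-zero values, row by row

// Sparse matrix times dense vector: y(i) = sum over stored j of A(i,j) * x(j)
def spmv(x: Array[Double]): Array[Double] =
  Array.tabulate(rowPtr.length - 1) { i =>
    var acc = 0.0
    var k = rowPtr(i)
    while (k < rowPtr(i + 1)) { acc += csrVal(k) * x(colIdx(k)); k += 1 }
    acc
  }
```

CSC is the same idea with the roles of rows and columns swapped.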

https://spark-project.atlassian.net/browse/MLLIB-18

Xiangrui

On Wed, Feb 5, 2014 at 2:11 PM, Imran Rashid <im...@quantifind.com> wrote:
> [...]

Re: MLLib Sparse Input

Posted by Imran Rashid <im...@quantifind.com>.
Hi,

I was hoping to have some discussion about how sparse matrices are
represented in MLLib.  I noticed a few commits to add a basic MatrixEntry
object:

https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala

I know that MatrixEntry is a really nice abstraction, but in my
experience, putting all your data in objects can really destroy
performance in numerical computation.  You either waste a lot of memory
with that representation, or you waste a ton of compute time serializing
and deserializing.  Either way, you also lose cache locality for your data.

I'm thinking we'd want the basic record type to be a sparse vector:

class SparseVector(colIds: Array[Int], colValues: Array[Double])

where the two arrays are aligned.  This is compact & lets you iterate over
all the values easily & efficiently.  Or actually for really sparse data,
you might even want to group multiple rows together into one set of arrays:

class SparseMatrixBlock(rowStartIdx: Array[Int], colIds: Array[Int],
colValues: Array[Double])

which would store a group of rows (or cols), but saving space & getting
cache locality by not using a separate array for each one.
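To illustrate why the aligned-array layout pays off, here is a
hypothetical dot product against a dense weight vector for the
SparseVector sketch above: the loop touches only the stored entries, so
the cost is proportional to the number of non-zeros, not the nominal
width (again, just a sketch, not proposed API):

```scala
class SparseVector(val colIds: Array[Int], val colValues: Array[Double]) {
  // Dot product with a dense vector: iterate the two aligned arrays in
  // lockstep, so the loop runs nnz times regardless of the dimension.
  def dot(dense: Array[Double]): Double = {
    var acc = 0.0
    var k = 0
    while (k < colIds.length) {
      acc += colValues(k) * dense(colIds(k))
      k += 1
    }
    acc
  }
}
```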

I'm speaking just from experience with non-distributed implementations,
but the savings in both speed and memory from using a format like this are
pretty humongous; I'd imagine they apply to Spark also.  (And if not, it
probably means Spark is introducing some overhead that we should try to
eliminate.)

I know the format is not as pleasant to work with, but I think it's
definitely worth it.  And now is really the time to get it right, before a
lot of stuff is written.  We could write a bunch of helper methods for
SparseMatrixBlock to make it easier to work with.

I know this is a big decision that will affect a lot of code.  Not sure if
anybody else has some work in progress.  I could potentially try to get
some performance numbers for comparison and point at some code I've used in
the past, though it's all very rough.

thanks,
Imran



On Fri, Jan 31, 2014 at 1:14 PM, Xiangrui Meng <me...@gmail.com> wrote:
> [...]

Re: MLLib Sparse Input

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Jason,

Sorry, I didn't see this message before I replied in another thread.
So the following is copy-and-paste:

We are currently working on sparse data support, one of the
highest-priority features for MLlib. All existing algorithms will support
sparse input. We will open a JIRA ticket for progress tracking and
discussions.

Best,
Xiangrui

On Fri, Jan 31, 2014 at 10:49 AM, jshao <ja...@gmail.com> wrote:
> [...]