Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2013/04/18 22:41:44 UTC

Vector and Matrices - The Next Gen

The next obvious speedup ideas I can think of are:

1) Batch insert into OpenIntDoubleHashMap (OIDHM) and
OrderedIntDoubleMapping (OIDM). This way mutating operations like plus() or
minus() can iterate over the intersection of elements and add the
differences in one go (see the sketch at the end of this message). Can
anyone think of a smart way to rehash based on the new input elements?

2) Speed up the aggregate and assign methods (Dan is working on this)

3) Generalize the caching framework for derived properties like
getLengthSquared() and extend it to other things, like common norms (L1,
L2) and numNonZeros().

4) Parallelize operations: Use a consistent sharding function to trivially
parallelize certain iterative operations across multiple threads.

5) Replace the current DenseVector and/or encapsulate JBlas inside it.

6) Improve exception handling.

All these can be independent projects. I know I won't get time to get to
this, but I am more than happy to review.
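
To make 1) concrete, here is a minimal sketch of the one-pass merge I have
in mind, written against plain sorted parallel arrays (the index/value
layout OIDM uses). mergePlus and the pre-sized output buffers are
illustrative, not existing code; outIdx/outVal must hold aLen + bLen
entries:

    // One pass over both mappings: intersection entries are added together,
    // entries unique to either side are copied through. Returns the number
    // of merged entries written to outIdx/outVal.
    static int mergePlus(int[] aIdx, double[] aVal, int aLen,
                         int[] bIdx, double[] bVal, int bLen,
                         int[] outIdx, double[] outVal) {
      int i = 0, j = 0, k = 0;
      while (i < aLen && j < bLen) {
        if (aIdx[i] == bIdx[j]) {          // in both: add in one go
          outIdx[k] = aIdx[i];
          outVal[k++] = aVal[i++] + bVal[j++];
        } else if (aIdx[i] < bIdx[j]) {    // only in a
          outIdx[k] = aIdx[i];
          outVal[k++] = aVal[i++];
        } else {                           // only in b
          outIdx[k] = bIdx[j];
          outVal[k++] = bVal[j++];
        }
      }
      while (i < aLen) { outIdx[k] = aIdx[i]; outVal[k++] = aVal[i++]; }
      while (j < bLen) { outIdx[k] = bIdx[j]; outVal[k++] = bVal[j++]; }
      return k;
    }

For the hash map case, the same idea suggests sizing the table once for the
expected size of the union before inserting, rather than rehashing as
elements arrive.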

Re: Vector and Matrices - The Next Gen

Posted by Robin Anil <ro...@gmail.com>.
Obviously we need to benchmark with caching disabled, but I will do that as
part of the Caliper integration. Some benchmark numbers may have improved
only because of the caching. A clean layer for caching these derived
attributes can help a lot with higher-level algorithms.
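
As a sketch of what a clean layer could look like (the class and names here
are made up, not existing Mahout code, and it assumes single-threaded use):

    // Memoize a derived scalar and invalidate it on any write. NaN is the
    // "not cached" sentinel, so the O(n) pass runs at most once between
    // mutations; the invalidation cost is one field write per set().
    abstract class CachingVector {
      private double lengthSquared = Double.NaN;

      abstract double computeLengthSquared();   // the O(n) pass
      abstract void setQuick(int index, double value);

      double getLengthSquared() {
        if (Double.isNaN(lengthSquared)) {
          lengthSquared = computeLengthSquared();
        }
        return lengthSquared;                   // repeat calls are free
      }

      void set(int index, double value) {
        lengthSquared = Double.NaN;             // any write invalidates
        setQuick(index, value);
      }
    }

The same pattern generalizes to a small table of cached properties, which is
what the L1/L2 norms and numNonZeros() would hook into.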

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sat, May 4, 2013 at 10:32 PM, Robin Anil <ro...@gmail.com> wrote:

> A caching layer would trivially give us a lot of benefit, especially for
> repeat calls. I think that's very low-hanging fruit.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: Vector and Matrices - The Next Gen

Posted by Robin Anil <ro...@gmail.com>.
A caching layer would trivially give us a lot of benefit, especially for
repeat calls. I think that's very low-hanging fruit.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sat, May 4, 2013 at 10:31 PM, Robin Anil <ro...@gmail.com> wrote:

> Here is the overall speedup so far since 0.7
>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=3
>
> I back-ported the current benchmark code (disabling SerializationBenchmark,
> which doesn't seem to work with 0.7) and ran it against 0.7.
> There is one regression, but the rest have been pretty positive.

Re: Vector and Matrices - The Next Gen

Posted by Robin Anil <ro...@gmail.com>.
Here is the overall speedup so far since 0.7:
https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=3

I back-ported the current benchmark code (disabling SerializationBenchmark,
which doesn't seem to work with 0.7) and ran it against 0.7.
There is one regression, but the rest have been pretty positive.


On Sun, Apr 21, 2013 at 12:37 PM, Ted Dunning <te...@gmail.com> wrote:

> Yeah... the single thread updater is pretty much what I meant when I
> talked about a threaded map-reduce.  Everybody produces new data in
> parallel and then a single thread per vector makes it all coherent.
> This is actually kind of similar to the way that very large HPC matrix
> libraries work.  Message passing can be more efficient as an idiom for
> communicating data like this than shared memory even when efficient shared
> memory is available.
>

Re: Vector and Matrices - The Next Gen

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Apr 21, 2013 at 10:27 AM, Dan Filimon <da...@gmail.com> wrote:

> > But multi-threaded assign would be very dangerous.  Even if you assign
> > different parts of the vector to different threads, you have to worry
> > about cache line alignment which is generally not visible to Java
> > without very special effort.
> >
>
> I'm terrible at explaining what I mean.
> So, rather than have the threads assign chunks of a vector (which would
> only really work if the underlying Vector was an array of doubles), each
> thread would return an OrderedIntDoubleMapping, and they would be merged
> into a Vector by a single thread at the end.
>
> Even talking about cache alignment worries in Java makes me wonder
> whether we'd be trying to outwit the JVM. It feels kind of wrong, as I'm
> certain that the people writing HotSpot are better at optimizing the code
> than I am. :)
>

Yeah... the single thread updater is pretty much what I meant when I talked
about a threaded map-reduce.  Everybody produces new data in parallel and

This is actually kind of similar to the way that very large HPC matrix
libraries work.  Message passing can be more efficient as an idiom for
communicating data like this than shared memory even when efficient shared
memory is available.
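
A minimal sketch of that idiom in Java (computePartial() and everything else
here is made up for illustration): workers communicate partial results
through a queue instead of sharing the vector, and one thread folds them in.

    import java.util.concurrent.*;

    // "Map" in parallel, "reduce" on one thread. take() is the
    // synchronization boundary that makes each partial coherent to the
    // merging thread.
    static double[] parallelUpdate(final int nThreads, final int dim)
        throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(nThreads);
      final BlockingQueue<double[]> updates =
          new LinkedBlockingQueue<double[]>();
      for (int t = 0; t < nThreads; t++) {
        final int shard = t;
        pool.execute(new Runnable() {
          public void run() {
            updates.add(computePartial(shard, dim)); // placeholder work
          }
        });
      }
      double[] result = new double[dim];
      for (int t = 0; t < nThreads; t++) {
        double[] partial = updates.take();  // happens-before edge
        for (int i = 0; i < dim; i++) {
          result[i] += partial[i];          // single-threaded reduce
        }
      }
      pool.shutdown();
      return result;
    }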

Re: Vector and Matrices - The Next Gen

Posted by Dan Filimon <da...@gmail.com>.
On Sun, Apr 21, 2013 at 8:06 PM, Ted Dunning <te...@gmail.com> wrote:

> On Sun, Apr 21, 2013 at 1:14 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>
> > Would this not work?
>
>
> [special loop to support multiple reader paradigm similar to map-reduce in
> a threaded model]
>
> Yes.  What you suggest would work.
>
> But multi-threaded assign would be very dangerous.  Even if you assign
> different parts of the vector to different threads, you have to worry about
> cache line alignment which is generally not visible to Java without very
> special effort.
>

I'm terrible at explaining what I mean.
So, rather than have the threads assign chunks of a vector (which would
only really work if the underlying Vector was an array of doubles), each
thread would return an OrderedIntDoubleMapping, and they would be merged
into a Vector by a single thread at the end.
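
As a rough sketch of that scheme (plain arrays stand in for the
OrderedIntDoubleMapping each thread would return, and f() is a placeholder
for the per-element work):

    import java.util.*;
    import java.util.concurrent.*;

    // Each worker fills a buffer private to itself, reading but never
    // writing the shared input; a single thread merges the buffers.
    static double[] parallelMap(final double[] input, int nThreads,
                                ExecutorService pool) throws Exception {
      final int chunk = (input.length + nThreads - 1) / nThreads;
      List<Future<double[]>> parts = new ArrayList<Future<double[]>>();
      for (int t = 0; t < nThreads; t++) {
        final int start = Math.min(input.length, t * chunk);
        final int end = Math.min(input.length, start + chunk);
        parts.add(pool.submit(new Callable<double[]>() {
          public double[] call() {
            double[] out = new double[end - start];  // thread-private
            for (int i = start; i < end; i++) {
              out[i - start] = f(input[i]);          // reads only
            }
            return out;
          }
        }));
      }
      double[] result = new double[input.length];
      int offset = 0;
      for (Future<double[]> part : parts) {  // the single merging thread;
        double[] p = part.get();             // get() orders memory for us
        System.arraycopy(p, 0, result, offset, p.length);
        offset += p.length;
      }
      return result;
    }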

Even talking about cache alignment worries in Java makes me wonder whether
we'd be trying to outwit the JVM. It feels kind of wrong, as I'm certain that
the people writing HotSpot are better at optimizing the code than I am. :)

Re: Vector and Matrices - The Next Gen

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Apr 21, 2013 at 1:14 AM, Dan Filimon <da...@gmail.com> wrote:

> Would this not work?


[special loop to support multiple reader paradigm similar to map-reduce in
a threaded model]

Yes.  What you suggest would work.

But multi-threaded assign would be very dangerous.  Even if you assign
different parts of the vector to different threads, you have to worry about
cache line alignment which is generally not visible to Java without very
special effort.

Re: Vector and Matrices - The Next Gen

Posted by Dan Filimon <da...@gmail.com>.
On Sat, Apr 20, 2013 at 10:05 PM, Ted Dunning <te...@gmail.com> wrote:

> On Sat, Apr 20, 2013 at 9:45 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>
> > Here is a preliminary version:
> > https://reviews.apache.org/r/10669/diff/#index_header
> >
> > Regarding the parallelization, the results would be valid as long as the
> > aggregating function is both commutative and associative (which we can
> > now check), but adding the parallelization here might be too much work.
> >
>
> Actually, associativity and commutativity don't make threads safe.
>

I imagined the way it works is the most basic one: each thread is
responsible for some chunk of the vector. Then no two threads access the
same elements in the vector, but if you have n threads you need to combine
the results at the end.

If the operation on the vector is an aggregate + combine, I think that
associativity and commutativity of the aggregating function are enough,
because they guarantee that we can process the chunks in any order and put
them together at the end, resulting in the same value as if we had just gone
through the elements sequentially.
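
For example, a sketch using a CompletionService to make the "any order"
point explicit:

    import java.util.concurrent.*;

    // Each thread reduces its own chunk; partials are folded in whatever
    // order they finish, which is legal only because + is associative and
    // commutative.
    static double parallelSum(final double[] v, int nThreads,
                              ExecutorService pool) throws Exception {
      CompletionService<Double> cs =
          new ExecutorCompletionService<Double>(pool);
      final int chunk = (v.length + nThreads - 1) / nThreads;
      for (int t = 0; t < nThreads; t++) {
        final int start = Math.min(v.length, t * chunk);
        final int end = Math.min(v.length, start + chunk);
        cs.submit(new Callable<Double>() {
          public Double call() {
            double s = 0;
            for (int i = start; i < end; i++) {
              s += v[i];               // no two threads share elements
            }
            return s;
          }
        });
      }
      double total = 0;
      for (int t = 0; t < nThreads; t++) {
        total += cs.take().get();      // completion order, not chunk order
      }
      return total;
    }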

> The problem arises from NUMA memory models.  The problem is that different
> threads can easily wind up accessing old versions of memory that other
> threads are working on.
>

Yeah, I agree that dealing with writes from other threads can cause
problems, which is why I think that when possible, the threads should not
share the data they're working on and definitely not read and write to it
concurrently.

> One of the most important things that synchronization boundaries do is to
> force a partial ordering on memory operations as seen by all threads.
>  Different threads may see different orderings, but all of the orderings
> will be consistent with the weak ordering specified by the synch
> boundaries.
>
> It is possible to do lockless thread-safe code, but it requires very clever
> design.  Our matrices and vectors definitely are not designed that way.
>

I think that for performing parallel operations (at least on Vectors), if we
were to do this in a multithreaded way, we'd chunk up the vector (maybe with
a special iterator?) and do whatever we need to by only reading from the
initial vector and putting intermediate results in something like an
OrderedIntDoubleMapping. The resulting mappings could then be merged to form
the final vector.
This would be pretty similar to what we do now, since our plus/minus/...
don't mutate the original vector (though we'd do one more copy when merging
the chunks).

Would this not work?

Re: Vector and Matrices - The Next Gen

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Apr 20, 2013 at 9:45 AM, Dan Filimon <da...@gmail.com> wrote:

> Here is a preliminary version:
> https://reviews.apache.org/r/10669/diff/#index_header
>
> Regarding the parallelization, the results would be valid as long as the
> aggregating function is both commutative and associative (which we can now
> check), but adding the parallelization here might be too much work.
>

Actually, associativity and commutativity don't make threads safe.

The problem arises from NUMA memory models.  The problem is that different
threads can easily wind up accessing old versions of memory that other
threads are working on.

One of the most important things that synchronization boundaries do is to
force a partial ordering on memory operations as seen by all threads.
Different threads may see different orderings, but all of the orderings
will be consistent with the weak ordering specified by the synch boundaries.

It is possible to do lockless thread-safe code, but it requires very clever
design.  Our matrices and vectors definitely are not designed that way.
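
A tiny made-up illustration of the hazard:

    // With plain fields there is no happens-before edge between writer()
    // and reader(), so reader() can legally observe done == true while
    // still seeing the stale 0.0 in result[0]. Declaring done volatile
    // (or synchronizing) restores the ordering.
    class StaleRead {
      double[] result = new double[16];
      boolean done = false;           // plain field: no ordering guarantee

      void writer() {
        result[0] = 42.0;
        done = true;
      }

      void reader() {
        if (done) {
          System.out.println(result[0]);  // may print 0.0
        }
      }
    }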

Re: Vector and Matrices - The Next Gen

Posted by Dan Filimon <da...@gmail.com>.
On Thu, Apr 18, 2013 at 11:41 PM, Robin Anil <ro...@gmail.com> wrote:

> The next obvious speedup ideas I can think of are:
>
> 1) Batch insert into OpenIntDoubleHashMap (OIDHM) and
> OrderedIntDoubleMapping (OIDM). This way mutating operations like plus() or
> minus() can iterate over the intersection of elements and add the
> differences in one go. Can anyone think of a smart way to rehash based on
> the new input elements?
>
> 2) Speed up the aggregate and assign methods (Dan is working on this)
>

Regarding this, I'm testing the code to see if anything breaks, and then I
want to see what the performance is like.
I'm experimenting with making every operation a variant of aggregate() or
assign().
This is useful because there's just one code path to look at, and we can
focus on high-level optimizations that apply to a larger class of functions.
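
For instance, assuming the current Vector API and the function objects in
org.apache.mahout.math.function.Functions:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.Functions;

    public class AggregateAssignDemo {
      public static void main(String[] args) {
        Vector x = new DenseVector(new double[] {1, 2, 3});
        Vector y = new DenseVector(new double[] {4, 5, 6});
        // The specialized operations fall out of assign()/aggregate() plus
        // the right function objects: one code path to optimize.
        Vector sum = x.clone().assign(y, Functions.PLUS);            // x.plus(y)
        double dot = x.aggregate(y, Functions.PLUS, Functions.MULT); // x.dot(y)
        double l1 = x.aggregate(Functions.PLUS, Functions.ABS);      // x.norm(1)
        System.out.println(sum + " " + dot + " " + l1);
      }
    }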

Here is a preliminary version:
https://reviews.apache.org/r/10669/diff/#index_header

Regarding the parallelization, the results would be valid as long as the
aggregating function is both commutative and associative (which we can now
check), but adding the parallelization here might be too much work.


> 3) Generalize the caching framework for derived properties like
> getLengthSquared() and extend it to other things, like common norms (L1,
> L2) and numNonZeros().
>
> 4) Parallelize operations: Use a consistent sharding function to trivially
> parallelize certain iterative operations across multiple threads.
>
> 5) Replace the current DenseVector and/or encapsulate JBlas inside it.
>
> 6) Improve exception handling.
>
> All these can be independent projects. I know I won't get time to get to
> this, but I am more than happy to review.
>

Re: Vector and Matrices - The Next Gen

Posted by Ted Dunning <te...@gmail.com>.
Threading higher in the stack usually avoids the need for much thread
safety for matrices.  I can usually use a message passing style to push
around updates and avoid the question entirely.

Sent from my iPhone

On Apr 19, 2013, at 11:32, Robin Anil <ro...@gmail.com> wrote:

> Yes, that may be a better approach. Do we share the vectors across threads
> anywhere in our code? Our Vector implementations are thread-unsafe. Making
> them thread-safe, or having a thread-safe version, may be one more thing we
> can build.
> 
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> 
> 
> On Thu, Apr 18, 2013 at 6:37 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> On Thu, Apr 18, 2013 at 1:41 PM, Robin Anil <ro...@gmail.com> wrote:
>> 
>>> 4) Parallelize operations: Use a consistent sharding function to
>> trivially
>>> parallelize certain iterative operations across multiple threads.
>>> 
>> 
>> I have done this with good results, but I find it is usually better to
>> thread higher up the stack because coarse threading is often easier than
>> fine grain threading.
>> 

Re: Vector and Matrices - The Next Gen

Posted by Jake Mannix <ja...@gmail.com>.
They're very unsafe, and it gets really complicated to make them both
highly performant and thread safe; like Ted says, just synchronize at a
higher level.  You're never dealing with one Vector and wanting to max out
all 8 cores on that one vector; you're looking at millions of vectors - give
1/8th to each core, and don't bother with making them threadsafe in the
first place.
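
Something like this sketch is usually all you need (process() is a
placeholder for the per-vector work):

    import java.util.List;
    import java.util.concurrent.*;
    import org.apache.mahout.math.Vector;

    // Parallelize across vectors, never within one: each vector is only
    // ever touched by a single thread, so the vectors themselves need no
    // thread safety.
    static void forEachVector(final List<Vector> vectors, final int nThreads)
        throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(nThreads);
      for (int t = 0; t < nThreads; t++) {
        final int shard = t;
        pool.execute(new Runnable() {
          public void run() {
            for (int i = shard; i < vectors.size(); i += nThreads) {
              process(vectors.get(i));    // one thread per vector
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    }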

But we should probably document them all as being thread-unsafe (although
anyone familiar with the usual collections knows that unless you have
special-purpose concurrent/synchronized collections, they're not thread
safe).

On Fri, Apr 19, 2013 at 11:32 AM, Robin Anil <ro...@gmail.com> wrote:

> Yes, that may be a better approach. Do we share the vectors across threads
> anywhere in our code? Our Vector implementations are thread-unsafe. Making
> them thread-safe, or having a thread-safe version, may be one more thing we
> can build.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.



-- 

  -jake

Re: Vector and Matrices - The Next Gen

Posted by Robin Anil <ro...@gmail.com>.
Yes, that may be a better approach. Do we share the vectors across threads
anywhere in our code? Our Vector implementations are thread-unsafe. Making
them thread-safe, or having a thread-safe version, may be one more thing we
can build.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Apr 18, 2013 at 6:37 PM, Ted Dunning <te...@gmail.com> wrote:

> On Thu, Apr 18, 2013 at 1:41 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > 4) Parallelize operations: Use a consistent sharding function to
> trivially
> > parallelize certain iterative operations across multiple threads.
> >
>
> I have done this with good results, but I find it is usually better to
> thread higher up the stack because coarse threading is often easier than
> fine grain threading.
>

Re: Vector and Matrices - The Next Gen

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Apr 18, 2013 at 1:41 PM, Robin Anil <ro...@gmail.com> wrote:

> 4) Parallelize operations: Use a consistent sharding function to trivially
> parallelize certain iterative operations across multiple threads.
>

I have done this with good results, but I find it is usually better to
thread higher up the stack because coarse threading is often easier than
fine grain threading.