Posted to dev@systemml.apache.org by Niketan Pansare <np...@us.ibm.com> on 2016/05/03 22:26:19 UTC

Discussion on GPU backend


Hi all,

I have updated the design document for our GPU backend in the JIRA
https://issues.apache.org/jira/browse/SYSTEMML-445. The implementation
details are based on the prototype I created, which is available in PR
https://github.com/apache/incubator-systemml/pull/131. Once we are done
with the discussion, I can clean up and separate out the GPU backend into a
separate PR for easier review :)

Here are the key design points:
A GPU backend would implement two abstract classes:
   1.	GPUContext
   2.	GPUObject



The GPUContext is responsible for GPU memory management and gets call-backs
from SystemML's bufferpool on the following methods:
   1.	void acquireRead(MatrixObject mo)
   2.	void acquireModify(MatrixObject mo)
   3.	void release(MatrixObject mo, boolean isGPUCopyModified)
   4.	void exportData(MatrixObject mo)
   5.	void evict(MatrixObject mo)



A GPUObject (like RDDObject and BroadcastObject) is stored in a
CacheableData object. It contains the following methods, which are called
back from the corresponding GPUContext (a combined sketch of both abstract
classes follows this list):
   1.	void allocateMemoryOnDevice()
   2.	void deallocateMemoryOnDevice()
   3.	long getSizeOnDevice()
   4.	void copyFromHostToDevice()
   5.	void copyFromDeviceToHost()
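
Putting the two lists together, here is a minimal Java sketch of what these
two abstract classes could look like. The method signatures are taken
verbatim from the lists above; everything else (class bodies, modifiers,
and the assumption that MatrixObject is the existing SystemML type) is
illustrative only, not the actual PR code.

public abstract class GPUContext {
  // Call-backs invoked by SystemML's bufferpool; mo identifies the
  // matrix whose GPU copy is being read, modified, released, or evicted.
  public abstract void acquireRead(MatrixObject mo);
  public abstract void acquireModify(MatrixObject mo);
  public abstract void release(MatrixObject mo, boolean isGPUCopyModified);
  public abstract void exportData(MatrixObject mo);
  public abstract void evict(MatrixObject mo);
}

public abstract class GPUObject {
  // Call-backs invoked by the corresponding GPUContext to manage the
  // device-side copy of a single matrix.
  public abstract void allocateMemoryOnDevice();
  public abstract void deallocateMemoryOnDevice();
  public abstract long getSizeOnDevice();
  public abstract void copyFromHostToDevice();
  public abstract void copyFromDeviceToHost();
}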



In the initial implementation, we will add JCudaContext and JCudaPointer,
which will extend the above abstract classes respectively. The JCudaContext
will be created by the ExecutionContextFactory depending on the
user-specified accelerator. Analogous to MR/SPARK/CP, we will add a new
ExecType, GPU, and implement GPU instructions (a sketch of the factory
dispatch follows).
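
A minimal sketch of how that factory dispatch could look; the factory class
name comes from the text above, but the method signature and the flag
handling are assumptions for illustration only:

public class ExecutionContextFactory {
  // Hypothetical sketch: return a JCuda-backed context only when the
  // user requested an accelerator; otherwise no GPU context is created.
  public static GPUContext createGPUContext(boolean acceleratorEnabled) {
    if (!acceleratorEnabled)
      return null; // plain CP/MR/SPARK execution, no GPU involvement
    return new JCudaContext(); // concrete subclass of GPUContext
  }
}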

The above design is general enough so that other people can implement
custom accelerators (for example: OpenCL) and also follows the design
principles of our CP bufferpool.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

Re: Discussion on GPU backend

Posted by Luciano Resende <lu...@gmail.com>.
Well, there is no requirement for building platform-specific binaries, so
one option would be to build for "Linux" and provide user documentation in
case there is interest in platform-specific binaries (e.g., this is how
Hadoop handles Windows support and its required DLLs).

On Wed, May 18, 2016 at 10:51 AM, Deron Eriksson <de...@gmail.com>
wrote:

> Hi,
>
> I'm wondering what would be a good way to handle JCuda in terms of the
> build release packages. Currently we have 11 artifacts that we are
> building:
>    systemml-0.10.0-incubating-SNAPSHOT-inmemory.jar
>    systemml-0.10.0-incubating-SNAPSHOT-javadoc.jar
>    systemml-0.10.0-incubating-SNAPSHOT-sources.jar
>    systemml-0.10.0-incubating-SNAPSHOT-src.tar.gz
>    systemml-0.10.0-incubating-SNAPSHOT-src.zip
>    systemml-0.10.0-incubating-SNAPSHOT-standalone.jar
>    systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz
>    systemml-0.10.0-incubating-SNAPSHOT-standalone.zip
>    systemml-0.10.0-incubating-SNAPSHOT.jar
>    systemml-0.10.0-incubating-SNAPSHOT.tar.gz
>    systemml-0.10.0-incubating-SNAPSHOT.zip
>
> It looks like JCuda is platform-specific, so you typically need different
> jars/dlls/sos/etc for each platform. If I'm understanding things correctly,
> if we generated Windows/Linux/LinuxPowerPC/MacOS-specific SystemML
> artifacts for JCuda, we'd potentially have an enormous number of artifacts.
>
> Is this something that could be potentially handled by specific profiles in
> the pom so that a user might be able to do something like "mvn clean
> package -P jcuda-windows" so that a user could be responsible for building
> the platform-specific SystemML jar for jcuda? Or is this something that
> could be handled differently, by putting the platform-specific jcuda jar on
> the classpath and any dlls or other needed libraries on the path?
>
> Deron
>
>
>
> On Tue, May 17, 2016 at 10:50 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Luciano,
> >
> > Like all our backends, there is no change in the programming model. The
> > user submits a DML script and specifies whether she wants to use an
> > accelerator. Assuming that we compile jcuda jars into SystemML.jar, the
> > user can use the GPU backend using the following command:
> > spark-submit --master yarn-client ... -f MyAlgo.dml -accelerator -exec
> > hybrid_spark
> >
> > The user also needs to set LD_LIBRARY_PATH so that it points to the JCuda
> > DLL or .so files. Please see
> > https://issues.apache.org/jira/browse/SPARK-1720 ... For example, the
> > user can add the following to spark-env.sh:
> > export LD_LIBRARY_PATH=<path to jcuda so>:$LD_LIBRARY_PATH
> >
> > The first version of the GPU backend will only accelerate CP. In this
> > case, we have four types of instructions:
> > 1. CP
> > 2. GPU (requires GPU on the driver)
> > 3. SPARK
> > 4. MR
> >
> > Note, the first version will require the CUDA/JCuda dependency to be
> > installed on the driver only.
> >
> > The next version will accelerate our distributed instructions as well. In
> > this case, we will have six types of instructions:
> > 1. CP
> > 2. GPU
> > 3. SPARK
> > 4. MR
> > 5. SPARK-GPU (requires GPU cluster)
> > 6. MR-GPU (requires GPU cluster)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Luciano Resende <lu...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 05/17/2016 09:13 PM
> > Subject: Re: Discussion on GPU backend
> > ------------------------------
> >
> >
> >
> > Great to see detailed information on this topic Niketan, I guess I have
> > missed when you posted it initially.
> >
> > Could you elaborate a little more on what is the programming model for
> when
> > the user wants to leverage GPU ? Also, today I can submit a job to spark
> > using --jars and it will handle copying the dependencies to the worker
> > nodes. If my application wants to leverage GPU, what extras dependencies
> > will be required on the worker nodes, and how they are going to be
> > installed/updated on the Spark cluster ?



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Discussion on GPU backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Luciano: Yes, there was a bit of confusion, and hence I wanted to iron
things out to foster collaboration and community feedback on the GPU
backend.

There are multiple issues:
1. Any work on smaller GPU PRs is dependent on the initial PR getting into
the master (as the initial PR contains the buffer-pool integration logic),
or at least on everyone agreeing with the design of the initial PR.
2. There is an interest in collaborating on the initial PR itself and I
would like to see collaboration from early on (See
https://issues.apache.org/jira/browse/SYSTEMML-701 and
https://issues.apache.org/jira/browse/SYSTEMML-702).
3. The policy of squashing the PR into one commit essentially means only
one person's work will be acknowledged. I am a little uneasy about asking
people to collaborate and then not acknowledging their work. For example,
Mike's commits/references were lost when
https://github.com/apache/incubator-systemml/commit/c334c2c85bc9cbb343e63b5b28ff3a1c5098c7fa
was delivered.
4. The initial PR is waiting on the following items:
- SystemML 0.10 being released (as we agreed not to include the GPU backend
in the 0.10 release).
- The unknowns and concerns regarding the GPU backend being discussed and
addressed.
- Resolving the dev dependency issue pointed out by Matthias: the jcu*.jar
files need to be hosted in a Maven repository. There are a few alternatives
I have already explored in this direction:
(a) Filed a PR against mavenized-jcuda:
https://github.com/MysterionRise/mavenized-jcuda/pull/15
(b) Hosted a Maven repo using the GitHub mvn plugin. Here is how we can
resolve the system scope:
<repositories>
  <repository>
    <id>central</id>
    <url>https://repo1.maven.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
  </repository>
  <repository>
    <id>mavenized-jcuda-mvn-repo</id>
    <url>https://raw.github.com/niketanpansare/mavenized-jcuda/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>
....
<dependencies>
  <dependency>
    <groupId>org.mystic</groupId>
    <artifactId>mavenized-jcuda</artifactId>
    <version>0.7.5b</version>
    <scope>provided</scope>
  </dependency>
</dependencies>


Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Luciano Resende <lu...@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	05/25/2016 10:06 AM
Subject:	Re: Discussion on GPU backend



But, from the original question, I was under the impression that creating
and merging multiple small PRs was not a possible direction. If that is
OK, then it's regular development practice.




Re: Discussion on GPU backend

Posted by Luciano Resende <lu...@gmail.com>.
But, from the original question, I was under the impression that creating
and merging multiple small PRs was not a possible direction. If that is
OK, then it's regular development practice.

On Wed, May 25, 2016 at 9:20 AM, <du...@gmail.com> wrote:

> In my opinion, the problem with using a separate branch with longer-term
> work, rather than smaller PRs into the master, is that after several
> commits, say 10 or 20, it becomes much more difficult to rebase without
> running into nasty merge conflicts, especially when those conflicts are on
> an intermediate commit so one would have to remember what the code looked
> like at that point in time to properly fix the conflicts. To me, this
> invites issues such as duplicated code and slower progress.
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Discussion on GPU backend

Posted by du...@gmail.com.
In my opinion, the problem with using a separate branch with longer-term work, rather than smaller PRs into the master, is that after several commits, say 10 or 20, it becomes much more difficult to rebase without running into nasty merge conflicts, especially when those conflicts are on an intermediate commit so one would have to remember what the code looked like at that point in time to properly fix the conflicts. To me, this invites issues such as duplicated code and slower progress.

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On May 25, 2016, at 9:01 AM, Luciano Resende <lu...@gmail.com> wrote:
> 
> On Wed, May 25, 2016 at 6:03 AM, Berthold Reinwald <re...@us.ibm.com>
> wrote:
> 
>> the discussion is less about (1), (2), or (3). As practiced so far, (3) is
>> the way to go.
>> 
>> The question is about (A) or (B). Curious what the Apache-suggested
>> practice is.
> Apache is keen on fostering open collaboration, so specifically about
> branching, having a SystemML branch that is used for
> collaboration/experimentation is probably preferable, as it gives
> visibility to others in the community and enables iterative development
> through review of small patches, while shielding the trunk from issues
> these experiments can cause.
> 
> I would just recommend avoiding making the branch stale, and keeping it
> rebased on the latest master, which will make integration much easier in
> the future.
> 
> 
> 
> -- 
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

Re: Discussion on GPU backend

Posted by Luciano Resende <lu...@gmail.com>.
On Wed, May 25, 2016 at 6:03 AM, Berthold Reinwald <re...@us.ibm.com>
wrote:

> the discussion is less about (1), (2), or (3). As practiced so far, (3) is
> the way to go.
>
> The question is about (A) or (B). Curious what the Apache-suggested
> practice is.
>
>
Apache is keen on fostering open collaboration, so specifically about
branching, having a SystemML branch that is used for
collaboration/experimentation is probably preferable, as it gives
visibility to others in the community and enables iterative development
through review of small patches, while shielding the trunk from issues
these experiments can cause.

I would just recommend avoiding making the branch stale, and keeping it
rebased on the latest master, which will make integration much easier in
the future.



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: Discussion on GPU backend

Posted by du...@gmail.com.
Yeah to do this in the most "Apache Way (TM)", as well as to maintain sanity, we should definitely use JIRA issues (ideally actual "sub tasks") and PRs to split up major features. It would also be great to split it up into chunks of varying complexity that do not block others, so that we could gather more contributors of various SystemML experience levels. The JIRA issues should be used to divvy up tasks, and PRs should be used to propose an implementation for that task, which would be followed by the usual comments from other contributors. 

As for a few other best practices with PRs, the PRs should also be merged with a "Closes #172." line appended to the end, where the number reflects the GitHub PR number, so that the conversations on a PR are linked to the final merged commit. Also, any necessary rebasing on a PR should be done by simply overwriting that PR branch (which exists on the contributor's fork of SystemML), which allows GitHub to keep the same PR open, and thus the entire conversation can be followed. 

Excited about the GPU work!

-Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On May 25, 2016, at 8:08 AM, Niketan Pansare <np...@us.ibm.com> wrote:
> 
> Thanks Berthold and Matthias for your suggestions. It is important to note that whether we go with (A) or (B), the initial PR will be squashed into one commit, and individual commits by external contributors will be lost in the process. However, since we are planning to go with option (3), the impact won't be too severe.
> 
> Matthias: Here are my thoughts regarding the unknowns for GPU backend:
> 1. Handling of native libraries:
> Both JCuda and Nvidia provide shared libraries/DLL for most OS/platforms along with installation instructions.
> 
> For deployment:
> As per the previous email, the native libraries will be treated as an external dependency, just like Hadoop/Spark. For example, if someone executes "hadoop jar SystemML.jar -f test.dml -exec hybrid_spark" without Hadoop installed, she will get a "Class Not Found" exception. In similar fashion, if the user does not include the JCu*.jar files or does not provide the native libraries (JCu*.dll/.so, CUDA, or CuDNN) and supplies the "-accelerator" flag, a "Class not found" or "Cannot load .." exception will be thrown, respectively. If the user does not supply the "-accelerator" flag, SystemML will proceed with normal execution as it does today.
> 
> For dev:
> We are planning to host the jcu*.jar files in a Maven repository. Once that's done, the "system" scope in the pom will be replaced by the "provided" scope, and the jcu*.jar files will be deleted from the PR. As with deployment, it is the responsibility of the developer to install the native libraries if she intends to work on the GPU backend.
> 
> For testing:
> The user can set the environment variable "CUDA_PATH" and set the TEST_GPU flag to enable GPU tests (please see https://github.com/apache/incubator-systemml/pull/165/files#diff-bcda036e4c3ff62cb2648acbbd19f61aR113). The PR will be accompanied by additional tests which will be enabled only when TEST_GPU is set. Having the TEST_GPU flag allows users without an Nvidia GPU to run the integration tests. As with deployment, it is the responsibility of the developer to install the native libraries for testing with the TEST_GPU flag.
> 
> The first version will not contain custom native kernels. 
> 
> 2. I can add the summary of the performance comparisons in the PR :)
> 
> Thanks,
> 
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

Re: Discussion on GPU backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Thanks Berthold and Matthias for your suggestions. It is important to note
that whether we go with (A) or (B), the initial PR will be squashed into
one commit, and individual commits by external contributors will be lost in
the process. However, since we are planning to go with option (3), the
impact won't be too severe.

Matthias: Here are my thoughts regarding the unknowns for GPU backend:
1. Handling of native libraries:
Both JCuda and Nvidia provide shared libraries/DLL for most OS/platforms
along with installation instructions.

For deployment:
As per the previous email, the native libraries will be treated as an
external dependency, just like Hadoop/Spark. For example, if someone
executes "hadoop jar SystemML.jar -f test.dml -exec hybrid_spark" without
Hadoop installed, she will get a "Class Not Found" exception. In similar
fashion, if the user does not include the JCu*.jar files or does not
provide the native libraries (JCu*.dll/.so, CUDA, or CuDNN) and supplies
the "-accelerator" flag, a "Class not found" or "Cannot load .." exception
will be thrown, respectively. If the user does not supply the
"-accelerator" flag, SystemML will proceed with normal execution as it does
today (a sketch of such a check follows).
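
For illustration, here is a minimal sketch of how such a fail-fast check
could be implemented. This is not taken from the PR; the class and method
names are hypothetical, though jcuda.Pointer is a real class shipped in the
JCuda jar.

// Hypothetical helper: fail fast with a clear message when the
// -accelerator flag is set but the JCuda jars are missing.
public class AcceleratorCheck {
  public static void ensureJCudaOnClasspath() {
    try {
      // This probe throws ClassNotFoundException when the JCuda jar
      // is absent from the runtime classpath.
      Class.forName("jcuda.Pointer");
    } catch (ClassNotFoundException e) {
      throw new RuntimeException("-accelerator was specified but the "
          + "JCuda jars were not found on the classpath", e);
    }
  }
}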

For dev:
We are planning to host the jcu*.jar files in a Maven repository. Once
that's done, the "system" scope in the pom will be replaced by the
"provided" scope, and the jcu*.jar files will be deleted from the PR. As
with deployment, it is the responsibility of the developer to install the
native libraries if she intends to work on the GPU backend.

For testing:
The user can set the environment variable "CUDA_PATH" and set the TEST_GPU
flag to enable GPU tests (please see
https://github.com/apache/incubator-systemml/pull/165/files#diff-bcda036e4c3ff62cb2648acbbd19f61aR113
). The PR will be accompanied by additional tests which will be enabled
only when TEST_GPU is set. Having the TEST_GPU flag allows users without an
Nvidia GPU to run the integration tests. As with deployment, it is the
responsibility of the developer to install the native libraries for testing
with the TEST_GPU flag (a sketch of such a gated test follows).
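
For illustration, here is a minimal sketch (assuming JUnit 4; the test
class and method names are hypothetical) of how a GPU test could be gated
on the TEST_GPU flag so that it is skipped, not failed, on machines without
a GPU setup:

import org.junit.Assume;
import org.junit.Test;

public class GPUBackendSmokeTest {
  @Test
  public void gpuInstructionMatchesCP() {
    // Skip this test unless the developer opted in and CUDA is installed.
    Assume.assumeTrue(System.getenv("TEST_GPU") != null);
    Assume.assumeTrue(System.getenv("CUDA_PATH") != null);
    // ... execute a GPU instruction and compare its output with CP ...
  }
}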

The first version will not contain custom native kernels.

2. I can add the summary of the performance comparisons in the PR :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:	Berthold Reinwald/Almaden/IBM@IBMUS
To:	dev@systemml.incubator.apache.org
Date:	05/25/2016 06:03 AM
Subject:	Re: Discussion on GPU backend



the discussion is less about (1), (2), or (3). As practiced so far, (3) is
the way to go.

The question is about (A) or (B). Curious what the Apache-suggested
practice is.

Regards,
Berthold Reinwald
IBM Almaden Research Center
office: (408) 927 2208; T/L: 457 2208
e-mail: reinwald@us.ibm.com



Re: Discussion on GPU backend

Posted by Berthold Reinwald <re...@us.ibm.com>.
the discussion is less about (1), (2), or (3). As practiced so far, (3) is 
the way to go.

The question is about (A) or (B). Curious what the Apache-suggested
practice is.

Regards,
Berthold Reinwald
IBM Almaden Research Center
office: (408) 927 2208; T/L: 457 2208
e-mail: reinwald@us.ibm.com



From:   Matthias Boehm/Almaden/IBM@IBMUS
To:     dev@systemml.incubator.apache.org
Date:   05/24/2016 09:10 PM
Subject:        Re: Discussion on GPU backend



Generally, I think we should really stick to (3) as done in the past, 
i.e., bring up major features in the roadmap discussions, create jira 
epics and try to break them into rather isolated tasks. This works for 
almost any major/minor feature. The only exceptions are features where it
is initially unknown if the potential benefits outweigh the increased 
complexity (or other disadvantages). Here, prototypes are required but 
everybody should be free to choose a way of maintaining them. I also don't 
expect too much collaboration here because of the unknown status. Once the 
initial unknowns are resolved, we should come back to (3), though.

Regarding the GPU backend, the unknowns to resolve are (1) the handling of 
native libraries/kernels for deployment/test/dev, and (2) performance 
comparisons on selected algorithms (prototypes, not fully integrated), 
data sizes, and platforms. Once we have answers to these questions, we can 
create all the tasks for optimizer/runtime integration. 

Regards,
Matthias 



From: Niketan Pansare/Almaden/IBM@IBMUS
To: dev@systemml.incubator.apache.org
Date: 05/24/2016 11:55 AM
Subject: Re: Discussion on GPU backend



Hi all,

Since there is interest in collaborating on the GPU backend, I wanted to
know the preferred way to go ahead with a new feature (i.e., the GPU
backend)? This discussion is also generally applicable to other major
features (for example: Flink backend, Deep Learning support, OpenCL 
backend, new data types, new built-in functions, new algorithms, etc).

The first point of discussion is what would qualify as a "major feature"
and how we integrate it into SystemML. Here are three options that could
serve as a potential requirement:
1. The feature has to be fully functional and fully optimized. For
example, in the case of additional backends, the PR can only be merged
if and only if all the instructions (CP or distributed) have been
implemented and are at least as optimized as our existing alternate
backends. In the case of algorithms or built-in functions, the PR can
only be merged if and only if it runs on all the backends for all
datasets and is comparable in performance and accuracy with external ML
libraries.
2. The feature has to be fully functional. In this case, the PR can only
be merged if and only if all the instructions (CP or distributed) have
been implemented. However, the first version of the new backend need not
perform faster than our existing alternate backends.
3. Incremental addition, but with unit test cases that address quality and
stability concerns. In this case, a PR can be merged if a subset of
instructions has been implemented along with a set of unit test cases
suggested by our committers. The main benefit here is quick-feedback
iterations from our committers, whereas the main drawback is an
intermediate state where we don't fully support the given backend for
certain scenarios.

If we decide to go with option 1 or 2, then potentially there will be a
lot of code to review at the end, and ideally we should give our committers
the opportunity to provide early review comments on the feature. This
will mitigate the risk of having to re-implement the entire feature. The
options here are:
A. Create a branch on https://github.com/apache/incubator-systemml. This
allows people to collaborate as well as allows committers to look at the
code.
B. Create a branch on a fork and have a PR up to allow committers to raise
concerns and provide suggestions. This is done for
https://github.com/apache/incubator-systemml/pull/165 and
https://github.com/apache/incubator-systemml/pull/119. To collaborate, the
person creating the PR will act as the committer for the feature and will
accept PRs on its branch and will be responsible for resolving conflicts
and keeping the PR in sync with the master.

If we decide to go with option 3 (i.e., incremental addition), option B
seems to be the logical choice, as we already do this for other features.

My goal here is not to create a formal process but instead to avoid any 
potential misunderstanding/confusion and also to follow recommended Apache 
practices.

Please email back with your thoughts :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar


From: Deron Eriksson <de...@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 05/18/2016 11:22 AM
Subject: Re: Discussion on GPU backend



Hi Niketan,

Good idea, I think that would be the cleanest solution for now. Since 
JCuda
doesn't appear to be in a public maven repo, it adds a layer of difficulty
to clean integration via maven builds.

Deron


On Wed, May 18, 2016 at 10:55 AM, Niketan Pansare <np...@us.ibm.com>
wrote:

> Hi Deron,
>
> Good points. I vote that we keep JCUDA and other accelerators we add as 
an
> external dependency. This means the user will have to ensure JCuda.jar 
in
> the class path and JCuda.DLL/JCuda.so in the LD_LIBRARY_PATH.
>
> I don't think JCuda.jar is platform-specific.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> [image: Inactive hide details for Deron Eriksson ---05/18/2016 10:51:17
> AM---Hi, I'm wondering what would be a good way to handle JCuda]Deron
> Eriksson ---05/18/2016 10:51:17 AM---Hi, I'm wondering what would be a 
good
> way to handle JCuda in terms of the
>
> From: Deron Eriksson <de...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 05/18/2016 10:51 AM
> Subject: Re: Discussion on GPU backend
> ------------------------------
>
>
>
> Hi,
>
> I'm wondering what would be a good way to handle JCuda in terms of the
> build release packages. Currently we have 11 artifacts that we are
> building:
>   systemml-0.10.0-incubating-SNAPSHOT-inmemory.jar
>   systemml-0.10.0-incubating-SNAPSHOT-javadoc.jar
>   systemml-0.10.0-incubating-SNAPSHOT-sources.jar
>   systemml-0.10.0-incubating-SNAPSHOT-src.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT-src.zip
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.jar
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT-standalone.zip
>   systemml-0.10.0-incubating-SNAPSHOT.jar
>   systemml-0.10.0-incubating-SNAPSHOT.tar.gz
>   systemml-0.10.0-incubating-SNAPSHOT.zip
>
> It looks like JCuda is platform-specific, so you typically need 
different
> jars/dlls/sos/etc for each platform. If I'm understanding things 
correctly,
> if we generated Windows/Linux/LinuxPowerPC/MacOS-specific SystemML
> artifacts for JCuda, we'd potentially have an enormous number of 
artifacts.
>
> Is this something that could be potentially handled by specific profiles 
in
> the pom so that a user might be able to do something like "mvn clean
> package -P jcuda-windows" so that a user could be responsible for 
building
> the platform-specific SystemML jar for jcuda? Or is this something that
> could be handled differently, by putting the platform-specific jcuda jar 
on
> the classpath and any dlls or other needed libraries on the path?
>
> Deron
>
>
>
> On Tue, May 17, 2016 at 10:50 PM, Niketan Pansare <np...@us.ibm.com>
> wrote:
>
> > Hi Luciano,
> >
> > Like all our backends, there is no change in the programming model. 
The
> > user submits a DML script and specifies whether she wants to use an
> > accelerator. Assuming that we compile jcuda jars into SystemML.jar, 
the
> > user can use GPU backend using following command:
> > spark-submit --master yarn-client ... -f MyAlgo.dml -accelerator -exec
> > hybrid_spark
> >
> > The user also needs to set LD_LIBRARY_PATH that points to JCuda DLL or 
so
> > files. Please see *https://issues.apache.org/jira/browse/SPARK-1720*
> > <https://issues.apache.org/jira/browse/SPARK-1720> ... For example: 
the
>
> > user can add following to spark-env.sh
> > export LD_LIBRARY_PATH=<path to jcuda so>:$LD_LIBRARY_PATH
> >
> > The first version of GPU backend will only accelerate CP. In this 
case,
> we
> > have four types of instructions:
> > 1. CP
> > 2. GPU (requires GPU on the driver)
> > 3. SPARK
> > 4. MR
> >
> > Note, the first version will require the CUDA/JCuda dependency to be
> > installed on the driver only.
> >
> > The next version will accelerate our distributed instructions as well. 
In
> > this case, we will have six types of instructions:
> > 1. CP
> > 2. GPU
> > 3. SPARK
> > 4. MR
> > 5. SPARK-GPU (requires GPU cluster)
> > 6. MR-GPU (requires GPU cluster)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> >
> > [image: Inactive hide details for Luciano Resende ---05/17/2016 
09:13:24
> > PM---Great to see detailed information on this topic Niketan,]Luciano
> > Resende ---05/17/2016 09:13:24 PM---Great to see detailed information 
on
> > this topic Niketan, I guess I have missed when you posted it in
> >
> > From: Luciano Resende <lu...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 05/17/2016 09:13 PM
> > Subject: Re: Discussion on GPU backend
> > ------------------------------
>
> >
> >
> >
> > Great to see detailed information on this topic Niketan, I guess I 
have
> > missed when you posted it initially.
> >
> > Could you elaborate a little more on what is the programming model for
> when
> > the user wants to leverage GPU ? Also, today I can submit a job to 
spark
> > using --jars and it will handle copying the dependencies to the worker
> > nodes. If my application wants to leverage GPU, what extras 
dependencies
> > will be required on the worker nodes, and how they are going to be
> > installed/updated on the Spark cluster ?
> >
> >
> >
> > On Tue, May 3, 2016 at 1:26 PM, Niketan Pansare <np...@us.ibm.com>
> > wrote:
> >
> > >
> > >
> > > Hi all,
> > >
> > > I have updated the design document for our GPU backend in the JIRA
> > >
> https://issues.apache.org/jira/browse/SYSTEMML-445. The implementation
>
> > > details are based on the prototype I created and is available in PR
> > >
> https://github.com/apache/incubator-systemml/pull/131. Once we are done
>
> > > with the discussion, I can clean up and separate out the GPU backend
> in a
> > > separate PR for easier review :)
> > >
> > > Here are key design points:
> > > A GPU backend would implement two abstract classes:
> > >    1.   GPUContext
> > >    2.   GPUObject
> > >
> > >
> > >
> > > The GPUContext is responsible for GPU memory management and gets
> > call-backs
> > > from SystemML's bufferpool on following methods:
> > >    1.   void acquireRead(MatrixObject mo)
> > >    2.   void acquireModify(MatrixObject mo)
> > >    3.   void release(MatrixObject mo, boolean isGPUCopyModified)
> > >    4.   void exportData(MatrixObject mo)
> > >    5.   void evict(MatrixObject mo)
> > >
> > >
> > >
> > > A GPUObject (like RDDObject and BroadcastObject) is stored in
> > CacheableData
> > > object. It contains following methods that are called back from the
> > > corresponding GPUContext:
> > >    1.   void allocateMemoryOnDevice()
> > >    2.   void deallocateMemoryOnDevice()
> > >    3.   long getSizeOnDevice()
> > >    4.   void copyFromHostToDevice()
> > >    5.   void copyFromDeviceToHost()
> > >
> > >
> > >
> > > In the initial implementation, we will add JCudaContext and
> JCudaPointer
> > > that will extend the above abstract classes respectively. The
> > JCudaContext
> > > will be created by ExecutionContextFactory depending on the
> > user-specified
> > > accelarator. Analgous to MR/SPARK/CP, we will add a new ExecType: 
GPU
> and
> > > implement GPU instructions.
> > >
> > > The above design is general enough so that other people can 
implement
> > > custom accelerators (for example: OpenCL) and also follows the 
design
> > > principles of our CP bufferpool.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> > >
> >
> >
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
> >
> >
> >
> >
>
>
>
>








Re: Discussion on GPU backend

Posted by Matthias Boehm <mb...@us.ibm.com>.
Generally, I think we should really stick to (3) as done in the past, i.e.,
bring up major features in the roadmap discussions, create JIRA epics, and
try to break them into rather isolated tasks. This works for almost any
major/minor feature. The only exceptions are features where it is initially
unknown whether the potential benefits outweigh the increased complexity (or
other disadvantages). Here, prototypes are required, but everybody should be
free to choose a way of maintaining them. I also don't expect too much
collaboration here because of the unknown status. Once the initial unknowns
are resolved, we should come back to (3), though.

Regarding the GPU backend, the unknowns to resolve are (1) the handling of
native libraries/kernels for deployment/test/dev, and (2) performance
comparisons on selected algorithms (prototypes, not fully integrated), data
sizes, and platforms. Once we have answers to these questions, we can
create all the tasks for optimizer/runtime integration.
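
For the test/dev part of (1), here is a minimal sketch of how GPU tests
could skip cleanly on machines without CUDA. JUnit's Assume API is real;
the test class and the availability helper are assumptions for
illustration, not existing SystemML code:

import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class GpuInstructionTest {
    @Before
    public void requireGpu() {
        // Skip (rather than fail) every test in this class when no usable
        // CUDA device is present.
        Assume.assumeTrue(isGpuAvailable());
    }

    @Test
    public void testSomeGpuInstruction() { /* exercise a GPU instruction */ }

    // cuInit(0) loads the JCuda natives on first use and returns
    // CUDA_SUCCESS only if a usable CUDA driver is present.
    private static boolean isGpuAvailable() {
        try {
            return jcuda.driver.JCudaDriver.cuInit(0)
                    == jcuda.driver.CUresult.CUDA_SUCCESS;
        } catch (Throwable t) { // missing jar or missing native libraries
            return false;
        }
    }
}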

Regards,
Matthias




Re: Discussion on GPU backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi all,

Since there is interest in collaborating on the GPU backend, I wanted to
know what the preferred way is to go ahead with a new feature (i.e., the
GPU backend). This discussion is also generally applicable to other major
features (for example: a Flink backend, deep learning support, an OpenCL
backend, new data types, new built-in functions, new algorithms, etc.).

The first point of discussion is what would qualify as a "major feature"
and how we integrate it into SystemML. Here are three options that could
serve as a potential requirement:
1. The feature has to be fully functional and fully optimized. For example:
in the case of additional backends, the PR can be merged only if all the
instructions (CP or distributed) have been implemented and are at least as
optimized as our existing alternate backends. In the case of algorithms or
built-in functions, the PR can be merged only if it runs on all the
backends for all datasets and is comparable in performance and accuracy
with external ML libraries.
2. The feature has to be fully functional. In this case, the PR can be
merged only if all the instructions (CP or distributed) have been
implemented. However, the first version of the new backend need not perform
faster than our existing alternate backends.
3. Incremental addition, but with unit test cases that address quality and
stability concerns. In this case, a PR can be merged if a subset of
instructions has been implemented along with a set of unit test cases
suggested by our committers. The main benefit here is quick feedback
iterations from our committers, whereas the main drawback is an
intermediate state where we don't fully support the given backend for
certain scenarios.

If we decide to go with option 1 or 2, then there will potentially be a lot
of code to review at the end, and ideally we should give our committers an
opportunity to provide early review comments on the feature. This will
mitigate the risk of having to re-implement the entire feature. The options
here are:
A. Create a branch on https://github.com/apache/incubator-systemml. This
allows people to collaborate as well as allows committers to look at the
code.
B. Create a branch on a fork and have a PR up to allow committers to raise
concerns and provide suggestions. This is done for
https://github.com/apache/incubator-systemml/pull/165 and
https://github.com/apache/incubator-systemml/pull/119. To collaborate, the
person creating the PR will act as committer for the feature, will accept
PRs on that branch, and will be responsible for resolving conflicts and
keeping the PR in sync with master.

If we decide to go with option 3 (i.e., incremental addition), option B
seems to be the logical choice, as we already do this for other features.

My goal here is not to create a formal process but instead to avoid any
potential misunderstanding/confusion and also to follow recommended Apache
practices.

Please email back with your thoughts :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



Re: Discussion on GPU backend

Posted by Deron Eriksson <de...@gmail.com>.
Hi Niketan,

Good idea, I think that would be the cleanest solution for now. Since JCuda
doesn't appear to be in a public Maven repo, it adds a layer of difficulty
to clean integration via Maven builds.
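
As a stopgap for local development, assuming the jar has been downloaded
manually, it could be installed into the local repository with the standard
Maven install plugin (the coordinates below are purely illustrative):
mvn install:install-file -Dfile=jcuda.jar -DgroupId=org.jcuda -DartifactId=jcuda -Dversion=0.7.5 -Dpackaging=jar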

Deron



Re: Discussion on GPU backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi Deron,

Good points. I vote that we keep JCuda and other accelerators we add as an
external dependency. This means the user will have to ensure that JCuda.jar
is on the classpath and that the JCuda .dll/.so files are on the
LD_LIBRARY_PATH.

I don't think JCuda.jar is platform-specific.
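
A minimal sketch of how a user could verify such a setup up front;
jcuda.Pointer and JCudaDriver.cuInit are real JCuda API, while the class
itself is illustrative and not part of SystemML:

public class JCudaSetupCheck {
    public static void main(String[] args) {
        try {
            Class.forName("jcuda.Pointer");  // is JCuda.jar on the classpath?
        } catch (ClassNotFoundException e) {
            System.out.println("JCuda jar not on the classpath");
            return;
        }
        try {
            // cuInit loads the native library on first use, so this fails
            // when the .dll/.so files are missing from the LD_LIBRARY_PATH.
            jcuda.driver.JCudaDriver.cuInit(0);
        } catch (Throwable t) { // e.g. UnsatisfiedLinkError
            System.out.println("JCuda natives not on the LD_LIBRARY_PATH");
            return;
        }
        System.out.println("JCuda setup looks usable");
    }
}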

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



Re: Discussion on GPU backend

Posted by Deron Eriksson <de...@gmail.com>.
Hi,

I'm wondering what would be a good way to handle JCuda in terms of the
build release packages. Currently we have 11 artifacts that we are building:
   systemml-0.10.0-incubating-SNAPSHOT-inmemory.jar
   systemml-0.10.0-incubating-SNAPSHOT-javadoc.jar
   systemml-0.10.0-incubating-SNAPSHOT-sources.jar
   systemml-0.10.0-incubating-SNAPSHOT-src.tar.gz
   systemml-0.10.0-incubating-SNAPSHOT-src.zip
   systemml-0.10.0-incubating-SNAPSHOT-standalone.jar
   systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz
   systemml-0.10.0-incubating-SNAPSHOT-standalone.zip
   systemml-0.10.0-incubating-SNAPSHOT.jar
   systemml-0.10.0-incubating-SNAPSHOT.tar.gz
   systemml-0.10.0-incubating-SNAPSHOT.zip

It looks like JCuda is platform-specific, so you typically need different
jars/DLLs/.so files/etc. for each platform. If I'm understanding things correctly,
if we generated Windows/Linux/LinuxPowerPC/MacOS-specific SystemML
artifacts for JCuda, we'd potentially have an enormous number of artifacts.

Is this something that could be potentially handled by specific profiles in
the pom so that a user might be able to do something like "mvn clean
package -P jcuda-windows" so that a user could be responsible for building
the platform-specific SystemML jar for jcuda? Or is this something that
could be handled differently, by putting the platform-specific jcuda jar on
the classpath and any dlls or other needed libraries on the path?

Deron



Re: Discussion on GPU backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi Luciano,

Like all our backends, there is no change in the programming model. The
user submits a DML script and specifies whether she wants to use an
accelerator. Assuming that we compile the jcuda jars into SystemML.jar, the
user can run the GPU backend with the following command:
spark-submit --master yarn-client ... -f MyAlgo.dml -accelerator -exec
hybrid_spark

The user also needs to set LD_LIBRARY_PATH so that it points to the JCuda
.dll or .so files. Please see https://issues.apache.org/jira/browse/SPARK-1720
... For example, the user can add the following to spark-env.sh:
export LD_LIBRARY_PATH=<path to jcuda so>:$LD_LIBRARY_PATH

The first version of the GPU backend will only accelerate CP. In this case, we
have four types of instructions:
1. CP
2. GPU (requires GPU on the driver)
3. SPARK
4. MR

Note, the first version will require the CUDA/JCuda dependency to be
installed on the driver only.

The next version will accelerate our distributed instructions as well. In
this case, we will have six types of instructions:
1. CP
2. GPU
3. SPARK
4. MR
5. SPARK-GPU (requires GPU cluster)
6. MR-GPU (requires GPU cluster)
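
To make the dispatch concrete, here is a minimal sketch; the names and the
selection logic are assumptions for illustration, not SystemML's actual
optimizer code:

class ExecTypeSelection {
    // ExecType values mirror the list above.
    enum ExecType { CP, GPU, SPARK, MR, SPARK_GPU, MR_GPU }

    // "accelerator" stands for the -accelerator flag plus an available GPU
    // (on the driver for GPU, on the cluster for SPARK_GPU/MR_GPU).
    static ExecType chooseExecType(boolean accelerator, boolean fitsInDriver,
                                   boolean sparkCluster) {
        if (fitsInDriver)                  // single-node path
            return accelerator ? ExecType.GPU : ExecType.CP;
        if (sparkCluster)                  // distributed Spark path
            return accelerator ? ExecType.SPARK_GPU : ExecType.SPARK;
        return accelerator ? ExecType.MR_GPU : ExecType.MR;  // MapReduce path
    }
}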

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



Re: Discussion on GPU backend

Posted by Luciano Resende <lu...@gmail.com>.
Great to see detailed information on this topic, Niketan; I guess I missed
it when you posted it initially.

Could you elaborate a little more on what the programming model is when
the user wants to leverage the GPU? Also, today I can submit a job to Spark
using --jars and it will handle copying the dependencies to the worker
nodes. If my application wants to leverage the GPU, what extra dependencies
will be required on the worker nodes, and how are they going to be
installed/updated on the Spark cluster?



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/