Posted to dev@systemml.apache.org by Nakul Jindal <na...@gmail.com> on 2016/12/02 20:03:02 UTC

Re: Build and distribution related issues for GPU support

@Matthias,

Thanks for your questions :)
This thread will also serve as a public record of the discussion behind
the decision to put the PTX under version control.


From what I understand, we compile for a certain virtual architecture and
for a certain real GPU architecture (using the nvcc -arch and -code flags).
Currently, we compile for sm_20:
https://github.com/apache/incubator-systemml/blob/master/src/main/cpp/kernels/SystemML.ptx#L26
This PTX is also good for "higher" REAL architectures (sm_30, sm_32, sm_35,
sm_50, sm_52, sm_53).
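
As a concrete sketch, a PTX file like this can be produced with nvcc as
follows (the exact flags used for SystemML.ptx may differ; the file names
here are illustrative):

  # compile the CUDA kernels to PTX for the compute_20 virtual architecture;
  # the PTX is then JIT-compiled at runtime for the actual device
  nvcc -ptx -arch=compute_20 SystemML.cu -o SystemML.ptx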

Further Reading / References:
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

So to answer your first question - yes, it will run on Kepler devices,
because Kepler (sm_30 and above) is "higher" than sm_20.


For your second question - is there a performance difference between CUBIN
and PTX - yes, there is.
CUBINs are compiled ahead of time for a specific target architecture; PTX
targets the virtual GPU ISA (forward compatible) and is compiled at runtime
by the JIT.
There is thus a startup cost. This post describes approaches to mitigate
that startup cost:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
The blog post suggests either shipping a fat binary - which bundles the PTX
and compiled code for more than one target GPU architecture - or using JIT
caching, which is controlled by setting environment variables.
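
For reference, these are the JIT-cache environment variables that the blog
post describes (the values below are only illustrative):

  # keep the JIT cache enabled (the default) and make it big enough that
  # JIT-compiled kernels are reused across runs
  export CUDA_CACHE_DISABLE=0
  export CUDA_CACHE_MAXSIZE=268435456        # 256 MB
  export CUDA_CACHE_PATH=~/.nv/ComputeCache  # the default location on Linux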

Shipping a fat binary is obviously much more heavyweight than shipping just
the PTX.
Realistically, the PTX JIT compilation adds less than 5 seconds of startup
overhead (on the platforms I tested on) when the "-gpu" flag is used.
It can be argued that in a long-running job, such a constant cost is
justified.


-Nakul

On Thu, Nov 24, 2016 at 12:53 AM, Matthias Boehm <mb...@googlemail.com>
wrote:

> So just to make sure I understand correctly: right now we compiled the few
> example kernels with PTX version 4.3, implying that this is the minimum
> requirement and SystemML's GPU backend will not run, for example, on Kepler
> devices (with PTX version 3), right?
>
> Also, is there a performance difference (generated code, or just-in-time
> compilation overhead) between CUBIN and PTX files? If so, can we quantify
> this difference to make a decision here? Thanks.
>
> Regards,
> Matthias
>
>
> On 11/24/2016 8:34 AM, Nakul Jindal wrote:
>
>> @Matthias -
>> PTX (parallel thread execution) objects are intermediate compiled objects.
>>
>> As of the current master, they are maintained under git version control.
>> This decision was agreed upon after discussing the hassle that a developer
>> of SystemML without the NVIDIA CUDA compiler would otherwise face.
>> It was decided that a person modifying the .cu files is responsible for
>> regenerating the .ptx file and committing it to version control.
>> So far, among the active developers of SystemML, this practice has not
>> disrupted anyone's regular workflow.
>>
>> About PTX versions:
>> Newer PTX versions support newer architectures. As and when we upgrade to
>> a newer CUDA version, we will use the CUDA compiler that ships with that
>> version of the toolkit, recompile the .cu files in the project, and commit
>> the resulting .ptx files.
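>>
>> As a sketch, that regeneration step might look like the following (the
>> nvcc flags are illustrative; see the header of the checked-in PTX file
>> for the exact ones):
>>
>>   # regenerate the PTX from the CUDA kernels and commit the result
>>   nvcc -ptx src/main/cpp/kernels/SystemML.cu \
>>        -o src/main/cpp/kernels/SystemML.ptx
>>   git add src/main/cpp/kernels/SystemML.ptx
>>   git commit -m "Regenerate SystemML.ptx"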
>>
>> Thoughts, comments?
>>
>> -Nakul
>>
>>
>>
>>
>>
>>
>> On Wed, Nov 23, 2016 at 2:43 PM, Matthias Boehm <mb...@googlemail.com>
>> wrote:
>>
>>> thanks for sharing Nakul. Could you please also comment on the PTX story
>>> for custom kernels and different PTX versions?
>>>
>>> Regards,
>>> Matthias
>>>
>>>
>>> On 11/23/2016 10:13 PM, Nakul Jindal wrote:
>>>
>>>> Hi,
>>>>
>>>> SystemML has experimental GPU support, which we are working to solidify.
>>>> Currently, the GPU is supported in CP (standalone/single-node) mode, and
>>>> only a single GPU is used (even if the node has more than one GPU).
>>>>
>>>> Communication between the GPU and the JVM happens through JCuda (MIT
>>>> license) - a lightweight Java wrapper over CUDA that uses JNI. To that
>>>> end, JCuda needs to compile a platform-specific shared library, which is
>>>> then used to communicate with the locally installed CUDA.
>>>> To avoid having to compile this piece of C/C++ code each time, we use
>>>> the Mavenized-JCuda project (MIT license). This project internally
>>>> references a repository which contains compiled shared objects (for
>>>> JCuda) for different platforms and different versions of CUDA.
>>>>
>>>>
>>>> For developers of SystemML (people who compile SystemML from source):
>>>> As of today, one can check out the master branch and follow a series of
>>>> setup steps to get SystemML running in GPU mode.
>>>> These are the steps -
>>>> https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/gpu-backend.md
>>>>
>>>> 1a)
>>>> Broadly,
>>>> 0. Compile systemml & mavenized jcuda.
>>>> 1. Mavenized JCuda jars are put into the classpath of SystemML.
>>>> 2. The native shared library should be put on the LD_LIBRARY_PATH or
>>>> java.library.path (see the sketch after the command below).
>>>> 3. SystemML should be run with the "-gpu" flag. Like so:
>>>> (In the incubator-systemml directory)
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
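>>>>
>>>> Putting steps 0-3 together, a minimal sketch of a developer session
>>>> might look like this (the paths and the mavenized-jcuda checkout
>>>> location are illustrative):
>>>>
>>>>   # 0. build systemml and mavenized-jcuda
>>>>   mvn -q package
>>>>   (cd ../mavenized-jcuda && mvn -q package)
>>>>   # 2. expose the JCuda native libraries to the JVM
>>>>   export LD_LIBRARY_PATH=../mavenized-jcuda/target/lib:$LD_LIBRARY_PATH
>>>>   # 3. run with the GPU backend enabled
>>>>   bin/systemml "file.dml" -gpu force=true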
>>>>
>>>> PR 291 (https://github.com/apache/incubator-systemml/pull/291) tries to
>>>> change this so that setup becomes simpler, given that mavenized-jcuda is
>>>> available in one of the repositories specified in SystemML's pom.xml:
>>>>
>>>> 1b)
>>>> 0. Compile systemml
>>>> 1. Run systemml
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
>>>>
>>>>
>>>>
>>>> For users of SystemML:
>>>> We haven't yet decided how to ship SystemML with GPU support. Here are
>>>> the two ways we can think of:
>>>>
>>>> 2a)
>>>> 0. User installs the prerequisites (Java, CUDA, etc.).
>>>> 1. User "installs" Mavenized-JCuda or JCuda, i.e. the package jars are
>>>> made available on the classpath, and the relevant shared library files
>>>> (.so, .dll) are made available to the JVM through the LD_LIBRARY_PATH
>>>> environment variable or the java.library.path system property. (Note:
>>>> this is only needed when using CUDA versions below 8.0.)
>>>> 2. Download and run the SystemML jar (see the sketch below).
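>>>>
>>>> A minimal sketch of what option 2a might look like for a user (the jar
>>>> names and paths are illustrative):
>>>>
>>>>   # JCuda jars on the classpath, native libraries on LD_LIBRARY_PATH
>>>>   export LD_LIBRARY_PATH=/opt/jcuda/lib:$LD_LIBRARY_PATH
>>>>   java -cp SystemML.jar:jcuda.jar \
>>>>        org.apache.sysml.api.DMLScript -f file.dml -gpu force=true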
>>>>
>>>> 2b)
>>>> We package JCuda/Mavenized-JCuda with the SystemML distribution. We
>>>> already package ANTLR and Wink with our jar; our other dependencies are
>>>> "provided" scope and are not pulled in by the maven shade plugin.
>>>> A separate jar would be released for every platform.
>>>> 0. User installs the prerequisites.
>>>> 1. Download and run the SystemML jar.
>>>>
>>>>
>>>> There is also the matter of running SystemML with GPU support in
>>>> distributed mode.
>>>> In hybrid_spark mode, with option 2a, we'd need to install
>>>> JCuda/Mavenized-JCuda on all the worker nodes; with option 2b, we
>>>> wouldn't need to (see the sketch below).
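>>>>
>>>> For instance, with option 2b the self-contained, platform-specific jar
>>>> could be submitted directly; a hypothetical hybrid_spark launch (flags
>>>> illustrative) might be:
>>>>
>>>>   # option 2b: a single self-contained jar, no per-worker JCuda install
>>>>   spark-submit --master yarn SystemML.jar -f file.dml \
>>>>     -exec hybrid_spark -gpu force=true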
>>>>
>>>>
>>>> Berthold, Niketan, and I have had a discussion and agree on option 2a
>>>> for now.
>>>>
>>>> Are there any thoughts? Inputs?
>>>>
>>>> -Nakul Jindal
>>>>
>>>>
>>>>
>>

Re: Build and distribution related issues for GPU support

Posted by Matthias Boehm <mb...@googlemail.com>.
Great - thanks for the clarifications, Nakul. That sounds good. Let's
just keep an eye on the JIT compilation overhead as the code size grows
(as we add new custom kernels). Thanks.

Regards,
Matthias
