Posted to dev@mahout.apache.org by Nikolai Sakharnykh <ns...@nvidia.com> on 2017/03/10 17:06:56 UTC

Native CUDA support

Hello everyone,

We're actively working on adding native CUDA support to Apache Mahout. Currently, GPU acceleration is enabled through ViennaCL (http://viennacl.sourceforge.net/). ViennaCL is a linear algebra framework that provides multiple backends, including OpenMP, OpenCL, and CUDA. However, as we recently discovered, the CUDA backend in ViennaCL is composed of hand-written CUDA kernels that are not well tuned for the latest GPU architectures. Instead, we decided to explore leveraging the CUDA libraries for linear algebra: cuBLAS (dense matrices), cuSPARSE (sparse matrices), and cuSOLVER (dense factorizations and sparse solvers). These libraries are highly tuned by NVIDIA and provide the best performance for many linear algebra primitives on NVIDIA GPU architectures. Moreover, the libraries receive frequent updates with new CUDA toolkit releases: bug fixes, new functionality, and optimizations.

We considered two approaches:

  1.  Direct calls to the CUDA runtime and libraries through the JavaCPP bridge
  2.  Use the JCuda package (http://www.jcuda.org/)

JCuda is a thin Java layer on top of the CUDA runtime that already provides Java wrappers for all available CUDA libraries, so it makes sense to choose this path. JCuda also provides a mechanism to call custom CUDA kernels by compiling them into PTX with the NVIDIA NVCC compiler and then loading them through CUDA driver API calls in Java code. Here is example code that allocates a pointer (cudaMalloc) and copies data to the GPU (cudaMemcpy) using JCuda:

import jcuda.Pointer;
import jcuda.runtime.cudaMemcpyKind;
import static jcuda.runtime.JCuda.*;

// Allocate memory on the device
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);

// Copy the host data to the device
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
           cudaMemcpyKind.cudaMemcpyHostToDevice);
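
For the custom-kernel path mentioned above, here is a rough sketch of loading a PTX file through the driver API, following the pattern of the JCuda samples; the file name kernel.ptx, the kernel name add, and its parameter list are placeholders, not actual Mahout code:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

// Initialize the driver API and create a context on device 0
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Load the PTX produced by nvcc (e.g. "nvcc -ptx kernel.cu")
// and look up the kernel by name
CUmodule module = new CUmodule();
cuModuleLoad(module, "kernel.ptx");
CUfunction kernel = new CUfunction();
cuModuleGetFunction(kernel, module, "add");

// Device buffer for the kernel to work on
CUdeviceptr deviceBuffer = new CUdeviceptr();
cuMemAlloc(deviceBuffer, n * Sizeof.FLOAT);

// Kernel arguments are passed as a pointer to an array of pointers
Pointer kernelParams = Pointer.to(
    Pointer.to(new int[]{ n }),
    Pointer.to(deviceBuffer));
cuLaunchKernel(kernel,
    (n + 255) / 256, 1, 1,  // grid dimensions
    256, 1, 1,              // block dimensions
    0, null,                // shared memory size, stream
    kernelParams, null);    // kernel parameters, extras
cuCtxSynchronize();

Note that the driver API requires explicit context management, unlike the runtime API used in the snippet above.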

As an alternative to explicit cudaMalloc/cudaMemcpy, a pointer can be allocated with cudaMallocManaged and then accessed on the CPU or on the GPU without explicit copies by leveraging Unified Memory. This enables a simpler data management model and, on newer architectures, features like on-demand paging and transparent GPU memory oversubscription.
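
As a rough sketch, host-side access to a managed allocation could look like the following (this assumes JCuda's getByteBuffer view of the managed buffer, which the JCuda samples demonstrate; n is an element count as before):

import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import jcuda.Pointer;
import jcuda.Sizeof;
import static jcuda.runtime.JCuda.*;

// Allocate managed memory, visible to both host and device
Pointer managed = new Pointer();
cudaMallocManaged(managed, n * Sizeof.FLOAT, cudaMemAttachGlobal);

// The host can read and write it directly -- no explicit cudaMemcpy
FloatBuffer data = managed.getByteBuffer(0, n * Sizeof.FLOAT)
    .order(ByteOrder.nativeOrder()).asFloatBuffer();
for (int i = 0; i < n; i++) data.put(i, i);

// The same pointer can be handed to kernels or library calls;
// synchronize before touching the data on the host again
cudaDeviceSynchronize();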

All CUDA libraries operate directly on GPU pointers. Here is an example of calling a single-precision GEMM (cublasSgemm) with JCuda:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.cublasHandle;
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.*;
import static jcuda.runtime.JCuda.*;

// Create a cuBLAS context
cublasHandle handle = new cublasHandle();
cublasCreate(handle);

// Allocate memory on the device
Pointer d_A = new Pointer();
Pointer d_B = new Pointer();
Pointer d_C = new Pointer();
cudaMalloc(d_A, n * n * Sizeof.FLOAT);
cudaMalloc(d_B, n * n * Sizeof.FLOAT);
cudaMalloc(d_C, n * n * Sizeof.FLOAT);

// Copy the host matrices A and B to the device (omitted here)

// Execute sgemm: C = alpha * A * B + beta * C (column-major)
Pointer pAlpha = Pointer.to(new float[]{ 1.0f });
Pointer pBeta = Pointer.to(new float[]{ 0.0f });
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
           pAlpha, d_A, n, d_B, n, pBeta, d_C, n);

Most of the existing sparse matrix classes and sparse matrix conversion routines in Mahout can keep their structure largely unchanged, since the CSR format is well supported by both the cuSPARSE and cuSOLVER libraries.
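
To make the CSR mapping concrete, here is a hypothetical sketch of a sparse matrix-vector multiply (y = A * x) through JCusparse for a tiny 3x3 matrix, using the cusparseScsrmv routine available in current CUDA toolkits; the matrix values and variable names are made up for illustration and are not Mahout code:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcusparse.cusparseHandle;
import jcuda.jcusparse.cusparseMatDescr;
import static jcuda.jcusparse.JCusparse.*;
import static jcuda.jcusparse.cusparseIndexBase.CUSPARSE_INDEX_BASE_ZERO;
import static jcuda.jcusparse.cusparseMatrixType.CUSPARSE_MATRIX_TYPE_GENERAL;
import static jcuda.jcusparse.cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

// CSR layout of the 3x3 matrix [[10,0,0],[0,20,30],[0,0,40]]
int m = 3, n = 3, nnz = 4;
float[] csrVal = { 10f, 20f, 30f, 40f };
int[] csrRowPtr = { 0, 1, 3, 4 };   // row i spans [csrRowPtr[i], csrRowPtr[i+1])
int[] csrColInd = { 0, 1, 2, 2 };
float[] x = { 1f, 1f, 1f };

// Move the CSR arrays and the input vector to the device
Pointer d_val = new Pointer(), d_rowPtr = new Pointer(), d_colInd = new Pointer();
Pointer d_x = new Pointer(), d_y = new Pointer();
cudaMalloc(d_val, nnz * Sizeof.FLOAT);
cudaMalloc(d_rowPtr, (m + 1) * Sizeof.INT);
cudaMalloc(d_colInd, nnz * Sizeof.INT);
cudaMalloc(d_x, n * Sizeof.FLOAT);
cudaMalloc(d_y, m * Sizeof.FLOAT);
cudaMemcpy(d_val, Pointer.to(csrVal), nnz * Sizeof.FLOAT, cudaMemcpyHostToDevice);
cudaMemcpy(d_rowPtr, Pointer.to(csrRowPtr), (m + 1) * Sizeof.INT, cudaMemcpyHostToDevice);
cudaMemcpy(d_colInd, Pointer.to(csrColInd), nnz * Sizeof.INT, cudaMemcpyHostToDevice);
cudaMemcpy(d_x, Pointer.to(x), n * Sizeof.FLOAT, cudaMemcpyHostToDevice);

// Create a cuSPARSE context and a general, zero-based matrix descriptor
cusparseHandle handle = new cusparseHandle();
cusparseCreate(handle);
cusparseMatDescr descr = new cusparseMatDescr();
cusparseCreateMatDescr(descr);
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

// y = alpha * A * x + beta * y on the device
cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
    Pointer.to(new float[]{ 1f }), descr, d_val, d_rowPtr, d_colInd,
    d_x, Pointer.to(new float[]{ 0f }), d_y);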

Our plan is to first create a proof-of-concept implementation that demonstrates matrix-matrix and/or matrix-vector multiplication using the CUDA libraries, and then to expand functionality by adding more BLAS operations and the advanced algorithms available in cuSOLVER. Stay tuned for more updates!

Regards,
Nikolai.



Re: Native CUDA support

Posted by Shannon Quinn <sq...@gatech.edu>.
Loved this proposal. Excited to see the POC.


Re: Native CUDA support

Posted by Andrew Palumbo <ap...@outlook.com>.
Thank you, Nikolai. This is great news!

Re: Native CUDA support

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
thanks.

JCuda sounds good. :)
