Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/09/03 22:37:17 UTC

Re: Code generation for GPU

See responses inline.

On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar <lo...@gmail.com> wrote:

> Hi,
>
>    1. I found where the code generation
>    <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala> happens
>    in spark code from the blogs
>    https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
>
>    https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
>    and
>    https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
>    However, I could not find where the generated code is executed. A major
>    part of my changes will be there, since this executor will now have to send
>    vectors of columns to GPU RAM, invoke execution, and get the results back
>    to CPU RAM. Thus, the existing executor will change significantly.
>
> The code generation generates Java classes that have an apply method, and
the apply method is called in the operators.

E.g. GenerateUnsafeProjection returns a Projection class (which is just a
class with an apply method), and TungstenProject calls that class.
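To make that calling pattern concrete, here is a hypothetical Python sketch of the mechanism described above. Spark actually emits Java source and compiles it at runtime; the names and the Python mechanics below are purely illustrative:

```python
# Sketch of runtime code generation: build source as a string, compile it,
# and hand the operator a function with an apply-like role to call per row.
def generate_projection(column_indices):
    # Emit source code as a string, roughly what a code generator does.
    body = ", ".join("row[%d]" % i for i in column_indices)
    src = "def apply(row):\n    return (%s,)\n" % body
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["apply"]

# The "operator" (e.g. a project node) invokes the generated apply per row.
project = generate_projection([2, 0])
rows = [(1, "a", 10.0), (2, "b", 20.0)]
projected = [project(r) for r in rows]
# projected == [(10.0, 1), (20.0, 2)]
```

The point of the pattern is that the operator only sees "a class with an apply method"; what that method does was decided by the generator at runtime.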



>
>    1. On the project tungsten blog
>    <https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html>,
>    in the third Code Generation section, it is mentioned that you plan
>    to increase the level of code generation from record-at-a-time expression
>    evaluation to vectorized expression evaluation. Has this been implemented?
>    If not, how do I implement this? I will need access to columnar ByteBuffer
>    objects in DataFrame to do this. Having row by row access to data will
>    defeat this exercise. In particular, I need access to
>    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
>    in the executor of the generated code.
>
>
This is future work. You'd need to create batches of rows or columns. This
is a pretty major refactoring though.
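A minimal sketch of what "batches of rows or columns" could mean, in plain Python (illustrative only; the hard part Reynold mentions is threading this shape through Spark's operators):

```python
# Pivot a batch of rows into column vectors, then evaluate an expression
# once over the whole batch instead of once per row.
def rows_to_columns(rows):
    return [list(col) for col in zip(*rows)]

def vectorized_add(col_a, col_b):
    # On a GPU this loop would become a single kernel over the two columns.
    return [a + b for a, b in zip(col_a, col_b)]

rows = [(1, 10), (2, 20), (3, 30)]
cols = rows_to_columns(rows)          # [[1, 2, 3], [10, 20, 30]]
total = vectorized_add(cols[0], cols[1])
# total == [11, 22, 33]
```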


>
>    1. One thing that confuses me is the changes from 1.4 to 1.5 possibly
>    due to JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull
>    request https://github.com/apache/spark/pull/6479/files. This
>    changed the code generation from quasiquotes (q) to the string
>    interpolation (s) operator, which makes it simpler for me to generate
>    OpenCL code, which is string based. The question is: is this branch
>    stable now? Should I make my changes on spark 1.4, spark 1.5, or the
>    master branch?
>
> In general Spark development velocity is pretty high, as we make a lot of
changes to internals every release. If I were you, I'd use either master or
branch-1.5 for your prototyping.
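The switch to string-based generation matters for the OpenCL idea because a textual generator can print the same expression tree to any target source language. A hypothetical sketch (the tuple encoding and kernel text are mine, not Spark's):

```python
# Render a tiny expression tree to C-like source, then splice it into an
# OpenCL kernel template, the same way string-based codegen composes source.
def to_c_expr(node):
    if isinstance(node, tuple):
        op, left, right = node
        return "(%s %s %s)" % (to_c_expr(left), op, to_c_expr(right))
    return str(node)

expr = ("*", ("+", "in[i]", 1), 2)    # (in[i] + 1) * 2
kernel = (
    "__kernel void eval(__global const int *in, __global int *out) {\n"
    "  int i = get_global_id(0);\n"
    "  out[i] = %s;\n"
    "}\n" % to_c_expr(expr)
)
```

With quasiquotes the generator is tied to producing Scala/Java ASTs; with strings, emitting OpenCL C is just a different template.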


>
>    1. How do I tune the batch size (number of rows in the ByteBuffer)? Is
>    it through the property spark.sql.inMemoryColumnarStorage.batchSize?
>
>
> Thanks in anticipation,
>
> Kiran
> PS:
>
> Other things I found useful were:
>
> *Spark DataFrames*: https://www.brighttalk.com/webcast/12891/166495
> *Apache Spark 1.5*: https://www.brighttalk.com/webcast/12891/168177
>
> The links to JavaCL/ScalaCL:
>
> *Library to execute OpenCL code through Java*:
> https://github.com/nativelibs4java/ScalaCL
> *Library to convert Scala code to OpenCL and execute on GPUs*:
> https://github.com/nativelibs4java/JavaCL
>
>
>

Re: Code generation for GPU

Posted by kiran lonikar <lo...@gmail.com>.
Thanks for pointing to the YARN JIRA. For now, it is good material for my talk,
since it shows that the Hadoop and big data community is already aware of
GPUs and making an effort to exploit them.

Good luck with your talk. That fear is lurking in my mind too :)
On 10-Sep-2015 2:08 pm, "Steve Loughran" <st...@hortonworks.com> wrote:

>
> > On 9 Sep 2015, at 20:18, lonikar <lo...@gmail.com> wrote:
> >
> > I have seen a perf improvement of 5-10 times on expression evaluation
> even
> > on "ordinary" laptop GPUs. Thus, it will be a good demo along with some
> > concrete proposals for vectorization. As you said, I will have to hook
> up to
> > a column structure and perform computation and let the existing spark
> > computation also proceed and compare the performance.
> >
>
> you might also be interested to know that there's now a YARN JIRA on
> making GPU another resource you can ask for
> https://issues.apache.org/jira/browse/YARN-4122
>
> if implemented, it'd let you submit work into the cluster asking for GPUs,
> and get allocated containers on servers with the GPU capacity you need.
> This'd allow you to share GPUs with other code (including your own
> containers)
>
> > I will focus on the slides early (7th Oct is deadline), and then continue
> > the work for another 3 weeks till the summit. It still gives me enough
> time
> > to do considerable work. Hope your fear does not come true.
>
> good luck. And the fear is about my talk at apachecon on the Hadoop stack
> & Kerberos
> >
>
>

Re: Code generation for GPU

Posted by Steve Loughran <st...@hortonworks.com>.
> On 9 Sep 2015, at 20:18, lonikar <lo...@gmail.com> wrote:
> 
> I have seen a perf improvement of 5-10 times on expression evaluation even
> on "ordinary" laptop GPUs. Thus, it will be a good demo along with some
> concrete proposals for vectorization. As you said, I will have to hook up to
> a column structure and perform computation and let the existing spark
> computation also proceed and compare the performance.
> 

you might also be interested to know that there's now a YARN JIRA on making GPU another resource you can ask for
https://issues.apache.org/jira/browse/YARN-4122

if implemented, it'd let you submit work into the cluster asking for GPUs, and get allocated containers on servers with the GPU capacity you need. This'd allow you to share GPUs with other code (including your own containers)

> I will focus on the slides early (7th Oct is deadline), and then continue
> the work for another 3 weeks till the summit. It still gives me enough time
> to do considerable work. Hope your fear does not come true.

good luck. And the fear is about my talk at apachecon on the Hadoop stack & Kerberos
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Code generation for GPU

Posted by kiran lonikar <lo...@gmail.com>.
Thanks. Yes, that's exactly what I would like to do: copy large amounts of
data to GPU RAM, perform the computation, and get bulk rows back for a
map/filter or reduce result. It is true that non-trivial operations benefit
more. Streaming data to GPU RAM and interleaving computation with data
transfer also works, but it complicates the design, and doing it in Spark
would be even more so.

Thanks for bringing up the sorting. It's a good idea, since that code is
already isolated, as you pointed out. I was looking at the terasort effort,
and it is something I always wanted to take up, but I somehow thought
expressions would be easier to deal with in the short term. I would love to
work on that after this, especially because Unsafe is for primitive types
and thus suited to the GPU computation model. It would be exciting to better
the terasort record too.

Kiran
On 10-Sep-2015 1:12 pm, "Paul Wais" <pa...@gmail.com> wrote:

> In order to get a major speedup from applying *single-pass*
> map/filter/reduce
> operations on an array in GPU memory, wouldn't you need to stream the
> columnar data directly into GPU memory somehow?  You might find in your
> experiments that GPU memory allocation is a bottleneck.  See e.g. John
> Canny's paper here (Section 1.1 paragraph 2):
> http://www.cs.berkeley.edu/~jfc/papers/13/BIDMach.pdf    If the per-item
> operation is very non-trivial, though, a dramatic GPU speedup may be more
> likely.
>
> Something related (and perhaps easier to contribute to Spark) might be a
> GPU-accelerated sorter for sorting Unsafe records.  Especially since that
> stuff is already broken out somewhat well-- e.g. `UnsafeInMemorySorter`.
> Spark appears to use (single-threaded) Timsort for sorting Unsafe records,
> so I imagine a multi-thread/multi-core GPU solution could handily beat
> that.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Code-generation-for-GPU-tp13954p14030.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>

Re: Code generation for GPU

Posted by Paul Wais <pa...@gmail.com>.
In order to get a major speedup from applying *single-pass* map/filter/reduce
operations on an array in GPU memory, wouldn't you need to stream the
columnar data directly into GPU memory somehow?  You might find in your
experiments that GPU memory allocation is a bottleneck.  See e.g. John
Canny's paper here (Section 1.1 paragraph 2):
http://www.cs.berkeley.edu/~jfc/papers/13/BIDMach.pdf    If the per-item
operation is very non-trivial, though, a dramatic GPU speedup may be more
likely.

Something related (and perhaps easier to contribute to Spark) might be a
GPU-accelerated sorter for sorting Unsafe records.  Especially since that
stuff is already broken out somewhat well-- e.g. `UnsafeInMemorySorter`. 
Spark appears to use (single-threaded) Timsort for sorting Unsafe records,
so I imagine a multi-thread/multi-core GPU solution could handily beat that.
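For flavor, here is a sketch of the sort pattern a GPU implementation would likely use: least-significant-digit radix sort, whose histogram and scatter passes are data-parallel, unlike the sequential merges inside Timsort. (Plain Python, illustrative only; a real version would sort Unsafe record pointers by key prefix.)

```python
def lsd_radix_sort(keys, key_bits=32, radix_bits=8):
    # Stable bucket passes from least to most significant digit; each pass
    # is a counting step plus a scatter, both parallelizable on a GPU.
    radix = 1 << radix_bits
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(radix)]
        for k in keys:
            buckets[(k >> shift) & (radix - 1)].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys

data = [170, 45, 75, 90, 802, 24, 2, 66]
assert lsd_radix_sort(data) == sorted(data)
```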






Re: Code generation for GPU

Posted by lonikar <lo...@gmail.com>.
I am already looking at the dataframes APIs and the implementation. In fact,
the columnar representation
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
is what gave me the idea of my talk proposal. It is ideally suited for
computation on GPU. But from what Reynold said, it appears that the columnar
structure is not exploited for computation like expressions. It appears that
the columnar structure is used only for space efficient in memory storage
and not for computations. Even TungstenProject invokes the operations on a
row-by-row basis. The UnsafeRow is optimized in the sense that it is only
a logical row as opposed to the InternalRow which has physical copies of the
values. But the computation is still on a per row basis rather than batches
of rows stored in columnar structure.

Thanks for some concrete suggestions on presentation. I do have the core
idea or theme of my talk ready in mind, but I will now present on the lines
you suggest. I wasn't really thinking of a demo, but now I will do that. I
was actually hoping to be able to contribute to spark code and show results
on those changes rather than offline changes. I will still try to do that by
hooking into the columnar structure, but it may not be in a shape that can go
into the Spark code. That's what I meant by severely limiting the scope of my
talk.

I have seen a perf improvement of 5-10 times on expression evaluation even
on "ordinary" laptop GPUs. Thus, it will be a good demo along with some
concrete proposals for vectorization. As you said, I will have to hook up to
a column structure and perform computation and let the existing spark
computation also proceed and compare the performance.

I will focus on the slides early (7th Oct is deadline), and then continue
the work for another 3 weeks till the summit. It still gives me enough time
to do considerable work. Hope your fear does not come true.









Re: Code generation for GPU

Posted by Steve Loughran <st...@hortonworks.com>.
On 7 Sep 2015, at 20:44, lonikar <lo...@gmail.com> wrote:


2. If the vectorization is difficult or a major effort, I am not sure how I
am going to implement even a glimpse of the changes I would like to. I think I
will have to be satisfied with only a partial effort. Batching rows defeats the
purpose, as I have found that it consumes a considerable amount of CPU cycles,
and producing one row at a time also takes away the performance benefit.
What's really required is to access a large partition and produce the result
partition in one shot.


why not look at the dataframes APIs and the back-end implementations of things which support it?  The data sources which are columnized from the outset (ORC, parquet) are the ones where vector operations work well: you can read a batch of columns, perform a parallel operation, then repeat.

If you can hook up to a column structure you may get that speedup.


I think I will have to severely limit the scope of my talk in that case. Or
re-orient it to propose the changes instead of presenting the results of
execution on GPU. Please suggest since you seem to have selected the talk.

It is always essential to have the core of your talk ready before you propose the talk - it's something reviewers (nothing to do with me here) mostly expect. Otherwise you are left in a panic three days before, trying to bash together some slides you will have to present to an audience that may include people who know the code better than you. I've been there, and fear I will be there again in 3 weeks' time.

Some general suggestions

  1.  assume the audience knows spark, but not how to code for GPUs: intro that on a slide or two
  2.  cover the bandwidth problem: how much computation is needed before working with the GPU is justified
  3.  Look at the body of work of Hadoop MapReduce & GPUs and the limitations (IO bandwidth, intermediate stage B/W) as well as benefits (perf on CPU workloads, power budget)
  4.  Cover how that's changing: SDDs, in-memory filesystems, whether infiniband would help.
  5.  Try to demo something. It's always nice to show something working at a talk, even if it's just your laptop
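For suggestion 2, a back-of-envelope arithmetic-intensity check is easy to put on a slide. The numbers below are rough 2015-era assumptions of mine, not figures from this thread:

```python
# Before the GPU is compute-bound, each byte transferred over the bus must
# be reused roughly (peak FLOP/s) / (bus bytes/s) times.
pcie_bandwidth = 12e9   # bytes/s, effective PCIe 3.0 x16 (assumed)
gpu_flops = 4e12        # single-precision FLOP/s, mid-range card (assumed)
break_even = gpu_flops / pcie_bandwidth
# break_even is on the order of a few hundred flops per byte: cheap per-row
# expressions stay bus-bound, which is why batching and non-trivial
# operations matter so much.
```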



Re: Code generation for GPU

Posted by lonikar <lo...@gmail.com>.
Hi Reynold,

Thanks for responding. I was waiting for this on the Spark user group and my
own email address, since I had not posted this on spark dev. I just saw your reply.

1. I figured the various code generation classes have either *apply* or
*eval* method depending on whether it computes something or uses expression
as filter. And the code that executes this generated code is in
sql.execution.basicOperators.scala.

2. If the vectorization is difficult or a major effort, I am not sure how I
am going to implement even a glimpse of the changes I would like to. I think I
will have to be satisfied with only a partial effort. Batching rows defeats the
purpose, as I have found that it consumes a considerable amount of CPU cycles,
and producing one row at a time also takes away the performance benefit.
What's really required is to access a large partition and produce the result
partition in one shot.
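The "one shot" point can be sketched with plain Python (hypothetical, not Spark API): per-row invocation pays a call (or, on a GPU, kernel-launch) overhead per element, while handing the whole partition to one bulk call amortizes it:

```python
def eval_per_row(rows, f):
    # One invocation per row: the shape of today's generated code.
    return [f(r) for r in rows]

def eval_partition(rows, bulk_f):
    # One invocation per partition: the shape a single GPU kernel launch needs.
    return bulk_f(rows)

rows = list(range(5))
per_row = eval_per_row(rows, lambda r: r * 2)
one_shot = eval_partition(rows, lambda rs: [r * 2 for r in rs])
# per_row == one_shot == [0, 2, 4, 6, 8]
```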

I think I will have to severely limit the scope of my talk in that case. Or
re-orient it to propose the changes instead of presenting the results of
execution on GPU. Please suggest since you seem to have selected the talk.

3. I agree, it's a pretty high-paced development. I have started working on a
1.5.1 snapshot.

4. How do I tune the batch size (number of rows in the ByteBuffer)? Is it
through the property spark.sql.inMemoryColumnarStorage.batchSize?
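If that property is indeed the right knob (the thread leaves this question unanswered), setting it would look something like the following hypothetical configuration fragment:

```python
# Hypothetical PySpark-style configuration; the property name is taken from
# the question above, and whether it governs the codegen batch is unverified.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
```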

-Kiran



