Posted to issues@lucene.apache.org by "Rinka Singh (Jira)" <ji...@apache.org> on 2020/02/01 18:14:00 UTC

[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

    [ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028105#comment-17028105 ] 

Rinka Singh edited comment on LUCENE-7745 at 2/1/20 6:13 PM:
-------------------------------------------------------------

Another update.

Sorry about the delay.  It took me SIGNIFICANTLY longer than I anticipated (multiple race conditions, debugging on the GPU, and so on), but I think I have the histogram running:
 * Created a sorted histogram with word counts - the largest file tested is about 5 MB (~436K words). A minimal sketch of the counting step appears after this list.
 * The histogram also computes some statistics (median, mean, standard deviation, and so on), but I haven't optimized that part and it is horrendously slow on large inputs, so I commented it out for now; it isn't important anyway.
 * Applied stop words - my stop-word list has ~4.4K words.
 * Performance (this is a debug compile and the numbers are from runs under gdb):
 ** Quadro 2000 GPU (192 cores) + Intel dual-core CPU + Kubuntu 14.04, CUDA 7.5: 765 sec
 ** GeForce GTX 780 (2304 cores) on an i5, Kubuntu 16.04, CUDA 8.0: 640 sec.
 * I have done some performance optimization (register usage and shared memory), but there's a lot more that can be done.  I suspect I can bump the speed up by at least 5x, if not more.
 ** Applying the stop words can be optimized further, but I assume that is not so critical since the index will be updated infrequently.  At this point it is in the code path and contributes to the GTX 780 time above.
 ** Algorithm optimization can give quite a bit of bang for the buck.
 ** I :) "invented" my own sort (a parallel version of a selection sort within and across multiple chunks).  I'd need to do more experimenting here; a generic sketch of the per-chunk idea appears after this list.
 ** :) I have yet to build and test a release (production) binary.  Testing was exclusively with a debug build running under gdb, so the timings above should improve noticeably with an optimized build.
 ** The whole thing is sequential at a high level: read the file into the GPU, break it into chunks, apply stop words, sort.  This can be parallelized significantly - reading data and sorting can overlap (see the streams sketch after this list).
 * I also don't have access to high-end GPUs (like a V100 [https://www.nvidia.com/en-us/data-center/v100/], for example). A high-end GPU should give another significant performance boost over and above the optimization I can do.  To give you an idea, about a year ago I ran a very old version of this on a K80-based machine and saw something like a 2x boost.
 * Going forward, I will find it VERY difficult to commit to timelines, as everything seems to take me something like 7-10x the time it would have taken on a CPU.  The reasons are many:
 ** GPU development is inherently much, much slower - thinking the design through takes at least 3-4x more time.  I dumped many alternative designs halfway through development (something that would never happen to me in CPU-based development).
 ** Debugging is SIGNIFICANTLY slower, despite the CUDA tools.
 ** Race conditions have bitten me multiple times, and each time I lost weeks or even months.
 ** And finally, of course, there is my own limitation: transitioning back into being a developer.  It took me quite a while, and I am still not at the level I was at as a developer with 10 years of experience (so long ago).
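
To make the counting step concrete, here is a minimal, hedged sketch of the kind of kernel involved (illustrative only - the names are not from the actual code, and it assumes the host has already tokenized the text and mapped each word to a slot index):

{code}
// Sketch only: one thread per token; each thread atomically bumps the
// count for its word's slot. Assumes wordSlots holds a precomputed
// slot index per token and counts is zero-initialized.
__global__ void countWords(const int *wordSlots, int numTokens, int *counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numTokens) {
        // atomicAdd avoids the race when many threads hit the same word
        atomicAdd(&counts[wordSlots[i]], 1);
    }
}

// Host-side launch, assuming the device buffers are already populated:
//   int threads = 256;
//   int blocks  = (numTokens + threads - 1) / threads;
//   countWords<<<blocks, threads>>>(d_wordSlots, numTokens, d_counts);
{code}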
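
Since my chunked sort isn't written up yet, the sketch below shows only the generic idea of parallelizing a selection-style sort within one chunk (a rank sort): each thread computes the final position of one key by counting how many keys precede it. This is illustrative, not the actual algorithm:

{code}
// Rank sort: a parallel cousin of selection sort. O(n^2) comparisons,
// but every thread works independently and writes to a distinct slot.
// Ties are broken by index so equal keys keep a stable order.
__global__ void rankSortChunk(const int *keys, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int key = keys[i];
    int rank = 0;
    for (int j = 0; j < n; ++j) {
        if (keys[j] < key || (keys[j] == key && j < i))
            ++rank;
    }
    out[rank] = key;  // unique rank per thread, so no write conflicts
}
{code}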
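
And here is roughly what I mean by overlapping reads with sorting - a hedged sketch using two CUDA streams so the next chunk uploads while the current one is processed (processChunk is a hypothetical stand-in for the per-chunk work; assumes pinned host buffers allocated with cudaMallocHost):

{code}
// Double-buffered pipeline: copy chunk c+1 while chunk c is being processed.
cudaStream_t s[2];
for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

for (int c = 0; c < numChunks; ++c) {
    int k = c % 2;
    int threads = 256;
    int blocks  = (wordsPerChunk + threads - 1) / threads;
    // The async copy and the kernel launch share a stream, so they stay
    // ordered per chunk, while the two streams overlap with each other.
    cudaMemcpyAsync(d_chunk[k], h_chunk[c], chunkBytes,
                    cudaMemcpyHostToDevice, s[k]);
    processChunk<<<blocks, threads, 0, s[k]>>>(d_chunk[k], wordsPerChunk);
}
cudaDeviceSynchronize();
for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
{code}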

I'll release the code on my GitHub in a day or two, with instructions on how to compile & run it (I'll include both the data & the stop-word files). I'll put the link here.
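
In the meantime, the gap between the debug timings above and a release build comes down to the nvcc flags (hypothetical file names, but the flags are standard):

{code}
# Debug build (what the timings above were measured with): -G disables
# device-side optimization so cuda-gdb can step through kernels.
nvcc -G -g -o histogram_dbg histogram.cu

# Release build: device code is optimized once -G is dropped; -O3
# optimizes the host-side code.
nvcc -O3 -o histogram histogram.cu
{code}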

*I'd love to hear how these numbers compare to running an equivalent histogram on a CPU cluster.  Could someone please run this and let me know?*  Also, *if someone can provide me with a V100-based instance* (even an AWS instance is fine), I can run it there (as is) and generate some numbers.

Underlying assumption: the code is working correctly (this may be a bad assumption, since I have done just enough testing to process one file - the code handles only a small set of boundary conditions, and it is not something I would deploy at this point). I was more focused on getting it out than on extensive testing.

Next steps:

I'll start updating the code to:
 * Put this on GitHub and do some more measurements.
 * Implement an inverted index for a single file (a rough layout sketch follows this list).
 * Extend that to multiple files.
 * Finally, set it up so that you can send queries to this inverted index running on the GPU...
 * :) And testing, of course.
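
To make the feedback concrete, here is a rough, purely illustrative sketch of one GPU-friendly way an inverted index could be laid out (a CSR-style postings structure - nothing like this exists in the code yet, and all names are hypothetical):

{code}
// Illustrative only - not implemented. A CSR-style layout keeps all
// postings in one flat array indexed per term, which suits coalesced reads.
struct GpuInvertedIndex {
    int *termOffsets;   // [numTerms + 1]: start of each term's postings
    int *postings;      // [totalPostings]: doc ids, grouped by term
    int  numTerms;
};

// One thread per term: document frequency = size of the postings slice.
__global__ void docFreq(GpuInvertedIndex idx, int *df)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < idx.numTerms)
        df[t] = idx.termOffsets[t + 1] - idx.termOffsets[t];
}
{code}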

*But I'd like feedback on this* before going down this rabbit hole.

Also, once I have an initial inverted index in place, I'd love to have contributors.  There is a lot that even a CPU developer can contribute - one look at the code will tell you.  *If you can help, please do reach out to me.* One major advantage for contributors will be learning GPU programming - and trust me on this, it's a completely different ball game.



> Explore GPU acceleration
> ------------------------
>
>                 Key: LUCENE-7745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7745
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>              Labels: gsoc2017, mentor
>         Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be sped up if computations were to be offloaded from CPU to the GPU(s). With commodity GPUs having as high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known to be a good candidate for GPU based speedup (esp. when complex polygons are involved). In the past, Mike McCandless has mentioned that "both initial indexing and merging are CPU/IO intensive, but they are very amenable to soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I volunteer to mentor any GSoC student willing to work on this this summer.


