Posted to dev@lucene.apache.org by "Rinka Singh (JIRA)" <ji...@apache.org> on 2019/07/03 15:57:01 UTC

[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

    [ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877934#comment-16877934 ] 

Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
-------------------------------------------------------------

{quote}The basic idea is to compute sub-histograms in each thread block with each thread block accumulating into the local memory. Then, when each thread block finishes its workload, it atomically adds the result to global memory, reducing the overall amount of traffic to global memory. To increase throughput and reduce shared memory contention, the main contribution here is that they actually use R "replicated" sub-histograms in each thread block, and they offset them so that bin 0 of the 1st histogram falls into a different memory bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially, it improves throughput in the degenerate case where multiple threads are trying to accumulate the same histogram bin at the same time.
{quote}
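For concreteness, here is a minimal CUDA sketch of the replicated sub-histogram idea described above. The kernel name, NBINS, and R are illustrative choices of mine, not from the paper, and the bank-offset padding the paper applies between replicas is omitted for brevity:
{code}
#include <cuda_runtime.h>

constexpr int NBINS = 256; // histogram bins (illustrative)
constexpr int R     = 8;   // replicated sub-histograms per block

__global__ void histogram_replicated(const unsigned char *data, size_t n,
                                     unsigned int *global_hist) {
    // R sub-histograms in shared memory; replica r starts at r * NBINS.
    __shared__ unsigned int smem[R * NBINS];
    for (int i = threadIdx.x; i < R * NBINS; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    // Each thread accumulates into "its" replica, spreading contention
    // when many threads hit the same bin at the same time.
    unsigned int *my_hist = smem + (threadIdx.x % R) * NBINS;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        atomicAdd(&my_hist[data[i]], 1u);
    __syncthreads();

    // Fold the R replicas together and flush once per bin, so global traffic
    // is one atomicAdd per (block, bin) rather than one per input element.
    for (int bin = threadIdx.x; bin < NBINS; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r)
            sum += smem[r * NBINS + bin];
        atomicAdd(&global_hist[bin], sum);
    }
}
{code}
It would be launched as, e.g., histogram_replicated<<<num_blocks, 256>>>(d_data, n, d_hist), with d_hist zeroed beforehand.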
So here's what I've done/am doing:

I have basic histogramming (including stop-word elimination) working on a single GPU (an old Quadro 2000 with 1 GB of memory). I've tested it on a 5 MB text file and it seems to be working OK.

Briefly, here is how I'm implementing it.

Read a file in from the command line (Linux executable) into the GPU, then:
 * convert the stream into words and chunk them into blocks
 * eliminate the stop words
 * sort/merge (including word counts) everything, first inside a block and then across blocks - I came up with my own sort; I haven't had the time to explore the parallel sorts out there (see the Thrust sketch after this list for one standard way to express this step)
 * This results in a sorted histogram held in multiple blocks in the GPU.
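To make the sort/merge step concrete, here is one off-the-shelf way to express the per-block word count with Thrust. It assumes tokens have already been hashed to 64-bit word IDs on the device; that hashing, and my own sort, are not shown:
{code}
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Build a (word-id, count) histogram for one block of tokens.
// Assumes tokens were already hashed to 64-bit IDs (hypothetical input).
void count_words(thrust::device_vector<unsigned long long> &ids,
                 thrust::device_vector<unsigned long long> &uniq,
                 thrust::device_vector<unsigned int> &counts) {
    thrust::sort(ids.begin(), ids.end());  // sort within the block
    uniq.resize(ids.size());
    counts.resize(ids.size());
    // Collapse runs of equal IDs into (id, count) pairs.
    auto ends = thrust::reduce_by_key(
        ids.begin(), ids.end(),
        thrust::make_constant_iterator(1u),
        uniq.begin(), counts.begin());
    uniq.resize(ends.first - uniq.begin());
    counts.resize(ends.second - counts.begin());
}
{code}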

The advantages of this approach (to my mind) are:
 * I can scale up to use the entire GPU memory. My guess is I can create and manage an 8-10 GB index on a V100 (it has 32 GB) - like I said, I've only tested with a 5 MB text file so far.
 * It is easy to add fresh data to the existing histogram. All I need to do is create new blocks and sort/merge them all (a sketch of this merge follows the list).
 * I'm guessing this should also make it easy to scale across GPUs, which means that on a multi-GPU machine I can scale to almost the full number of GPUs, and then of course one can set up a cluster of such machines... This is far in the future though...
 * The sort is kept separate so we can experiment with various sorts and see which one performs best.
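On the fresh-data point, a hedged sketch of the merge using Thrust, assuming both the existing histogram and the new block are sorted (id, count) arrays; the names and layout are illustrative, not from my actual implementation:
{code}
#include <thrust/device_vector.h>
#include <thrust/merge.h>
#include <thrust/reduce.h>

// Merge a new sorted (id, count) block into an existing sorted histogram.
void merge_histograms(const thrust::device_vector<unsigned long long> &a_ids,
                      const thrust::device_vector<unsigned int> &a_cnt,
                      const thrust::device_vector<unsigned long long> &b_ids,
                      const thrust::device_vector<unsigned int> &b_cnt,
                      thrust::device_vector<unsigned long long> &out_ids,
                      thrust::device_vector<unsigned int> &out_cnt) {
    thrust::device_vector<unsigned long long> ids(a_ids.size() + b_ids.size());
    thrust::device_vector<unsigned int> cnt(ids.size());
    // Interleave the two sorted key streams, carrying counts along.
    thrust::merge_by_key(a_ids.begin(), a_ids.end(),
                         b_ids.begin(), b_ids.end(),
                         a_cnt.begin(), b_cnt.begin(),
                         ids.begin(), cnt.begin());
    out_ids.resize(ids.size());
    out_cnt.resize(ids.size());
    // Fold counts for word IDs that appear in both inputs.
    auto ends = thrust::reduce_by_key(ids.begin(), ids.end(), cnt.begin(),
                                      out_ids.begin(), out_cnt.begin());
    out_ids.resize(ends.first - out_ids.begin());
    out_cnt.resize(ends.second - out_cnt.begin());
}
{code}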

 

The issues are:
 * It is currently horrendously slow (I use global memory all the way, with no optimization) - well, OK, much too slow for my liking. (I went over to Nvidia's office and tested it on a K80, and it was only about twice as fast as on my GPU.) I'm currently implementing a shared-memory version (and a few other tweaks) that should speed it up.
 * I have yet to compare it with the histogramming tools out there, so I cannot say how much better it is. Once I have the basic inverted index in place, I'll reach out to you all for testing.
 * It is still a bit fragile - I'm still finding bugs as I test - but the basics work.

 

Currently in process:
 * The code has been modified for (some) performance; I am debugging/testing, and it will take a while. As of now I feel good about what I've done, but I won't know until I get it working and test for performance.
 * I need to add the ability to handle multiple files. (I think I will postpone this, since one can always cat the files together and pass that in - a pretty simple script wrapped around the executable.)
 * I need to create the inverted index.
 * We'll worry about searching on the index later; that should be pretty trivial - well, actually, nothing is trivial here (a sketch of what a lookup could look like follows).
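For what the eventual lookup might look like: since the histogram blocks are kept sorted, a term query can be a plain binary search. A minimal sketch against a single sorted (id, count) array, again assuming hashed word IDs (my illustration, not the actual code):
{code}
#include <thrust/device_vector.h>
#include <thrust/binary_search.h>

// Look up one term's count in a sorted (id, count) histogram block.
// Returns 0 if the word ID is absent. Hypothetical helper, for illustration.
unsigned int lookup(const thrust::device_vector<unsigned long long> &ids,
                    const thrust::device_vector<unsigned int> &counts,
                    unsigned long long word_id) {
    auto it = thrust::lower_bound(ids.begin(), ids.end(), word_id);
    if (it == ids.end() || *it != word_id)
        return 0;
    return counts[it - ids.begin()];
}
{code}
A multi-block index would repeat this per block (or keep a small directory of block key ranges), but that is exactly the part I haven't built yet.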

 
{quote}Re: efficient histogram implementation in CUDA

If it helps, [this approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3] has been good for a balance between GPU performance and ease of implementation for work I've done in the past. If academic paywalls block you for all those results, it looks to also be available (presumably by the authors) on [researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]
{quote}
 Took a quick look - the Scholar results are all paywalled. I will take a look at the ResearchGate copy sometime.

I apologize, but I may not be very responsive over the next month or so: we are in the middle of a release at work, and this is only my night-time job.



> Explore GPU acceleration
> ------------------------
>
>                 Key: LUCENE-7745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7745
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>              Labels: gsoc2017, mentor
>         Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be sped up if computations were offloaded from the CPU to the GPU(s). With commodity GPUs having as much as 12 GB of high-bandwidth RAM, we might be able to leverage GPUs to speed up parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known to be a good candidate for GPU based speedup (esp. when complex polygons are involved). In the past, Mike McCandless has mentioned that "both initial indexing and merging are CPU/IO intensive, but they are very amenable to soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I volunteer to mentor any GSoC student willing to work on this over the summer.



