Posted to dev@lucene.apache.org by "Joshua Mack (JIRA)" <ji...@apache.org> on 2019/07/02 17:22:00 UTC

[jira] [Commented] (LUCENE-7745) Explore GPU acceleration

    [ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877165#comment-16877165 ] 

Joshua Mack commented on LUCENE-7745:
-------------------------------------

Sounds good!

Re: efficient histogram implementation in CUDA

If it helps, [this approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3] has struck a good balance between GPU performance and ease of implementation in work I've done in the past. If academic paywalls block all of those results, it also appears to be available (presumably posted by the authors) on [researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]

The basic idea is to compute sub-histograms in each thread block, with each block accumulating into fast on-chip shared memory. Then, when a thread block finishes its workload, it atomically adds its sub-histogram to global memory, which reduces the overall amount of global-memory traffic.
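To make the data flow concrete, here is a minimal sketch in plain Python rather than CUDA (purely for illustration; the function name and block-size parameter are my own, not from the paper). Each "thread block" builds a private sub-histogram and only the finished sub-histogram is merged into the global one, so global memory sees one batch of adds per block instead of one per input element:

```python
# Sketch of per-block sub-histograms merged into a global histogram.
# Python stands in for CUDA here; on a GPU each block's sub-histogram
# would live in shared memory, and the final merge would use atomicAdd.

def histogram_blocked(data, num_bins, block_size):
    global_hist = [0] * num_bins
    # Each "thread block" processes one contiguous chunk of the input.
    for start in range(0, len(data), block_size):
        sub_hist = [0] * num_bins          # per-block sub-histogram ("shared memory")
        for value in data[start:start + block_size]:
            sub_hist[value] += 1           # cheap local accumulation
        # One round of (atomic) adds to global memory per block, not per element.
        for b in range(num_bins):
            global_hist[b] += sub_hist[b]
    return global_hist
```

The result is identical to a straightforward single-pass histogram; only the memory-traffic pattern differs.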

To increase throughput and reduce shared-memory contention, the main contribution is that they actually keep R "replicated" sub-histograms in each thread block, offset so that bin 0 of the 1st histogram falls into a different memory bank than bin 0 of the 2nd histogram, and so on for all R histograms. Essentially, this improves throughput in the degenerate case where multiple threads try to accumulate into the same histogram bin at the same time.
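A rough sketch of that addressing scheme (again in Python just to show the index arithmetic; the constants below are illustrative choices, not values taken from the paper). Padding each replica's stride by one word rotates which bank its bin 0 lands in, so threads hammering the same popular bin through different replicas hit different banks:

```python
# Hedged sketch of replicated sub-histogram addressing in shared memory.
# NUM_BANKS matches common NVIDIA hardware; NUM_BINS, R, and the padded
# STRIDE are illustrative assumptions, not the paper's exact parameters.

NUM_BANKS = 32          # shared-memory banks on most NVIDIA GPUs
NUM_BINS = 64           # assume a bank-aligned bin count
R = 8                   # number of replicated sub-histograms per block
STRIDE = NUM_BINS + 1   # one word of padding rotates each replica's banks

def replica_address(thread_id, bin_index):
    """Shared-memory word a given thread would increment for a given bin."""
    r = thread_id % R                 # threads are spread across R replicas
    return r * STRIDE + bin_index

def bank(address):
    return address % NUM_BANKS

# Bin 0 of each replica lands in a different bank, so simultaneous
# increments of the same bin by threads on different replicas don't conflict:
banks_for_bin0 = {bank(replica_address(t, 0)) for t in range(R)}
```

At the end of the block the R replicas still have to be reduced per bin before the single atomic add to global memory; that reduction is cheap because it touches only shared memory.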

> Explore GPU acceleration
> ------------------------
>
>                 Key: LUCENE-7745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7745
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>              Labels: gsoc2017, mentor
>         Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations were to be offloaded from CPU to the GPU(s). With commodity GPUs having as high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known to be a good candidate for GPU based speedup (esp. when complex polygons are involved). In the past, Mike McCandless has mentioned that "both initial indexing and merging are CPU/IO intensive, but they are very amenable to soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I volunteer to mentor any GSoC student willing to work on this this summer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org