Posted to dev@lucene.apache.org by Steve Casselman <sc...@commacorp.com> on 2016/06/17 22:52:44 UTC

Accelerated Lucene Indexing

Hi Mike. I'm writing code for the Altera OpenCL SDK. I have a code base that
gives me a non-Lucene-format index. I was wondering, in your benchmark, what
kind of data do you collect? Do you collect all the position and frequency
data? I'm also curious about what you see as the biggest bottleneck in
creating an index. Is it building the index from the data, merging the
indexes, or something else? Do you feel the algorithm is CPU-, memory-, or
disk-bound? And finally, do you think there is a market for accelerated
indexing? Say I could quadruple the price/performance yet still produce 100%
Lucene-compatible indexes; would people pay for that?

Thanks

Steve

Re: Accelerated Lucene Indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi Steve,

Lucene on OpenCL sounds neat!

In Lucene's nightly indexing benchmarks (
http://home.apache.org/~mikemccand/lucenebench/indexing.html) I index an
export of Wikipedia's English content, including terms, docIDs, term
frequencies, positions, and also points, doc values, and stored fields.  The
full (messy!) source code is in this repository:
https://github.com/mikemccand/luceneutil.
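
For concreteness, here is a rough sketch (not the luceneutil code itself;
the field names, values, and path are made up) of one document carrying
those same kinds of data through IndexWriter:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexOneDoc {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/wikiindex"));
         IndexWriter writer = new IndexWriter(dir,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // Indexed, tokenized text: yields terms, docIDs, frequencies, positions
      doc.add(new TextField("body", "Some article text ...", Field.Store.NO));
      // Point: a numeric value indexed for fast range queries
      doc.add(new LongPoint("timestamp", 1466200364000L));
      // Doc values: column-stride per-document storage for sorting/faceting
      doc.add(new NumericDocValuesField("timestamp", 1466200364000L));
      // Stored field: kept verbatim so it can be retrieved at search time
      doc.add(new StoredField("title", "Example title"));
      writer.addDocument(doc);
    }
  }
}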

Both initial indexing and merging are CPU/IO intensive, but they are very
amenable to soaking up the hardware's concurrency.
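
For example, IndexWriter is thread-safe, so a sketch like this (field names
and document counts are hypothetical) can feed it from many threads, while
the default ConcurrentMergeScheduler runs merges on background threads:

import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentIndexing {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/index"));
         IndexWriter writer = new IndexWriter(dir,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      int threads = Runtime.getRuntime().availableProcessors();
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      for (int i = 0; i < 100000; i++) {
        final int id = i;
        pool.submit(() -> {
          try {
            Document doc = new Document();
            doc.add(new TextField("body", "text for doc " + id, Field.Store.NO));
            writer.addDocument(doc);  // safe: IndexWriter does its own locking
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      // Segment merges triggered along the way ran concurrently on
      // background threads via the default ConcurrentMergeScheduler.
    }
  }
}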

On whether there's a market, that's beyond my pay grade ;)  I just work on
the bits! Different users care about different things.

Mike McCandless

http://blog.mikemccandless.com
