You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/10/05 00:42:15 UTC

Flex indexing : Hybrid index maintnenance for faster indexing

Hi all,

Would it be possible to implement something like this in Flex?


Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8  

The approach takes advantage of having a different policy for large postings lists (ie frequent terms)  versus small postings lists for flushing the buffer and writing to disk.


Tom Burton-West

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Flex indexing : Hybrid index maintnenance for faster indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.

I'm also very curious how your index will be affected by upgrading to flex.

I know the terms index gets alot smaller :)  But I'm curious about the
query performance...

I haven't seen that paper -- it sounds very interesting!  I'll add to
the list to read...

Mike

On Tue, Oct 5, 2010 at 11:12 AM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks Mike,
>
> I suspected the approach might require architectural changes beyond flex, but since our indexes are so huge and disk I/O is our main bottleneck both for searching and indexing, I'm always looking for ways to deal with very large postings and positions lists that might reduce I/O.
>
> I haven't looked in detail into PFOR and Simple9 and some of the other new encodings, but my understanding is that they trade off compression for decompression speed. i.e. they take up a bit more space, but are more efficient to decompress.   In our case, where we have underutilized CPU, mostly because the processors are waiting on disk I/O, I'll be curious to find out whether the slight increase in disk I/O time due to lower compression is still outweighed by the increase in decompression speed. (Don't know if we'll find the time to try flex for a while though:)
>
>
> BTW: have you seen this paper looking at 64-bit words?
>  "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice & Experience, 40(2):131-148, February 2010
>
>
> Tom
> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Tuesday, October 05, 2010 6:21 AM
> To: dev@lucene.apache.org
> Subject: Re: Flex indexing : Hybrid index maintnenance for faster indexing
>
> Nice paper!
>
> It's a neat trick to index the large postings as separate files, ie
> let the fileystem handle the growth as new postings are appended
> over time.
>
> But, unfortunately, we can't easily do this in Lucene, since Lucene
> assumes index files are write once, and derives its transactional
> semantics from this approach.  Ie, this would require sizable changes,
> beyond just swapping in a different Codec.
>
> Still, the idea that small/big postings lists should be handled
> differently is something we can take advantage of in a Codec, and I
> think we should.  I think likely we will switch to a default codec
> that uses pulsing (storing term's postiugs directly in terms dict) for
> very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
> for high freq terms.
>
> Mike
>
> On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tb...@umich.edu> wrote:
>> Hi all,
>>
>> Would it be possible to implement something like this in Flex?
>>
>>
>> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>>
>> The approach takes advantage of having a different policy for large postings lists (ie frequent terms)  versus small postings lists for flushing the buffer and writing to disk.
>>
>>
>> Tom Burton-West
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: Flex indexing : Hybrid index maintnenance for faster indexing

Posted by "Burton-West, Tom" <tb...@umich.edu>.

Thanks Mike,

I suspected the approach might require architectural changes beyond flex, but since our indexes are so huge and disk I/O is our main bottleneck both for searching and indexing, I'm always looking for ways to deal with very large postings and positions lists that might reduce I/O.

I haven't looked in detail into PFOR and Simple9 and some of the other new encodings, but my understanding is that they trade off compression for decompression speed. i.e. they take up a bit more space, but are more efficient to decompress.   In our case, where we have underutilized CPU, mostly because the processors are waiting on disk I/O, I'll be curious to find out whether the slight increase in disk I/O time due to lower compression is still outweighed by the increase in decompression speed. (Don't know if we'll find the time to try flex for a while though:)

BTW: have you seen this paper looking at 64-bit words?
 "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice & Experience, 40(2):131-148, February 2010

Tom 
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Tuesday, October 05, 2010 6:21 AM
To: dev@lucene.apache.org
Subject: Re: Flex indexing : Hybrid index maintnenance for faster indexing

Nice paper!

It's a neat trick to index the large postings as separate files, ie
let the fileystem handle the growth as new postings are appended
over time.

But, unfortunately, we can't easily do this in Lucene, since Lucene
assumes index files are write once, and derives its transactional
semantics from this approach.  Ie, this would require sizable changes,
beyond just swapping in a different Codec.

Still, the idea that small/big postings lists should be handled
differently is something we can take advantage of in a Codec, and I
think we should.  I think likely we will switch to a default codec
that uses pulsing (storing term's postiugs directly in terms dict) for
very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
for high freq terms.

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi all,
>
> Would it be possible to implement something like this in Flex?
>
>
> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>
> The approach takes advantage of having a different policy for large postings lists (ie frequent terms)  versus small postings lists for flushing the buffer and writing to disk.
>
>
> Tom Burton-West
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Flex indexing : Hybrid index maintnenance for faster indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.

Nice paper!

It's a neat trick to index the large postings as separate files, ie
let the fileystem handle the growth as new postings are appended
over time.

But, unfortunately, we can't easily do this in Lucene, since Lucene
assumes index files are write once, and derives its transactional
semantics from this approach.  Ie, this would require sizable changes,
beyond just swapping in a different Codec.

Still, the idea that small/big postings lists should be handled
differently is something we can take advantage of in a Codec, and I
think we should.  I think likely we will switch to a default codec
that uses pulsing (storing term's postiugs directly in terms dict) for
very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
for high freq terms.

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi all,
>
> Would it be possible to implement something like this in Flex?
>
>
> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8
>
> The approach takes advantage of having a different policy for large postings lists (ie frequent terms)  versus small postings lists for flushing the buffer and writing to disk.
>
>
> Tom Burton-West
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org