You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2013/09/22 15:34:44 UTC

Understanding FST Prefix & CheckIndex output

Hi there,
I try to deep dive into the inner LucenePostingFormat to check what might I
do for improving query performance. I'm curious about the termBlock stats
that I get from checkIndex -verbose.

What does the followong mean:
index FST bytes - the FST size, which is the field's partition of the .tip
file?
num of terms - written 2M, although Luke interface shows me 8M, how come?
term / index FST bytes - summing up all my fields bytes doesn't get me
close to the .tim / tip file, how come?
blocks - these are the SUFFIX blocks (.tim files), which are implemented as
Burst Tries, right?
block types - where can I get the info about these different types?

As background, my main performance issue is (random?) read miss IO while
looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
queries, so my optimization is avoiding IO's. That said, is there any
reason getting the right block will require more than <segment_count> IO
(of 4kB)?

Does a certain distribution of prefix length of block types should alarm me
in some way?

field "text_txt"
  index FST:
    18300 nodes
    45779 arc
    583438 bytes
  term:
    2053393 terms
    25597203 bytes (12.5 bytes/term)
  blocks:
    66086 blocks
    51870 terms-only blocks
    47 sub-block-only blocks
    14169 mixed blocks
    13599 floor blocks
    22862 non-floor blocks
    43224 floor sub-blcoks
    18289568 term suffix bytes (276.8 suffix-bytes/block)
    4174480 term stas bytes (63.2 stats-bytes/block)
    7632796 other bytes (115.5 stats-bytes/block)
    by prefix length:
      0: 1
      1: 683
      2: 10782
      3. 17133

etc...

Thanks alot,
Manuel

Re: Understanding FST Prefix & CheckIndex output

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Sun, Sep 22, 2013 at 9:34 AM, Manuel Le Normand
<ma...@gmail.com> wrote:
> Hi there,
> I try to deep dive into the inner LucenePostingFormat to check what might I
> do for improving query performance. I'm curious about the termBlock stats
> that I get from checkIndex -verbose.
>
> What does the followong mean:
> index FST bytes - the FST size, which is the field's partition of the .tip
> file?

Yes.

> num of terms - written 2M, although Luke interface shows me 8M, how come?

This should be the total unique term count for that field; not sure
why Luke disagrees.  Maybe Luke is printing Terms.sumTotalTermFreq =
total number of term occurrences (not unique terms).

> term / index FST bytes - summing up all my fields bytes doesn't get me
> close to the .tim / tip file, how come?

The .tim file also contains the postings metadata (file pointers into
.doc, .pos, skip length, etc.), so .tim file size will be larger than
the terms + index bytes as reported by CheckIndex.

> blocks - these are the SUFFIX blocks (.tim files), which are implemented as
> Burst Tries, right?

Yes; this is basically the number of nodes in the prefix trie, where
each node has a block of terms' suffixes, plus their metadata
(docFreq, totalTermFreq) plus the postings metadata.

> block types - where can I get the info about these different types?

A terms-only block is a "leaf" node in the prefix trie: it has no
child sub-blocks, so all of its entries are terms.

A sub-clock only block is a non-leaf node that has only pointers to
other blocks, i.e. it contains no terms itself.

A mixed block is all other blocks: they contain at least one term and
at least one sub-block.  The total number of blocks is equal to the
sum of these three counts.

A floor block is orthogonal to the above breakdown, and is used when a
single block would contain too many entries.  This is necessary
because the problem of dividing up the terms into blocks of size 25 -
48 terms or sub-blocks is over-constrained.  Since the labels are
bytes, in the worst case you could have 255 entries in a block; when
this happens, we split the block into an initial floor block, followed
by a number of "sub-floor" blocks.  When seeking, the FST points to
each of these floor / sub-floor blocks so e.g. if you are trying to
get to byte 255 at this point, you don't have to scan through the
first 255 entries to get there.

> As background, my main performance issue is (random?) read miss IO while
> looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
> queries, so my optimization is avoiding IO's.

Maybe try the new in-memory terms dict?  It holds all terms + metadata
in RAM so there's no seeking when looking up the term.  But the
postings are still on disk, so seeking is necessary there.  This is
currently trunk only, FSTTerms* under lucene/codec.

You could also try MemoryPostingsFormat, which holds all terms +
postings in RAM.

But you need to have enough RAM vs your index size for the fields, for these :)

> That said, is there any
> reason getting the right block will require more than <segment_count> IO
> (of 4kB)?

In the worst case it's 2 seeks per term: first the FST tells us which
block to load from the .tim file, so we seek + scan to it.  Then,
assuming you need to visit the documents for this term, it's another
seek to get to the right block for the postings, unless there was only
1 doc for this term and you indexed DOCS_ONLY, in which case that
docID is inlined into the term metadata (so only one seek).

If you also need positions, then that will be another seek; if
payloads/offsets, another seek.

> Does a certain distribution of prefix length of block types should alarm me
> in some way?
>
> field "text_txt"
>   index FST:
>     18300 nodes
>     45779 arc
>     583438 bytes
>   term:
>     2053393 terms
>     25597203 bytes (12.5 bytes/term)
>   blocks:
>     66086 blocks
>     51870 terms-only blocks
>     47 sub-block-only blocks
>     14169 mixed blocks
>     13599 floor blocks
>     22862 non-floor blocks
>     43224 floor sub-blcoks
>     18289568 term suffix bytes (276.8 suffix-bytes/block)
>     4174480 term stas bytes (63.2 stats-bytes/block)
>     7632796 other bytes (115.5 stats-bytes/block)
>     by prefix length:
>       0: 1
>       1: 683
>       2: 10782
>       3. 17133
>
> etc...

There are term distributions that are "better", but there should be no
"adversary" case that cripples the performance.

E.g. if you use UIDs, the best performance should be 0-prefixed base-N
ints, where N is over 25 and well below 48, is best, and they are
sequentially assigned.  Random UIDs are worst!

But, no distribution by prefix length should be "alarming" ...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org