You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2013/09/22 15:34:44 UTC
Understanding FST Prefix & CheckIndex output
Hi there,
I try to deep dive into the inner LucenePostingFormat to check what might I
do for improving query performance. I'm curious about the termBlock stats
that I get from checkIndex -verbose.
What does the followong mean:
index FST bytes - the FST size, which is the field's partition of the .tip
file?
num of terms - written 2M, although Luke interface shows me 8M, how come?
term / index FST bytes - summing up all my fields bytes doesn't get me
close to the .tim / tip file, how come?
blocks - these are the SUFFIX blocks (.tim files), which are implemented as
Burst Tries, right?
block types - where can I get the info about these different types?
As background, my main performance issue is (random?) read miss IO while
looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
queries, so my optimization is avoiding IO's. That said, is there any
reason getting the right block will require more than <segment_count> IO
(of 4kB)?
Does a certain distribution of prefix length of block types should alarm me
in some way?
field "text_txt"
index FST:
18300 nodes
45779 arc
583438 bytes
term:
2053393 terms
25597203 bytes (12.5 bytes/term)
blocks:
66086 blocks
51870 terms-only blocks
47 sub-block-only blocks
14169 mixed blocks
13599 floor blocks
22862 non-floor blocks
43224 floor sub-blcoks
18289568 term suffix bytes (276.8 suffix-bytes/block)
4174480 term stas bytes (63.2 stats-bytes/block)
7632796 other bytes (115.5 stats-bytes/block)
by prefix length:
0: 1
1: 683
2: 10782
3. 17133
etc...
Thanks alot,
Manuel
Re: Understanding FST Prefix & CheckIndex output
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Sep 22, 2013 at 9:34 AM, Manuel Le Normand
<ma...@gmail.com> wrote:
> Hi there,
> I try to deep dive into the inner LucenePostingFormat to check what might I
> do for improving query performance. I'm curious about the termBlock stats
> that I get from checkIndex -verbose.
>
> What does the followong mean:
> index FST bytes - the FST size, which is the field's partition of the .tip
> file?
Yes.
> num of terms - written 2M, although Luke interface shows me 8M, how come?
This should be the total unique term count for that field; not sure
why Luke disagrees. Maybe Luke is printing Terms.sumTotalTermFreq =
total number of term occurrences (not unique terms).
> term / index FST bytes - summing up all my fields bytes doesn't get me
> close to the .tim / tip file, how come?
The .tim file also contains the postings metadata (file pointers into
.doc, .pos, skip length, etc.), so .tim file size will be larger than
the terms + index bytes as reported by CheckIndex.
> blocks - these are the SUFFIX blocks (.tim files), which are implemented as
> Burst Tries, right?
Yes; this is basically the number of nodes in the prefix trie, where
each node has a block of terms' suffixes, plus their metadata
(docFreq, totalTermFreq) plus the postings metadata.
> block types - where can I get the info about these different types?
A terms-only block is a "leaf" node in the prefix trie: it has no
child sub-blocks, so all of its entries are terms.
A sub-clock only block is a non-leaf node that has only pointers to
other blocks, i.e. it contains no terms itself.
A mixed block is all other blocks: they contain at least one term and
at least one sub-block. The total number of blocks is equal to the
sum of these three counts.
A floor block is orthogonal to the above breakdown, and is used when a
single block would contain too many entries. This is necessary
because the problem of dividing up the terms into blocks of size 25 -
48 terms or sub-blocks is over-constrained. Since the labels are
bytes, in the worst case you could have 255 entries in a block; when
this happens, we split the block into an initial floor block, followed
by a number of "sub-floor" blocks. When seeking, the FST points to
each of these floor / sub-floor blocks so e.g. if you are trying to
get to byte 255 at this point, you don't have to scan through the
first 255 entries to get there.
> As background, my main performance issue is (random?) read miss IO while
> looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
> queries, so my optimization is avoiding IO's.
Maybe try the new in-memory terms dict? It holds all terms + metadata
in RAM so there's no seeking when looking up the term. But the
postings are still on disk, so seeking is necessary there. This is
currently trunk only, FSTTerms* under lucene/codec.
You could also try MemoryPostingsFormat, which holds all terms +
postings in RAM.
But you need to have enough RAM vs your index size for the fields, for these :)
> That said, is there any
> reason getting the right block will require more than <segment_count> IO
> (of 4kB)?
In the worst case it's 2 seeks per term: first the FST tells us which
block to load from the .tim file, so we seek + scan to it. Then,
assuming you need to visit the documents for this term, it's another
seek to get to the right block for the postings, unless there was only
1 doc for this term and you indexed DOCS_ONLY, in which case that
docID is inlined into the term metadata (so only one seek).
If you also need positions, then that will be another seek; if
payloads/offsets, another seek.
> Does a certain distribution of prefix length of block types should alarm me
> in some way?
>
> field "text_txt"
> index FST:
> 18300 nodes
> 45779 arc
> 583438 bytes
> term:
> 2053393 terms
> 25597203 bytes (12.5 bytes/term)
> blocks:
> 66086 blocks
> 51870 terms-only blocks
> 47 sub-block-only blocks
> 14169 mixed blocks
> 13599 floor blocks
> 22862 non-floor blocks
> 43224 floor sub-blcoks
> 18289568 term suffix bytes (276.8 suffix-bytes/block)
> 4174480 term stas bytes (63.2 stats-bytes/block)
> 7632796 other bytes (115.5 stats-bytes/block)
> by prefix length:
> 0: 1
> 1: 683
> 2: 10782
> 3. 17133
>
> etc...
There are term distributions that are "better", but there should be no
"adversary" case that cripples the performance.
E.g. if you use UIDs, the best performance should be 0-prefixed base-N
ints, where N is over 25 and well below 48, is best, and they are
sequentially assigned. Random UIDs are worst!
But, no distribution by prefix length should be "alarming" ...
Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org