Posted to java-user@lucene.apache.org by Vadim Gindin <vg...@detectum.com> on 2017/12/14 09:15:52 UTC

Terminology. LeafReader -> TermEnum -> PostingsEnum

Hi All

I have a question about the API, particularly about the terminology it uses.

1. LeafReader. Why does its name start with "Leaf"? Does it mean that such
a reader is intended for reading only one leaf of the index tree? Does it
mean that it works inside a context (LeafReaderContext) of several
documents "physically" located in that leaf?

2. Our LeafReader is positioned on some document, and reader.terms(field)
will return the list of terms for that single field from the index. Right?

3. LeafReader is a subclass of IndexReader, which has
getTermVectors(int docID). Can I use it in my custom Query (to be aware of
all of a document's fields) instead of terms(field)?

4. In other words, LeafReader contains statistical methods, methods
returning document values, and methods returning terms and postings.
terms() and postings() are the ones intended for search.

3. What is Postings/PostingsEnum? Why does the name start with "Posting"?
My native language is Russian, and I'm a bit confused trying to find the
corresponding meaning of this word in a search context.

5. OK, I see that PostingsEnum extends the base class DocIdSetIterator,
but PostingsEnum is one of approximately 20 implementations of it. Why is
it the one used in LeafReader? What is the principal difference between
these 20 implementations, and which of them are really useful?

Regards,
Vadim Gindin

Re: Terminology. LeafReader -> TermEnum -> PostingsEnum

Posted by Vadim Gindin <vg...@detectum.com>.

I made a mistake in question 5. In fact it is PostingsEnum that has many
implementations, not DocIdSetIterator. Please read question 5 as follows.

5. Should I use a concrete implementation of PostingsEnum? When does that
make sense? Or should I always get a PostingsEnum as the result of a call
to TermsEnum.postings(...)?

I forgot one interesting question.

6. PostingsEnum has the field "AttributeSource atts". It looks like a
connection point with the query Analyzer. Is that true? If so, it could be
very useful for me; what is the appropriate way to use this attribute?
Let's assume that I need to keep some coefficients along with tokens to use
them later in scoring. For example, if the matched token is a synonym, I
could multiply the query score by 0.75.

Regards,
Vadim Gindin

Re: Terminology. LeafReader -> TermEnum -> PostingsEnum

Posted by Mikhail Khludnev <mk...@apache.org>.

Vadim,
I suppose https://vimeo.com/32065505 is a good old explanation of all the
Lucene API dimensions.
It covers most of your questions. FWIW, a leaf is a segment, and postings
are a list of occurrences.
Regarding attributes in postings, IIRC they were only used in some
suggester, but now I can't even find that usage.
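
As a rough illustration of those two terms (a leaf is a segment, postings
are a list of occurrences), here is a minimal toy sketch in plain Java. It
is not Lucene code: ToySegmentedIndex, Segment, Posting and search are
hypothetical names made up for this example, and the tokenization is a
simple whitespace split. The docBase field only mimics the idea behind
LeafReaderContext.docBase: doc IDs inside a segment are local, and adding
the base gives the index-wide ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: every class and method here is hypothetical, not Lucene's API.
final class ToySegmentedIndex {

    /** One posting: an occurrence record for a term in one document. */
    static final class Posting {
        final int localDocId; // doc ID relative to the owning segment
        final int freq;       // how many times the term occurs in that doc

        Posting(int localDocId, int freq) {
            this.localDocId = localDocId;
            this.freq = freq;
        }
    }

    /** A "leaf": one segment, i.e. a small self-contained inverted index. */
    static final class Segment {
        final int docBase; // global ID of this segment's first document
        private int docCount = 0;
        // field -> (sorted term -> postings ordered by local doc ID)
        private final Map<String, TreeMap<String, List<Posting>>> fields = new TreeMap<>();

        Segment(int docBase) {
            this.docBase = docBase;
        }

        int docCount() {
            return docCount;
        }

        /** Indexes one whitespace-tokenized field value; returns the local doc ID. */
        int addDocument(String field, String text) {
            int localDocId = docCount++;
            Map<String, Integer> freqs = new TreeMap<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    freqs.merge(token, 1, Integer::sum);
                }
            }
            TreeMap<String, List<Posting>> terms =
                fields.computeIfAbsent(field, f -> new TreeMap<>());
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                terms.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                     .add(new Posting(localDocId, e.getValue()));
            }
            return localDocId;
        }

        /** Postings list for a term in a field; empty if the term is absent. */
        List<Posting> postings(String field, String term) {
            TreeMap<String, List<Posting>> terms = fields.get(field);
            if (terms == null) {
                return Collections.emptyList();
            }
            List<Posting> list = terms.get(term);
            return list == null ? Collections.<Posting>emptyList() : list;
        }
    }

    /** Visits each segment in turn and maps local doc IDs to global ones. */
    static List<Integer> search(List<Segment> segments, String field, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Segment segment : segments) {
            for (Posting p : segment.postings(field, term)) {
                hits.add(segment.docBase + p.localDocId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Segment first = new Segment(0);
        first.addDocument("body", "lucene index segment");
        first.addDocument("body", "postings list");
        Segment second = new Segment(first.docCount()); // docBase = 2
        second.addDocument("body", "Lucene lucene postings");

        List<Segment> segments = Arrays.asList(first, second);
        System.out.println(search(segments, "body", "lucene"));   // prints [0, 2]
        System.out.println(search(segments, "body", "postings")); // prints [1, 2]
    }
}
```

The point of the sketch is only the shape of the data: a search walks the
segments one by one, and within each segment a term leads to a list of
(document, frequency) records. A real index stores this far more compactly
and exposes it through iterators rather than in-memory lists.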

-- 
Sincerely yours
Mikhail Khludnev