Posted to java-user@lucene.apache.org by Ian Vink <ia...@gmail.com> on 2009/05/11 17:35:08 UTC

IndexReader.Terms - internals

            IndexReader rdr = IndexReader.Open(myFolder);
            TermEnum terms = rdr.Terms((new Term(myTermName, "")));

(from .NET land, but it's all the same)

This code works great and I can loop through the terms nicely, but after it
returns all the myTermName terms, it carries on into the terms of all the
other fields.

Is there a way to limit rdr.Terms so that it returns only the terms whose
field is myTermName?

Re: IndexReader.Terms - internals

Posted by Ian Vink <ia...@gmail.com>.
Thanks guys,
Here's what I built:

http://BahaiResearch.com

It allows any language speaker to read about another person's religion in
any language. Helps promote unity in diversity. It's open source.

Ian




RE: IndexReader.Terms - internals

Posted by Uwe Schindler <uw...@thetaphi.de>.
No, there is no other way to do this. And if you are worried that the TermEnum
takes too much RAM because it returns all terms, including those of other
fields: you can be sure that no memory is wasted. The term enum never
allocates the whole term list; like a normal Java iterator, it hands back one
term at a time. It iterates on disk and loads the terms from there (this is
why it throws IOException).

The reason behind this behaviour is simple:
IR.terms(term) returns all terms >= the given term (see javadoc), not all
terms of a specific field. Terms are ordered by field name and then by text,
so at first it looks as if the TermEnum only returned terms of the requested
field. One special case:
If the field name does not exist in the index, IR.terms(term) is still
positioned on the first term >= the given one, but as the field does not
exist, that is the first term of the alphabetically next field name.

So in general you stop iterating when no more terms are available or the
field name of the current term != the requested field. Almost all internal
algorithms inside Lucene (PrefixQuery, RangeQuery, ...) work this way!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
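
The ordering and the stopping rule described in this reply can be tried out
without an index. The sketch below is only a toy model and not the Lucene
API: the class TermOrderSketch, its methods, and the sample terms are
invented for illustration, and a sorted set of {field, text} pairs stands in
for the on-disk term dictionary. It assumes plain String.compareTo ordering
for both field names and term text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// A toy model of the term dictionary, not Lucene code: each entry is a
// {field, text} pair, kept sorted by field name and then by text.
class TermOrderSketch {

    static int compareTerms(String[] a, String[] b) {
        int byField = a[0].compareTo(b[0]);
        return byField != 0 ? byField : a[1].compareTo(b[1]);
    }

    static TreeSet<String[]> index(String[]... terms) {
        TreeSet<String[]> sorted = new TreeSet<String[]>(TermOrderSketch::compareTerms);
        sorted.addAll(Arrays.asList(terms));
        return sorted;
    }

    // Mimics terms(new Term(field, "")): seek to the first entry >= (field, ""),
    // then stop as soon as the field name changes.
    static List<String> textsOfField(TreeSet<String[]> index, String field) {
        List<String> texts = new ArrayList<String>();
        for (String[] term : index.tailSet(new String[] {field, ""}, true)) {
            if (!term[0].equals(field)) {
                break; // crossed into the next field: done
            }
            texts.add(term[1]);
        }
        return texts;
    }

    // The entry a seek lands on, as "field:text", or null past the last term.
    static String firstAtOrAfter(TreeSet<String[]> index, String field) {
        String[] hit = index.ceiling(new String[] {field, ""});
        return hit == null ? null : hit[0] + ":" + hit[1];
    }

    public static void main(String[] args) {
        TreeSet<String[]> idx = index(
                new String[] {"title", "lucene"},
                new String[] {"author", "smith"},
                new String[] {"title", "index"},
                new String[] {"author", "jones"});
        System.out.println(textsOfField(idx, "author")); // [jones, smith]
        // "body" is not a field here, so the seek lands on the next field.
        System.out.println(firstAtOrAfter(idx, "body")); // title:index
        System.out.println(textsOfField(idx, "body"));   // []
    }
}
```

The "body" case is the special case mentioned above: the seek succeeds, but
the very first term already belongs to another field, so the field check has
to run before the first term is used, not only after each call to next().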


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: IndexReader.Terms - internals

Posted by David Causse <dc...@spotter.com>.
Hi,
We noticed this behaviour too, so we do it like this:

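Map<Term, Integer> result = new HashMap<Term, Integer>();
// Seek to the first term >= (field, prefix); an empty text scans the whole field.
TermEnum all = matcher.fullScan()
        ? reader.terms(new Term(field))
        : reader.terms(new Term(field, matcher.prefix()));
if (all == null) return result;
try {
        do {
                Term t = all.term();
                // Stop at the end of the enum or at the first term of another field.
                // Term interns its field name, so != only works if 'field' is interned too.
                if (t == null || t.field() != field) break;
                // Stop once the current term no longer starts with the prefix.
                if (!matcher.fullScan() && !t.text().startsWith(matcher.prefix())) break;
                if (matcher.match(t.text()))
                        result.put(t, all.docFreq());
        } while (all.next());
} finally {
        all.close();
}
return result;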

matcher is an application-level object designed to match complex words. So
we loop on the TermEnum until we consider that we have reached the end of
the interesting information.
To summarize: you stop the loop when
1. there is no more data in the TermEnum
2. the field is not the same (remember to intern the field String if it
comes from outside, because the field names are compared with ==)
3. you have reached non-matching Terms, detected by checking a prefix.

If there is a better way to do this I'd be glad to hear of it.

David.
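
The three stop conditions summarized in this message can be exercised on
their own. The sketch below is a toy model under stated assumptions and not
Lucene code: PrefixScanSketch, its scan method, and the sample data are
hypothetical, and a sorted map from {field, text} pairs to document
frequencies stands in for TermEnum and docFreq().

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for a term dictionary with document frequencies; not Lucene code.
class PrefixScanSketch {

    // Keys are {field, text} pairs sorted by field name, then by text.
    static final Comparator<String[]> ORDER = (a, b) -> {
        int byField = a[0].compareTo(b[0]);
        return byField != 0 ? byField : a[1].compareTo(b[1]);
    };

    // Collects text -> docFreq for every term of 'field' that starts with 'prefix'.
    static Map<String, Integer> scan(TreeMap<String[], Integer> dict, String field, String prefix) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        // Seek to the first term >= (field, prefix).
        String[] start = new String[] {field, prefix};
        for (Map.Entry<String[], Integer> e : dict.tailMap(start, true).entrySet()) {
            String[] term = e.getKey();
            if (!term[0].equals(field)) break;      // condition 2: another field
            if (!term[1].startsWith(prefix)) break; // condition 3: prefix no longer matches
            result.put(term[1], e.getValue());
        }
        return result; // condition 1: the loop also ends when the data runs out
    }

    public static void main(String[] args) {
        TreeMap<String[], Integer> dict = new TreeMap<String[], Integer>(ORDER);
        dict.put(new String[] {"body", "apple"}, 3);
        dict.put(new String[] {"body", "apricot"}, 1);
        dict.put(new String[] {"body", "banana"}, 2);
        dict.put(new String[] {"title", "apex"}, 5);
        System.out.println(scan(dict, "body", "ap")); // stops at "banana" (condition 3)
        System.out.println(scan(dict, "body", "b"));  // stops at title:apex (condition 2)
        System.out.println(scan(dict, "title", ""));  // runs to the end (condition 1)
    }
}
```

This model compares field names with equals(), which sidesteps the interning
requirement of condition 2 at the cost of a character-by-character comparison
on every term.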



Re: IndexReader.Terms - internals

Posted by Ian Lea <ia...@gmail.com>.
I believe not. Just break out of the loop when term.field() != myTermName,
with myTermName interned.


--
Ian.
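
Why the field name has to be interned for a != comparison can be shown with
plain Java strings. This is a minimal sketch with no Lucene classes involved;
InternSketch and sameReference are names made up for the example, and
new String(...) stands in for a field name that arrives from outside, such as
user input or a configuration file.

```java
// Plain-Java illustration of reference comparison on strings; no Lucene involved.
class InternSketch {

    // Reference comparison, the kind used in term.field() != myTermName.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "title";                 // string literals are interned by the JVM
        String fromOutside = new String("title"); // equal content, but a distinct object

        System.out.println(sameReference(literal, fromOutside));          // false
        System.out.println(sameReference(literal, fromOutside.intern())); // true
        System.out.println(literal.equals(fromOutside));                  // true
    }
}
```

Without the intern() call, a reference comparison reports a mismatch on the
very first term, so a loop guarded by it would exit immediately and return
nothing.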

