Posted to java-user@lucene.apache.org by Ian Vink <ia...@gmail.com> on 2009/05/10 17:03:43 UTC

Distinct terms values? (like in Luke)

I have tagged each of my documents with a term "religion" and values like
"Baha'i, Christian, Jewish, Islam" etc.
In Luke it shows me that I have a term count of 8 for the term "religion".

How do I get a list of the 8 distinct values for the term religion from an
index?

Ian

RE: Distinct terms values? (like in Luke)

Posted by Uwe Schindler <uw...@thetaphi.de>.

> Don't mean to hijack this thread, but I have a related question:
> 
> Is there also a way to filter the terms based on another field?
> 
> For example, the documents might also contain the field "published
> date", so I want to get a distinct list of values for the term
> "religion" in documents published within a range of dates.

TermEnum does not cover this, and the information cannot be retrieved easily.
If you really need such functionality, you can use payloads, for example:
index the religion as a term and attach the date as a binary payload. Then you
can enumerate all religions as described before, but filter the terms using
the TermPositions access methods. The other way round may also be possible
(index the dates and attach the religion as payload).
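
To make the idea concrete, here is a rough sketch in plain Java with no Lucene
classes involved. A TreeMap stands in for the terms of the "religion" field,
and each term's list of dates stands in for the payloads you would read while
walking its postings. The class name, the sample data and the method
termsPublishedBetween are made up for illustration; a real implementation
would have to decode the dates from the binary payloads instead.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of "term + date payload"; hypothetical, not Lucene code. */
final class PayloadFilterSketch {

    /** Each religion term maps to the published dates attached to its postings. */
    static TreeMap<String, List<LocalDate>> sampleIndex() {
        TreeMap<String, List<LocalDate>> postings = new TreeMap<>();
        postings.put("Baha'i", Arrays.asList(LocalDate.of(2009, 1, 15), LocalDate.of(2009, 3, 2)));
        postings.put("Christian", Arrays.asList(LocalDate.of(2008, 11, 30)));
        postings.put("Islam", Arrays.asList(LocalDate.of(2009, 2, 10)));
        postings.put("Jewish", Arrays.asList(LocalDate.of(2007, 6, 1), LocalDate.of(2009, 4, 20)));
        return postings;
    }

    /** Returns the terms that have at least one posting dated within [from, to]. */
    static List<String> termsPublishedBetween(TreeMap<String, List<LocalDate>> postings,
                                              LocalDate from, LocalDate to) {
        List<String> matching = new ArrayList<>();
        // Outer loop: enumerate every term of the field, in sorted order.
        for (Map.Entry<String, List<LocalDate>> entry : postings.entrySet()) {
            // Inner loop: walk the term's postings and inspect each payload.
            for (LocalDate published : entry.getValue()) {
                if (!published.isBefore(from) && !published.isAfter(to)) {
                    matching.add(entry.getKey());
                    break; // one hit is enough to keep the term
                }
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        // Prints [Baha'i, Islam]
        System.out.println(termsPublishedBetween(
                sampleIndex(), LocalDate.of(2009, 1, 1), LocalDate.of(2009, 3, 31)));
    }
}
```

Note that the cost grows with the number of postings per term, because every
term has to be checked against its payloads until the first match.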

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Distinct terms values? (like in Luke)

Posted by Jeff Turner <je...@pointomatic.com>.

Don't mean to hijack this thread, but I have a related question:

Is there also a way to filter the terms based on another field?

For example, the documents might also contain the field "published  
date", so I want to get a distinct list of values for the term  
"religion" in documents published within a range of dates.

Thanks
Jeff

On May 10, 2009, at 11:35 AM, Uwe Schindler wrote:

> You can get this list using IndexReader.terms(new Term(fieldname,"")). This
> returns an enumeration of all terms starting with the given one (the field
> name). Just iterate over the TermEnum until the field name of the iterated
> term changes.

RE: Distinct terms values? (like in Luke)

Posted by Uwe Schindler <uw...@thetaphi.de>.

I forgot: an alternative is to use the FieldCache parsers, which automatically
throw a RuntimeException as soon as a term holds a lower-precision value.
FieldCache itself relies on this to stop the iteration during uninversion:

 try {
   while (next != null && next.field().equals("trie")) {
     // parseInt() throws a RuntimeException for lower-precision terms
     ints.add(FieldCache.NUMERIC_UTILS_INT_PARSER.parseInt(next.text()));
     next = termEnum.next() ? termEnum.term() : null;
   }
 } catch (RuntimeException e) {
   // expected: the first lower-precision term ends the enumeration
 }

See the code of FieldCacheImpl, which does exactly that.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Monday, October 26, 2009 10:43 AM
> To: java-user@lucene.apache.org
> Subject: RE: Distinct terms values? (like in Luke)
> 
> You can add a check to your while condition that stops the iteration as
> soon as the first lower-precision term is reached:
> 
> while (next != null && next.field().equals("trie")
>     && next.text().charAt(0) == NumericUtils.SHIFT_START_INT) ...
> 
> Use the same constant for float, and SHIFT_START_LONG for long and double.
> 
> This should work. Maybe we should add a method to NumericUtils that checks
> this and returns whether the term is of highest precision.
> 
> Uwe

RE: Distinct terms values? (like in Luke)

Posted by Uwe Schindler <uw...@thetaphi.de>.

>         while (next != null && next.field().equals("trie")) {
>             ints.add(NumericUtils.prefixCodedToInt(next.text()));
>             next = termEnum.next() ? termEnum.term() : null;
>         }
> 
> ==> [-2, -1, 0, 1, 2, -16, 0, -256, 0, -4096, 0, -65536, 0, -1048576, 0,
> -16777216, 0, -268435456, 0]

You can add a check to your while condition that stops the iteration as soon
as the first lower-precision term is reached:

while (next != null && next.field().equals("trie")
    && next.text().charAt(0) == NumericUtils.SHIFT_START_INT) ...

Use the same constant for float, and SHIFT_START_LONG for long and double.

This should work. Maybe we should add a method to NumericUtils that checks
this and returns whether the term is of highest precision.
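
To show why the extra values appear and why checking the first character
removes them, here is a self-contained sketch that re-implements a simplified
form of the prefix coding in plain Java. It is a model for illustration, not
the NumericUtils source: the value 0x60 for SHIFT_START_INT and the precision
step of 4 are assumptions (the step is inferred from the output posted in the
question), and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified model of prefix-coded int terms; hypothetical, not Lucene code. */
final class PrefixCodedSketch {

    // Assumption: same role as NumericUtils.SHIFT_START_INT (first char = this + shift).
    static final char SHIFT_START_INT = 0x60;

    /** Encodes value with its lowest `shift` bits dropped, 7 bits per char. */
    static String encode(int value, int shift) {
        int sortableBits = (value ^ 0x80000000) >>> shift; // flip sign bit so terms sort numerically
        int nChars = (31 - shift) / 7 + 1;
        char[] buffer = new char[nChars + 1];
        buffer[0] = (char) (SHIFT_START_INT + shift);
        for (int i = nChars; i >= 1; i--) {
            buffer[i] = (char) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return new String(buffer);
    }

    /** Reverses encode(); the dropped low bits come back as zeros. */
    static int decode(String term) {
        int shift = term.charAt(0) - SHIFT_START_INT;
        int sortableBits = 0;
        for (int i = 1; i < term.length(); i++) {
            sortableBits = (sortableBits << 7) | term.charAt(i);
        }
        return (sortableBits << shift) ^ 0x80000000;
    }

    /** Builds the sorted set of terms a trie-encoded field would hold. */
    static TreeSet<String> indexTerms(int[] values, int precisionStep) {
        TreeSet<String> terms = new TreeSet<>();
        for (int value : values) {
            for (int shift = 0; shift < 32; shift += precisionStep) {
                terms.add(encode(value, shift));
            }
        }
        return terms;
    }

    /** Decodes the terms in order, optionally stopping at the first lower-precision term. */
    static List<Integer> distinctValues(TreeSet<String> terms, boolean fullPrecisionOnly) {
        List<Integer> values = new ArrayList<>();
        for (String term : terms) {
            if (fullPrecisionOnly && term.charAt(0) != SHIFT_START_INT) {
                break; // full-precision terms sort first, so everything after is lower precision
            }
            values.add(decode(term));
        }
        return values;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = indexTerms(new int[] {-2, -1, 0, 1, 2}, 4);
        System.out.println(distinctValues(terms, false)); // 19 values, including -16, -256, ...
        System.out.println(distinctValues(terms, true));  // [-2, -1, 0, 1, 2]
    }
}
```

Because the first character encodes the shift, all full-precision terms sort
ahead of the lower-precision ones, so breaking out of the loop at the first
mismatch is enough.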

Uwe



RE: Distinct terms values? (like in Luke)

Posted by vsevel <v....@lombardodier.com>.

Does it work for numeric fields too? I am working with 2.9.0 and the
following code gives extra values:

    @Test
    public void distinct() throws Exception {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        for (int l = -2; l <= 2; l++) {
            Document doc = new Document();
            doc.add(new Field("text", "the big brown", Field.Store.NO,
                    Field.Index.ANALYZED));
            doc.add(new NumericField("trie", Field.Store.NO, true).setIntValue(l));
            writer.addDocument(doc);
        }

        writer.close();

        IndexReader reader = IndexReader.open(directory, true);
        TermEnum termEnum = reader.terms(new Term("trie", ""));
        Term next = termEnum.term();
        List<Integer> ints = new ArrayList<Integer>();

        while (next != null && next.field().equals("trie")) {
            ints.add(NumericUtils.prefixCodedToInt(next.text()));
            next = termEnum.next() ? termEnum.term() : null;
        }

        reader.close();

        log.info(ints.toString());
    }

==> [-2, -1, 0, 1, 2, -16, 0, -256, 0, -4096, 0, -65536, 0, -1048576, 0,
-16777216, 0, -268435456, 0]

Is there a way to make this work?


Uwe Schindler wrote:
> 
> You can get this list using IndexReader.terms(new Term(fieldname,"")). This
> returns an enumeration of all terms starting with the given one (the field
> name). Just iterate over the TermEnum until the field name of the iterated
> term changes.

RE: Distinct terms values? (like in Luke)

Posted by Uwe Schindler <uw...@thetaphi.de>.

You can get this list using IndexReader.terms(new Term(fieldname,"")). This
returns an enumeration of all terms starting with the given one (the field
name). Just iterate over the TermEnum until the field name of the iterated
term changes.
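
As an illustration of the iteration pattern, here is a small sketch in plain
Java that runs without Lucene. A TreeSet of (field, text) pairs stands in for
the index's sorted term dictionary, and tailSet() plays the role of
IndexReader.terms(new Term(fieldname, "")). The FieldTerm class, the sample
data and the method names are assumptions made for the example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java model of a sorted term dictionary; hypothetical, not Lucene code. */
final class DistinctTermsSketch {

    /** A (field, text) pair ordered by field first and text second. */
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Sample dictionary with three fields; the set keeps each term only once. */
    static TreeSet<FieldTerm> sampleDictionary() {
        TreeSet<FieldTerm> dictionary = new TreeSet<>();
        dictionary.add(new FieldTerm("author", "Smith"));
        dictionary.add(new FieldTerm("religion", "Jewish"));
        dictionary.add(new FieldTerm("religion", "Baha'i"));
        dictionary.add(new FieldTerm("religion", "Christian"));
        dictionary.add(new FieldTerm("religion", "Islam"));
        dictionary.add(new FieldTerm("religion", "Christian")); // duplicate, ignored
        dictionary.add(new FieldTerm("title", "Intro"));
        return dictionary;
    }

    /** Seeks to (field, "") and collects term texts until the field name changes. */
    static List<String> distinctValues(TreeSet<FieldTerm> dictionary, String field) {
        List<String> values = new ArrayList<>();
        for (FieldTerm term : dictionary.tailSet(new FieldTerm(field, ""), true)) {
            if (!term.field.equals(field)) {
                break; // reached the terms of the next field
            }
            values.add(term.text);
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints [Baha'i, Christian, Islam, Jewish]
        System.out.println(distinctValues(sampleDictionary(), "religion"));
    }
}
```

The field-name check matters: without it the loop would run on into the terms
of whatever field sorts after the requested one.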

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
