Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/08/03 23:39:28 UTC

Using Luke to get terms for docs matching a specific query filter?

You can get a nice list of terms for a field using the Luke handler:
    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000

But what I'd really like is to get the terms for the docs that match a
particular slice of the index.

For example, let's say I have records for all 50 states, but I want to get
the top 1,000 terms for documents in California.

I'd like to add q or fq like this:
    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
        OR
    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA

Although I don't get any errors, this syntax doesn't seem to filter the
terms.  That's not a bug; nobody ever said it would.

But has anybody written a utility to get term instances for a subset of the
index, based on a query?  And to be clear, I was hoping to get all of the
terms in matching documents, not just terms that are also present in the
query.

Thanks,
Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Tue, Aug 4, 2009 at 12:16 AM, Mark Bennett<mb...@ideaeng.com> wrote:
> So just make sure to use rows=1?

No, make sure that the query matches one document - rows (the number
of top docs returned) is irrelevant to faceting.
So q=id:some_doc

-Yonik
http://www.lucidimagination.com

Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Mark Bennett <mb...@ideaeng.com>.

So just make sure to use rows=1?


Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Aug 3, 2009 at 8:26 PM, Mark Bennett<mb...@ideaeng.com> wrote:
> Yonik, can you confirm reasoning below for 1.4 for a text field?

The bit about warming?  Looks right to me - a big base docset can
trigger short-circuit logic in the enum faceting code... using a
docset of size 1 currently avoids this.

-Yonik
http://www.lucidimagination.com



Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Mark Bennett <mb...@ideaeng.com>.

Yonik, can you confirm the reasoning below for 1.4 for a text field?

(Of course, faceting is so much faster in 1.4 anyway that it's probably worth
the upgrade:
     https://issues.apache.org/jira/browse/SOLR-475 )

A warning for folks NOT using 1.4:

At the very bottom of this wiki page:
    http://wiki.apache.org/solr/SimpleFacetParameters
it says:
    Warming
    facet.field queries using the term enumeration method can avoid the
    evaluation of some terms for greater efficiency. To force the evaluation
    of all terms for warming, the base query should match a single document.

I think this is OK in the newer version, because as of 1.4 the default is
"fc", not "enum".  But prior to 1.4 there was no fc for multi-valued fields
(such as a tokenized text field); enum was the only method there!

Wiki info on the default (enum vs. fc):
    http://wiki.apache.org/solr/SimpleFacetParameters

facet.method
    This parameter indicates what type of algorithm/method to use when
    faceting a field.

enum
    Enumerates all terms in a field, calculating the set intersection of
    documents that match the term with documents that match the query. This
    was the default (and only) method for faceting multi-valued fields prior
    to Solr 1.4.

fc (stands for field cache)
    The facet counts are calculated by iterating over documents that match
    the query and summing the terms that appear in each document. This was
    the default method for single valued fields prior to Solr 1.4.

The default value is fc (except for BoolField) since it tends to use less
memory and is faster when a field has many unique terms in the index.
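
To make the difference between the two methods concrete, here is a toy sketch in Python. It is only an illustration of the two counting strategies as the wiki text describes them, not Solr's actual code; the in-memory DOCS index and the function names facet_enum and facet_fc are made up for the example.

```python
from collections import Counter

# Toy index: document id -> set of terms in the faceted field (made-up data).
DOCS = {
    1: {"golden", "gate", "bridge"},
    2: {"golden", "state"},
    3: {"empire", "state", "building"},
}

def facet_enum(docs, base_docset):
    """enum style: walk every term in the field and intersect its
    posting set with the set of documents matching the query."""
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    counts = {}
    for term, doc_ids in postings.items():
        n = len(doc_ids & base_docset)
        if n:
            counts[term] = n
    return counts

def facet_fc(docs, base_docset):
    """fc style: walk only the matching documents and tally their terms."""
    counts = Counter()
    for doc_id in base_docset:
        counts.update(docs[doc_id])
    return dict(counts)

california = {1, 2}  # pretend these two documents matched q=state:CA
print(facet_enum(DOCS, california) == facet_fc(DOCS, california))
```

Both functions give the same counts; the work differs. The enum version touches every term in the field regardless of how few documents match, while the fc version only touches the matching documents.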



Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Mark Bennett <mb...@ideaeng.com>.

Ah!  Looks like that'll work.  Thanks Yonik!

For other folks listening in, he's suggesting not using Luke, and instead
reverting to a regular faceted query.

The full facet query URL would then be:
    http://localhost:8983/solr/select?facet=true&facet.field=title&facet.limit=1000&q=state:CA
vs. my attempted Luke URL of:
    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
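
If you are building the facet URL from a script, something like the following Python sketch should work. The helper name facet_terms_url is made up, and the host and field names are just the ones from this thread. Adding rows=0 is my own assumption: since rows does not affect faceting, it simply avoids returning the matching documents along with the term counts.

```python
from urllib.parse import urlencode

def facet_terms_url(base, query, field, limit=1000):
    """Build a /select URL that facets on `field` for docs matching `query`."""
    params = [
        ("q", query),            # the slice of the index, e.g. state:CA
        ("rows", 0),             # assumption: skip the document list itself
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),  # how many top terms to return
    ]
    # urlencode percent-escapes the query value (":" becomes %3A).
    return base + "/select?" + urlencode(params)

url = facet_terms_url("http://localhost:8983/solr", "state:CA", "title")
print(url)
```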

The output is still in XML, though the XPath to the terms is a bit
different.

The Facet XPath is something like:

/response/lst[@name='facet_counts']/lst[@name='facet_fields']/lst[@name='title']/int/@name

The Luke XPath (terms for all docs) is something like:

/response/lst[@name='fields']/lst[@name='title']/lst[@name='topTerms']/int/@name
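
For anyone who wants to pull the terms out in code, here is a rough Python sketch using the standard library's ElementTree. The SAMPLE response is hand-written to mirror the facet layout described above (with the outer list named facet_counts, as Solr names it), and the terms and counts in it are made up. ElementTree's limited XPath cannot select an attribute directly, so the sketch selects the int elements and reads their name attribute.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a Solr XML facet response (made-up data).
SAMPLE = """<response>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="title">
        <int name="golden">2</int>
        <int name="gate">1</int>
      </lst>
    </lst>
  </lst>
</response>"""

def facet_terms(xml_text, field):
    """Return (term, count) pairs for one facet field, in response order."""
    root = ET.fromstring(xml_text)  # root is the <response> element
    path = ("./lst[@name='facet_counts']/lst[@name='facet_fields']"
            "/lst[@name='%s']/int" % field)
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]

print(facet_terms(SAMPLE, "title"))
```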


Re: Using Luke to get terms for docs matching a specific query filter?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

Sounds like faceting?
q=state:CA&facet=true&facet.field=title&facet.limit=1000

-Yonik
http://www.lucidimagination.com

