Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2007/05/16 04:09:37 UTC

Luke request handler issue

I've switched Flare to use the Luke request handler simply to  
retrieve the fields in the index.

In the case of a 3.7M document index, it takes a LONG time to execute  
because of the top terms it's generating.  I tried setting numTerms=0  
and got an array index out of bounds exception.  Is there a trick I'm  
not seeing in getting just the list of fields back in the fastest  
possible way with this request handler?

If Ryan is date-less again tonight, I'm sure he'll have it all fixed  
up by the time I wake up :)  Otherwise, I'll dig in and roll up my  
sleeves sometime this week and make some adjustments to allow turning  
off the top terms feature.

Thanks,
	Erik

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

Erik Hatcher wrote:
> 
> On May 22, 2007, at 10:42 PM, Ryan McKinley wrote:
>> If thats the case, I think the .diff you posted is fine...
> 
> Not really, because I commented out a bit to get past things.  It was 
> more than just setting the default to zero.
> 

The bit you commented out calculated numTerms across all fields (forcing it 
to walk through all terms).  Since this is not all that useful and 
configuring it seems overkill, I don't mind throwing it out.

I'll take a look and make sure though.

>> The only thing I would change is I think the default should be some 
>> positive number.  For the app where you want the default to be 0, you 
>> can initialize the request handler with:
>>
>>   <requestHandler ... >
>>     <lst name="defaults">
>>      <int name="numTerms">0</int>
>>     </lst>
>>   </requestHandler>
> 
> I don't get why the default should be non-zero.  The most common use 
> case would be field/type/size introspection, I presume.  

I have been using it as a visual inspection of what is in the index. 
The default page that shows all information for all fields is good 
because (without figuring out what parameters do what) you can just see 
what is in the index...  for the indexes I have worked with (so far 
<300K docs) that has been fine.

Luke (the app) opens showing top terms across all fields - then you 
click on individual fields to see the top terms for that field.

I would like the default (no params / no config) to be the most useful to 
people who are just starting with lucene/solr and want to know what all 
this talk about "terms" is.

programmatic uses can easily send "numTerms=0" in the request or 
configure it in the defaults.
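
As a minimal sketch of the programmatic case, a client could build the 
request URL itself.  The /admin/luke path and the fl / numTerms 
parameters are the ones used in the example URL in this thread; the 
class and method names below are hypothetical and not part of Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of Solr): builds a Luke request URL.
final class LukeUrlSketch {

  // solrBase is assumed to look like "http://localhost:8983/solr".
  // field may be null to ask about all fields; numTerms=0 asks for no top terms.
  static String lukeUrl(String solrBase, String field, int numTerms) {
    StringBuilder sb = new StringBuilder(solrBase)
        .append("/admin/luke?numTerms=")
        .append(numTerms);
    if (field != null) {
      try {
        sb.append("&fl=").append(URLEncoder.encode(field, "UTF-8"));
      } catch (UnsupportedEncodingException e) {
        // UTF-8 is a required charset on every JVM, so this should not happen.
        throw new IllegalStateException(e);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Field/type introspection only, no top terms.
    System.out.println(lukeUrl("http://localhost:8983/solr", null, 0));
    // Top terms for a single field.
    System.out.println(lukeUrl("http://localhost:8983/solr", "text", 1000));
  }
}
```
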

>I don't see 
> getting top terms being as needed.  But, I'm fine with the default being 
> non-zero if others feel it should be - setting it in the config file is 
> no big deal for me :)
> 
>     Erik
> 
>

Re: Luke request handler issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On May 22, 2007, at 10:42 PM, Ryan McKinley wrote:
> If thats the case, I think the .diff you posted is fine...

Not really, because I commented out a bit to get past things.  It was  
more than just setting the default to zero.

> The only thing I would change is I think the default should be some  
> positive number.  For the app where you want the default to be 0,  
> you can initialize the request handler with:
>
>   <requestHandler ... >
>     <lst name="defaults">
>      <int name="numTerms">0</int>
>     </lst>
>   </requestHandler>

I don't get why the default should be non-zero.  The most common use  
case would be field/type/size introspection, I presume.  I don't see  
getting top terms being as needed.  But, I'm fine with the default  
being non-zero if others feel it should be - setting it in the config  
file is no big deal for me :)

	Erik

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

If that's the case, I think the .diff you posted is fine...

The only thing I would change is I think the default should be some 
positive number.  For the app where you want the default to be 0, you 
can initialize the request handler with:

   <requestHandler ... >
     <lst name="defaults">
      <int name="numTerms">0</int>
     </lst>
   </requestHandler>


Erik Hatcher wrote:
> Ryan - just so you know where my current need with this is, it's only in 
> getting field names and types, as well as total documents back.  The top 
> terms aren't a need for my projects.  So I don't really have any 
> preference on the specifics, other than needing to be able to turn that 
> feature off :)
> 
>     Erik
> 
> 
> On May 22, 2007, at 9:56 PM, Ryan McKinley wrote:
> 
>> Ryan McKinley wrote:
>>> Yonik Seeley wrote:
>>>> The whole topTerms thing is exactly the same concept as faceting
>>>> with *:* as a base (with perhaps the exception of ignoring deleted
>>>> docs by using df?)
>>>> Should these parameters be aligned somehow?
>>>>
>>> Using the faceting implementation would be good too... since you 
>>> would get the all the caching etc.
>>> maybe it can directly use faceting parameters (and implementation) 
>>> for "topTerms" -- if nothing is specified for "facet.field", it will 
>>> add all fields (alternatively, normal faceting could support *, but 
>>> that seems like a bad idea in the general case)
>>> I'll take a look at that and see how it feels...
>>
>> There are a few show stoppers with that idea.... most notable the 
>> faceting implementation needs a solr field.  Much of the motivation 
>> for the LukeRequestHandler is to inspect an index regardless of what 
>> solr thinks about it.
>>
>> - - - -
>>
>> How do you imagine the parameters would be aligned?
>>
>> It could use the same per/field specification:
>>  f.category.facet.limit=5
>>
>> perhaps it Luke should support:
>>  terms.top=10
>>   and
>>  f.category.terms.top=10
>>
>> I'm reluctant to go this route because it makes asking if any we 
>> should calculate top terms or not difficut (ok, akward) and i'm not 
>> sure it helps that much...
>>
>> I'll make a JIRA issue with a simple implementation you all can poke at.
>>
>> ryan
> 
>

Re: Luke request handler issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Ryan - just so you know where my current need with this is, it's only  
in getting field names and types, as well as total documents back.   
The top terms aren't a need for my projects.  So I don't really have  
any preference on the specifics, other than needing to be able to  
turn that feature off :)

	Erik


On May 22, 2007, at 9:56 PM, Ryan McKinley wrote:

> Ryan McKinley wrote:
>> Yonik Seeley wrote:
>>> The whole topTerms thing is exactly the same concept as faceting
>>> with *:* as a base (with perhaps the exception of ignoring deleted
>>> docs by using df?)
>>> Should these parameters be aligned somehow?
>>>
>> Using the faceting implementation would be good too... since you  
>> would get the all the caching etc.
>> maybe it can directly use faceting parameters (and implementation)  
>> for "topTerms" -- if nothing is specified for "facet.field", it  
>> will add all fields (alternatively, normal faceting could support  
>> *, but that seems like a bad idea in the general case)
>> I'll take a look at that and see how it feels...
>
> There are a few show stoppers with that idea.... most notable the  
> faceting implementation needs a solr field.  Much of the motivation  
> for the LukeRequestHandler is to inspect an index regardless of  
> what solr thinks about it.
>
> - - - -
>
> How do you imagine the parameters would be aligned?
>
> It could use the same per/field specification:
>  f.category.facet.limit=5
>
> perhaps it Luke should support:
>  terms.top=10
>   and
>  f.category.terms.top=10
>
> I'm reluctant to go this route because it makes asking if any we  
> should calculate top terms or not difficut (ok, akward) and i'm not  
> sure it helps that much...
>
> I'll make a JIRA issue with a simple implementation you all can  
> poke at.
>
> ryan

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

Yonik Seeley wrote:
> On 5/23/07, Ryan McKinley <ry...@gmail.com> wrote:
>> > If someone wants to retrieve *all* of the terms in a specific field,
>> > it doesn't seem like they should have to get all of the terms in all
>> > other fields too, right?
>> >
>>
>> As implemented, you get the top terms for all the fields you ask for.
>> By default this is all of them.  If you specify a field (with fl=xxx)
>> you only get that field's top terms:
>>   http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000
>>
>> It may be useful to want 10 terms from field 'A' and 100 for field 'B',
>> but for now, that should probably be done with faceting.
>>
>> Faceting returns readable values (from the schema) while Luke deals with
>> the raw lucene index.
> 
> Ah, yes... I see both as being useful.
> If solr does know about the fieldType, should the default be to use
> the external (human readable) values?
> 

That's how it currently works:

   NamedList<Integer> list = new NamedList<Integer>();
   for (TermInfo i : aslist) {
     String txt = i.term.text();
     SchemaField ft = schema.getFieldOrNull( i.term.field() );
     if( ft != null ) {
       txt = ft.getType().indexedToReadable( txt );
     }
     list.add( txt, i.docFreq );
   }
   return list;

When you inspect a single document, it returns both.

ryan

Re: Luke request handler issue

Posted by Yonik Seeley <yo...@apache.org>.

On 5/23/07, Ryan McKinley <ry...@gmail.com> wrote:
> > If someone wants to retrieve *all* of the terms in a specific field,
> > it doesn't seem like they should have to get all of the terms in all
> > other fields too, right?
> >
>
> As implemented, you get the top terms for all the fields you ask for.
> By default this is all of them.  If you specify a field (with fl=xxx)
> you only get that field's top terms:
>   http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000
>
> It may be useful to want 10 terms from field 'A' and 100 for field 'B',
> but for now, that should probably be done with faceting.
>
> Faceting returns readable values (from the schema) while Luke deals with
> the raw lucene index.

Ah, yes... I see both as being useful.
If solr does know about the fieldType, should the default be to use
the external (human readable) values?

-Yonik

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

> 
> If someone wants to retrieve *all* of the terms in a specific field,
> it doesn't seem like they should have to get all of the terms in all
> other fields too, right?
> 

As implemented, you get the top terms for all the fields you ask for. 
By default this is all of them.  If you specify a field (with fl=xxx) 
you only get that field's top terms:
  http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000

It may be useful to want 10 terms from field 'A' and 100 for field 'B', 
but for now, that should probably be done with faceting.

Faceting returns readable values (from the schema) while Luke deals with 
the raw lucene index.

> All this configurability doesn't need to be implemented now, but we
> should plan for it and leave room in the interface if possible.
> 

that sounds good.  For now, making numTerms=0 not walk through should be 
enough.  The rest should come as we see a specific need for it.

Re: Luke request handler issue

Posted by Yonik Seeley <yo...@apache.org>.

On 5/22/07, Ryan McKinley <ry...@gmail.com> wrote:
> How do you imagine the parameters would be aligned?

It just seemed like they were doing largely the same thing...
specify if you want terms enumerated in order, or sorted,
specify the number of top terms, etc.

> It could use the same per/field specification:
>   f.category.facet.limit=5
>
> perhaps it Luke should support:
>   terms.top=10
>    and
>   f.category.terms.top=10
>
> I'm reluctant to go this route because it makes asking if any we should
> calculate top terms or not difficut (ok, akward) and i'm not sure it
> helps that much...

Then one could have topTerms=true like highlighting/faceting do, or
one could perhaps specify a field list
  topTerms=fooField,barField
or
  topTerms=*

If someone wants to retrieve *all* of the terms in a specific field,
it doesn't seem like they should have to get all of the terms in all
other fields too, right?

All this configurability doesn't need to be implemented now, but we
should plan for it and leave room in the interface if possible.
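
A rough sketch of how such a topTerms value might be interpreted, 
assuming the three forms suggested above (true, a comma-separated field 
list, or *).  None of this exists in the handler; the class and method 
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical parser for a proposed "topTerms" parameter; not an existing Solr API.
final class TopTermsParamSketch {

  // Returns the fields whose top terms should be computed.
  // allFields stands in for the field names found in the index.
  static List<String> fieldsFor(String topTerms, List<String> allFields) {
    if (topTerms == null) {
      return Collections.emptyList();          // parameter absent: feature off
    }
    String value = topTerms.trim();
    if (value.isEmpty() || "false".equalsIgnoreCase(value)) {
      return Collections.emptyList();
    }
    if ("*".equals(value) || "true".equalsIgnoreCase(value)) {
      return new ArrayList<String>(allFields); // every field in the index
    }
    List<String> requested = new ArrayList<String>();
    for (String name : value.split(",")) {
      String field = name.trim();
      if (!field.isEmpty()) {
        requested.add(field);                  // explicit field list
      }
    }
    return requested;
  }

  public static void main(String[] args) {
    List<String> all = Arrays.asList("fooField", "barField", "bazField");
    System.out.println(fieldsFor("fooField,barField", all));
    System.out.println(fieldsFor("*", all));
    System.out.println(fieldsFor(null, all));
  }
}
```
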

-Yonik

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

Ryan McKinley wrote:
> Yonik Seeley wrote:
>> The whole topTerms thing is exactly the same concept as faceting
>> with *:* as a base (with perhaps the exception of ignoring deleted
>> docs by using df?)
>> Should these parameters be aligned somehow?
>>
> 
> Using the faceting implementation would be good too... since you would 
> get the all the caching etc.
> 
> maybe it can directly use faceting parameters (and implementation) for 
> "topTerms" -- if nothing is specified for "facet.field", it will add all 
> fields (alternatively, normal faceting could support *, but that seems 
> like a bad idea in the general case)
> 
> I'll take a look at that and see how it feels...
> 

There are a few show stoppers with that idea.... most notably, the 
faceting implementation needs a solr field.  Much of the motivation for 
the LukeRequestHandler is to inspect an index regardless of what solr 
thinks about it.

- - - -

How do you imagine the parameters would be aligned?

It could use the same per/field specification:
  f.category.facet.limit=5

perhaps Luke should support:
  terms.top=10
   and
  f.category.terms.top=10

I'm reluctant to go this route because it makes asking whether we should 
calculate any top terms at all difficult (ok, awkward) and I'm not sure it 
helps that much...
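
For what it's worth, resolving a per-field override is the easy part. 
A minimal sketch, assuming the terms.top / f.<field>.terms.top names 
floated above (they are only a proposal here, not existing parameters) 
and a plain Map standing in for the request params:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup for the proposed per-field parameter; not an existing Solr API.
final class PerFieldParamSketch {

  // Checks f.<field>.terms.top first, then the global terms.top,
  // then falls back to the supplied default.
  static int topFor(Map<String, String> params, String field, int fallback) {
    String value = params.get("f." + field + ".terms.top");
    if (value == null) {
      value = params.get("terms.top");
    }
    return value == null ? fallback : Integer.parseInt(value);
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<String, String>();
    params.put("terms.top", "10");
    params.put("f.category.terms.top", "5");
    System.out.println(topFor(params, "category", 0)); // per-field override wins
    System.out.println(topFor(params, "title", 0));    // falls back to terms.top
  }
}
```

The awkward part is the one mentioned above: with per-field values, 
deciding whether any term walking is needed at all means checking every 
field's resolved value first.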

I'll make a JIRA issue with a simple implementation you all can poke at.

ryan

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

Yonik Seeley wrote:
> The whole topTerms thing is exactly the same concept as faceting
> with *:* as a base (with perhaps the exception of ignoring deleted
> docs by using df?)
> Should these parameters be aligned somehow?
> 

Using the faceting implementation would be good too... since you would 
get all the caching etc.

maybe it can directly use faceting parameters (and implementation) for 
"topTerms" -- if nothing is specified for "facet.field", it will add all 
fields (alternatively, normal faceting could support *, but that seems 
like a bad idea in the general case)

I'll take a look at that and see how it feels...

Re: Luke request handler issue

Posted by Yonik Seeley <yo...@apache.org>.

The whole topTerms thing is exactly the same concept as faceting
with *:* as a base (with perhaps the exception of ignoring deleted
docs by using df?)
Should these parameters be aligned somehow?

-Yonik

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

Yonik Seeley wrote:
> On 5/22/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> I think all the features of the Luke request handler should be made
>> optional, except for just feeding back the fields and types in the
>> index which seems a reasonable always-on feature.
> 
> +1
> 
> The feature list of this handler could grow, and I think having
> explicit ways to turn them on is the right way to go (like
> standard/dismax request handlers have for faceting, highlighting, etc)
> 
> -Yonik
> 

Sounds good...  There are two reasons you would turn on/off options: 
speed and display.

Since many of the features are calculated together, we need a parameter 
interface that makes sense.  For example, turning off "TopTerms" also 
turns off "distinct terms" and the histogram...  If you want the 
histogram, but not the TopTerms, it takes the same time - and even 
calculates the TopTerms - but does not return them.
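
To illustrate why these travel together, here is a simplified sketch 
of a single pass that produces all three.  It is not the handler's 
code: a plain Map of term -> docFreq stands in for walking 
reader.terms(), the power-of-two histogram buckets are an assumption, 
and the class name is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical stand-in for one field's term statistics.
final class TermWalkSketch {
  int distinct;                                                   // simple ++ per term
  final Map<Integer, Integer> histogram = new TreeMap<Integer, Integer>(); // bucket -> term count
  final List<String> topTerms = new ArrayList<String>();          // highest docFreq first

  static TermWalkSketch walk(Map<String, Integer> termDocFreqs, int numTerms) {
    TermWalkSketch out = new TermWalkSketch();
    if (numTerms <= 0) {
      return out; // skip the walk entirely, so distinct and histogram stay empty too
    }
    // Min-heap on docFreq, capped at numTerms entries.
    PriorityQueue<Map.Entry<String, Integer>> queue =
        new PriorityQueue<Map.Entry<String, Integer>>(
            Map.Entry.<String, Integer>comparingByValue());
    for (Map.Entry<String, Integer> term : termDocFreqs.entrySet()) {
      out.distinct++;
      // Assumption: bucket by the largest power of two <= docFreq.
      int bucket = Integer.highestOneBit(term.getValue());
      out.histogram.merge(bucket, 1, Integer::sum);
      queue.add(term);
      if (queue.size() > numTerms) {
        queue.poll(); // evict the current lowest docFreq
      }
    }
    while (!queue.isEmpty()) {
      out.topTerms.add(0, queue.poll().getKey()); // reverse into descending order
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Integer> terms = new LinkedHashMap<String, Integer>();
    terms.put("a", 1);
    terms.put("b", 5);
    terms.put("c", 3);
    terms.put("d", 8);
    TermWalkSketch stats = walk(terms, 2);
    System.out.println(stats.distinct + " " + stats.topTerms + " " + stats.histogram);
  }
}
```
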

A quick look of features/hierarchy you may want to turn on/off:

* indexinfo (numDocs, maxDocs, etc)
* terms (has to walk through reader.terms())
   * distinct (simple ++)
   * top (PriorityQueue)
   * histogram (Hash)
* luceneFieldInfo (has to call "fieldName:[* TO *]")
  * flags
  * numDocs
* key

One option would be to enumerate these features and have a boolean param 
for each thing in the hierarchy.  If the parent is set, but none of the 
children, it assumes true for all of them.

For example: (assuming the default for everything is false)

/luke?return.indexinfo=on&return.terms=true
would return:
  indexinfo,
  terms.distinct,
  terms.top,
  terms.histogram

/luke?return.terms.histogram=true
would return only:
  terms.histogram
(but top terms would still be computed)

Perhaps there should also be a:
  return.default=true/false

that decides what the default for *all* of the features is.

Thoughts?  Is this what you had in mind?  Is there a better parameter 
name to consider than "return.terms=false"?
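
To make the proposal concrete, here is a rough sketch of the 
resolution rule: an explicitly set child wins, otherwise the parent 
applies to all of its children, otherwise return.default applies.  The 
return.* names are only the proposal above, treating "on" and "true" 
as enabled is an assumption, and the class is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical resolver for the proposed return.* flags; not an existing Solr API.
final class ReturnFlagsSketch {
  // Feature hierarchy from the list above: parent -> children (empty for leaves).
  static final Map<String, List<String>> HIERARCHY = new LinkedHashMap<String, List<String>>();
  static {
    HIERARCHY.put("indexinfo", Collections.<String>emptyList());
    HIERARCHY.put("terms", Arrays.asList("distinct", "top", "histogram"));
    HIERARCHY.put("luceneFieldInfo", Arrays.asList("flags", "numDocs"));
    HIERARCHY.put("key", Collections.<String>emptyList());
  }

  // Assumption: "true" and "on" both enable a feature; null means not enabled.
  static boolean truthy(String value) {
    return "true".equalsIgnoreCase(value) || "on".equalsIgnoreCase(value);
  }

  static Set<String> resolve(Map<String, String> params) {
    boolean dflt = truthy(params.get("return.default"));
    Set<String> enabled = new LinkedHashSet<String>();
    for (Map.Entry<String, List<String>> entry : HIERARCHY.entrySet()) {
      String parent = entry.getKey();
      String parentValue = params.get("return." + parent);
      boolean parentOn = parentValue == null ? dflt : truthy(parentValue);
      List<String> children = entry.getValue();
      if (children.isEmpty()) {
        if (parentOn) enabled.add(parent);
        continue;
      }
      boolean anyChildSet = false;
      for (String child : children) {
        if (params.containsKey("return." + parent + "." + child)) anyChildSet = true;
      }
      for (String child : children) {
        // If any child is named explicitly, only the children set to true are returned.
        boolean on = anyChildSet
            ? truthy(params.get("return." + parent + "." + child))
            : parentOn;
        if (on) enabled.add(parent + "." + child);
      }
    }
    return enabled;
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<String, String>();
    params.put("return.indexinfo", "on");
    params.put("return.terms", "true");
    System.out.println(resolve(params));
  }
}
```
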

ryan

Re: Luke request handler issue

Posted by Yonik Seeley <yo...@apache.org>.

On 5/22/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> I think all the features of the Luke request handler should be made
> optional, except for just feeding back the fields and types in the
> index which seems a reasonable always-on feature.

+1

The feature list of this handler could grow, and I think having
explicit ways to turn them on is the right way to go (like
standard/dismax request handlers have for faceting, highlighting, etc)

-Yonik

Re: Luke request handler issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On May 21, 2007, at 11:00 PM, Ryan McKinley wrote:
>> If Ryan is date-less again tonight, I'm sure he'll have it all  
>> fixed up by the time I wake up :)  Otherwise, I'll dig in and roll  
>> up my sleeves sometime this week and make some adjustments to  
>> allow turning off the top terms feature.
>
> Just back from a week of nothing that goes beep... it was great!

Nice!

> Perhaps numTerms=0 would skip all term collecting...  also, it may  
> be worth caching the top terms, but i don't really want to go there  
> just yet.

I think all the features of the Luke request handler should be made  
optional, except for just feeding back the fields and types in the  
index which seems a reasonable always-on feature.

> Did you get a chance to look at the code?  or should i add the  
> numTerms=0 behavior?

I hacked my working copy to turn off all the term counting stuff for  
demo purposes (diff posted below).   We need to more robustly  
overhaul it though.

	Erik

$ svn diff
Index: java/org/apache/solr/handler/admin/LukeRequestHandler.java
===================================================================
--- java/org/apache/solr/handler/admin/LukeRequestHandler.java   (revision 538798)
+++ java/org/apache/solr/handler/admin/LukeRequestHandler.java   (working copy)
@@ -81,7 +81,7 @@
    public static final String NUMTERMS = "numTerms";
    public static final String DOC_ID = "docId";
    public static final String ID = "id";
-  public static final int DEFAULT_COUNT = 10;
+  public static final int DEFAULT_COUNT = 0;

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception
@@ -272,8 +272,11 @@
      IndexReader reader = searcher.getReader();
      IndexSchema schema = searcher.getSchema();

-    // Walk the term enum and keep a priority quey for each map in our set
-    Map<String,TopTermQueue> ttinfo = getTopTerms(reader, fields, numTerms, null );
+    Map<String,TopTermQueue> ttinfo = null;
+    if (numTerms > 0) {
+      // Walk the term enum and keep a priority quey for each map in our set
+      ttinfo = getTopTerms(reader, fields, numTerms, null );
+    }
      SimpleOrderedMap<Object> finfo = new SimpleOrderedMap<Object>();
      Collection<String> fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL);
      for (String fieldName : fieldNames) {
@@ -288,8 +291,9 @@
        f.add( "type", (ftype==null)?null:ftype.getTypeName() );
        f.add( "schema", getFieldFlags( sfield ) );
-
-      Query q = qp.parse( fieldName+":[* TO *]" );
+
+       // TODO: this could use a constant scoring range query instead of parsing
+/*      Query q = qp.parse( fieldName+":[* TO *]" );
        int docCount = searcher.numDocs( q, matchAllDocs );
        if( docCount > 0 ) {
          // Find a document with this field
@@ -311,16 +315,19 @@
          // Find one document so we can get the fieldable
        }
        f.add( "docs", docCount );
-
-      TopTermQueue topTerms = ttinfo.get( fieldName );
-      if( topTerms != null ) {
-        f.add( "distinct", topTerms.distinctTerms );
-
-        // Include top terms
-        f.add( "topTerms", topTerms.toNamedList( searcher.getSchema() ) );
+*/
-        // Add a histogram
-        f.add( "histogram", topTerms.histogram.toNamedList() );
+      if (ttinfo != null) {
+        TopTermQueue topTerms = ttinfo.get( fieldName );
+        if( topTerms != null ) {
+          f.add( "distinct", topTerms.distinctTerms );
+
+          // Include top terms
+          f.add( "topTerms", topTerms.toNamedList( searcher.getSchema() ) );
+
+          // Add a histogram
+          f.add( "histogram", topTerms.histogram.toNamedList() );
+        }
        }

        // Add the field
@@ -333,12 +340,12 @@
    private static SimpleOrderedMap<Object> getIndexInfo( IndexReader reader ) throws IOException
    {
      // Count the terms
-    TermEnum te = reader.terms();
+//    TermEnum te = reader.terms();
      int numTerms = 0;
-    while (te.next()) {
-      numTerms++;
-    }
-
+//    while (te.next()) {
+//      numTerms++;
+//    }
+
      Directory dir = reader.directory();
      SimpleOrderedMap<Object> indexInfo = new SimpleOrderedMap<Object>();
      indexInfo.add("numDocs", reader.numDocs());

Re: Luke request handler issue

Posted by Ryan McKinley <ry...@gmail.com>.

> 
> If Ryan is date-less again tonight, I'm sure he'll have it all fixed up 
> by the time I wake up :)  Otherwise, I'll dig in and roll up my sleeves 
> sometime this week and make some adjustments to allow turning off the 
> top terms feature.
> 

Just back from a week of nothing that goes beep... it was great!  Now my 
girlfriend has to study for the bar, so i will have more time ;)

Perhaps numTerms=0 would skip all term collecting...  also, it may be 
worth caching the top terms, but i don't really want to go there just yet.

Yonik suggested numTerms should be specified per field.  This is easy 
enough...

Did you get a chance to look at the code?  or should i add the 
numTerms=0 behavior?

ryan

Re: Luke request handler issue

Posted by Yonik Seeley <yo...@apache.org>.

On 5/15/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> I've switched Flare to use the Luke request handler simply to
> retrieve the fields in the index.
>
> In the case of a 3.7M document index, it takes a LONG time to execute
> because of the top terms its generating.  I tried setting numTerms=0
> and got an array index out of bounds exception.

I never had a chance to check out the luke handler, but since it's still
experimental, I think all of the "parts" should be optional.

Something that takes as long as generating top terms should also be
able to be specified per-field IMO.

-Yonik