Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2007/05/16 04:09:37 UTC
Luke request handler issue
I've switched Flare to use the Luke request handler simply to
retrieve the fields in the index.
In the case of a 3.7M document index, it takes a LONG time to execute
because of the top terms it's generating. I tried setting numTerms=0
and got an array index out of bounds exception. Is there a trick I'm
not seeing in getting just the list of fields back in the fastest
possible way with this request handler?
If Ryan is date-less again tonight, I'm sure he'll have it all fixed
up by the time I wake up :) Otherwise, I'll dig in and roll up my
sleeves sometime this week and make some adjustments to allow turning
off the top terms feature.
Thanks,
Erik
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
Erik Hatcher wrote:
>
> On May 22, 2007, at 10:42 PM, Ryan McKinley wrote:
>> If that's the case, I think the .diff you posted is fine...
>
> Not really, because I commented out a bit to get past things. It was
> more than just setting the default to zero.
>
The bit you commented out calculated numTerms across all fields (forcing
it to walk through all terms). Since this is not all that useful and
configuring it seems overkill, I don't mind throwing it out.
I'll take a look and make sure though.
>> The only thing I would change is I think the default should be some
>> positive number. For the app where you want the default to be 0, you
>> can initialize the request handler with:
>>
>> <requestHandler ... >
>> <lst name="defaults">
>> <int name="numTerms">0</int>
>> </lst>
>> </requestHandler>
>
> I don't get why the default should be non-zero. The most common use
> case would be field/type/size introspection, I presume.
I have been using it as a visual inspection of what is in the index.
The default page that shows all information for all fields is good
because (without figuring out what parameters do what) you can just see
what is in the index... for the indexes I have worked with (so far
<300K docs) that has been fine.
Luke (the app) opens showing top terms across all fields - then you
click on individual fields to see the top terms for that field.
I would like the default (no params / no config) to be the most useful to
people who are just starting with lucene/solr and want to know what all
this talk about "terms" is.
Programmatic uses can easily send "numTerms=0" in the request or
configure it in the defaults.
> I don't see
> getting top terms being as needed. But, I'm fine with the default being
> non-zero if others feel it should be - setting it in the config file is
> no big deal for me :)
>
> Erik
>
>
Re: Luke request handler issue
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 22, 2007, at 10:42 PM, Ryan McKinley wrote:
> If that's the case, I think the .diff you posted is fine...
Not really, because I commented out a bit to get past things. It was
more than just setting the default to zero.
> The only thing I would change is I think the default should be some
> positive number. For the app where you want the default to be 0,
> you can initialize the request handler with:
>
> <requestHandler ... >
> <lst name="defaults">
> <int name="numTerms">0</int>
> </lst>
> </requestHandler>
I don't get why the default should be non-zero. The most common use
case would be field/type/size introspection, I presume. I don't see
getting top terms being as needed. But, I'm fine with the default
being non-zero if others feel it should be - setting it in the config
file is no big deal for me :)
Erik
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
If that's the case, I think the .diff you posted is fine...
The only thing I would change is I think the default should be some
positive number. For the app where you want the default to be 0, you
can initialize the request handler with:
<requestHandler ... >
<lst name="defaults">
<int name="numTerms">0</int>
</lst>
</requestHandler>
Erik Hatcher wrote:
> Ryan - just so you know where my current need with this is, it's only in
> getting field names and types, as well as total documents back. The top
> terms aren't a need for my projects. So I don't really have any
> preference on the specifics, other than needing to be able to turn that
> feature off :)
>
> Erik
>
>
> On May 22, 2007, at 9:56 PM, Ryan McKinley wrote:
>
>> Ryan McKinley wrote:
>>> Yonik Seeley wrote:
>>>> The whole topTerms thing is exactly the same concept as faceting
>>>> with *:* as a base (with perhaps the exception of ignoring deleted
>>>> docs by using df?)
>>>> Should these parameters be aligned somehow?
>>>>
>>> Using the faceting implementation would be good too... since you
>>> would get all the caching etc.
>>> maybe it can directly use faceting parameters (and implementation)
>>> for "topTerms" -- if nothing is specified for "facet.field", it will
>>> add all fields (alternatively, normal faceting could support *, but
>>> that seems like a bad idea in the general case)
>>> I'll take a look at that and see how it feels...
>>
>> There are a few show stoppers with that idea.... most notably, the
>> faceting implementation needs a solr field. Much of the motivation
>> for the LukeRequestHandler is to inspect an index regardless of what
>> solr thinks about it.
>>
>> - - - -
>>
>> How do you imagine the parameters would be aligned?
>>
>> It could use the same per/field specification:
>> f.category.facet.limit=5
>>
>> Perhaps Luke should support:
>> terms.top=10
>> and
>> f.category.terms.top=10
>>
>> I'm reluctant to go this route because it makes deciding whether we
>> should calculate any top terms at all difficult (OK, awkward), and I'm
>> not sure it helps that much...
>>
>> I'll make a JIRA issue with a simple implementation you all can poke at.
>>
>> ryan
>
>
Re: Luke request handler issue
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Ryan - just so you know where my current need with this is, it's only
in getting field names and types, as well as total documents back.
The top terms aren't a need for my projects. So I don't really have
any preference on the specifics, other than needing to be able to
turn that feature off :)
Erik
On May 22, 2007, at 9:56 PM, Ryan McKinley wrote:
> Ryan McKinley wrote:
>> Yonik Seeley wrote:
>>> The whole topTerms thing is exactly the same concept as faceting
>>> with *:* as a base (with perhaps the exception of ignoring deleted
>>> docs by using df?)
>>> Should these parameters be aligned somehow?
>>>
>> Using the faceting implementation would be good too... since you
>> would get all the caching etc.
>> maybe it can directly use faceting parameters (and implementation)
>> for "topTerms" -- if nothing is specified for "facet.field", it
>> will add all fields (alternatively, normal faceting could support
>> *, but that seems like a bad idea in the general case)
>> I'll take a look at that and see how it feels...
>
> There are a few show stoppers with that idea.... most notably, the
> faceting implementation needs a solr field. Much of the motivation
> for the LukeRequestHandler is to inspect an index regardless of
> what solr thinks about it.
>
> - - - -
>
> How do you imagine the parameters would be aligned?
>
> It could use the same per/field specification:
> f.category.facet.limit=5
>
> Perhaps Luke should support:
> terms.top=10
> and
> f.category.terms.top=10
>
> I'm reluctant to go this route because it makes deciding whether we
> should calculate any top terms at all difficult (OK, awkward), and I'm
> not sure it helps that much...
>
> I'll make a JIRA issue with a simple implementation you all can
> poke at.
>
> ryan
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
Yonik Seeley wrote:
> On 5/23/07, Ryan McKinley <ry...@gmail.com> wrote:
>> > If someone wants to retrieve *all* of the terms in a specific field,
>> > it doesn't seem like they should have to get all of the terms in all
>> > other fields too, right?
>> >
>>
>> As implemented, you get the top terms for all the fields you ask for.
>> By default this is all of them. If you specify a field (with fl=xxx)
>> you only get that field's top terms:
>> http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000
>>
>> It may be useful to want 10 terms from field 'A' and 100 for field 'B',
>> but for now, that should probably be done with faceting.
>>
>> Faceting returns readable values (from the schema) while Luke deals with
>> the raw lucene index.
>
> Ah, yes... I see both as being useful.
> If solr does know about the fieldType, should the default be to use
> the external (human readable) values?
>
That's how it currently works:
  NamedList<Integer> list = new NamedList<Integer>();
  for (TermInfo i : aslist) {
    String txt = i.term.text();
    SchemaField ft = schema.getFieldOrNull( i.term.field() );
    if( ft != null ) {
      txt = ft.getType().indexedToReadable( txt );
    }
    list.add( txt, i.docFreq );
  }
  return list;
When you inspect a single document, it returns both.
ryan
Re: Luke request handler issue
Posted by Yonik Seeley <yo...@apache.org>.
On 5/23/07, Ryan McKinley <ry...@gmail.com> wrote:
> > If someone wants to retrieve *all* of the terms in a specific field,
> > it doesn't seem like they should have to get all of the terms in all
> > other fields too, right?
> >
>
> As implemented, you get the top terms for all the fields you ask for.
> By default this is all of them. If you specify a field (with fl=xxx)
> you only get that field's top terms:
> http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000
>
> It may be useful to want 10 terms from field 'A' and 100 for field 'B',
> but for now, that should probably be done with faceting.
>
> Faceting returns readable values (from the schema) while Luke deals with
> the raw lucene index.
Ah, yes... I see both as being useful.
If solr does know about the fieldType, should the default be to use
the external (human readable) values?
-Yonik
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
>
> If someone wants to retrieve *all* of the terms in a specific field,
> it doesn't seem like they should have to get all of the terms in all
> other fields too, right?
>
As implemented, you get the top terms for all the fields you ask for.
By default this is all of them. If you specify a field (with fl=xxx)
you only get that field's top terms:
http://localhost:8983/solr/admin/luke?fl=text&numTerms=1000
It may be useful to want 10 terms from field 'A' and 100 for field 'B',
but for now, that should probably be done with faceting.
Faceting returns readable values (from the schema) while Luke deals with
the raw lucene index.
> All this configurability doesn't need to be implemented now, but we
> should plan for it and leave room in the interface if possible.
>
That sounds good. For now, making numTerms=0 not walk through the terms should be
enough. The rest should come as we see a specific need for it.
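
As a rough illustration of the numTerms=0 behavior discussed here, the sketch below keeps the top N terms in a bounded priority queue and returns before touching the term iterator when numTerms is zero or negative. It is plain Java over an in-memory iterator, not the LukeRequestHandler code; the class name TopTermsSketch and the method topTerms are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Solr: top-N terms by document frequency.
class TopTermsSketch {

    // Returns up to numTerms (term, docFreq) entries, highest docFreq first.
    // When numTerms <= 0 the iterator is never advanced, so no term walk happens.
    static List<Map.Entry<String, Integer>> topTerms(
            Iterator<Map.Entry<String, Integer>> terms, int numTerms) {
        if (numTerms <= 0) {
            return Collections.emptyList();
        }
        // Min-heap on docFreq: the smallest of the current top N sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>(
                numTerms, (a, b) -> Integer.compare(a.getValue(), b.getValue()));
        while (terms.hasNext()) {
            queue.offer(terms.next());
            if (queue.size() > numTerms) {
                queue.poll(); // drop the least frequent term
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(queue);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("solr", 7);
        docFreq.put("lucene", 12);
        docFreq.put("luke", 3);
        System.out.println(topTerms(docFreq.entrySet().iterator(), 2));
        System.out.println(topTerms(docFreq.entrySet().iterator(), 0));
    }
}
```

The early return is the whole point: with numTerms=0 the cost no longer depends on how many terms the index holds.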
Re: Luke request handler issue
Posted by Yonik Seeley <yo...@apache.org>.
On 5/22/07, Ryan McKinley <ry...@gmail.com> wrote:
> How do you imagine the parameters would be aligned?
It just seemed like they were doing largely the same thing...
specify if you want terms enumerated in order, or sorted,
specify the number of top terms, etc.
> It could use the same per/field specification:
> f.category.facet.limit=5
>
> Perhaps Luke should support:
> terms.top=10
> and
> f.category.terms.top=10
>
> I'm reluctant to go this route because it makes deciding whether we
> should calculate any top terms at all difficult (OK, awkward), and I'm
> not sure it helps that much...
Then one could have topTerms=true like highlighting/faceting do, or
one could perhaps specify a field list
topTerms=fooField,barField
or
topTerms=*
If someone wants to retrieve *all* of the terms in a specific field,
it doesn't seem like they should have to get all of the terms in all
other fields too, right?
All this configurability doesn't need to be implemented now, but we
should plan for it and leave room in the interface if possible.
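
To make the topTerms=true / topTerms=fooField,barField / topTerms=* idea concrete, here is a minimal sketch of how such a value could be resolved into a set of field names. Nothing like this exists in the handler; the class TopTermsParam, the method resolve, and the treatment of "false" and empty values are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parser for a "topTerms" request parameter.
class TopTermsParam {

    // Resolves the raw parameter value to the fields that should get top terms.
    // null, "" or "false" -> no fields; "true" or "*" -> every field in the index;
    // anything else is read as a comma-separated field list.
    // Assumption: a field literally named "true" or "false" cannot be selected this way.
    static Set<String> resolve(String value, Collection<String> allFields) {
        Set<String> selected = new LinkedHashSet<>();
        if (value == null) {
            return selected;
        }
        String v = value.trim();
        if (v.isEmpty() || v.equalsIgnoreCase("false")) {
            return selected;
        }
        if (v.equals("*") || v.equalsIgnoreCase("true")) {
            selected.addAll(allFields);
            return selected;
        }
        for (String name : v.split(",")) {
            String field = name.trim();
            if (!field.isEmpty()) {
                selected.add(field); // kept even if the schema does not know the field
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("id", "text", "category");
        System.out.println(resolve("*", fields));
        System.out.println(resolve("fooField,barField", fields));
        System.out.println(resolve(null, fields));
    }
}
```

Listed fields are not checked against the schema, in keeping with the goal of inspecting the index regardless of what Solr knows about it.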
-Yonik
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
Ryan McKinley wrote:
> Yonik Seeley wrote:
>> The whole topTerms thing is exactly the same concept as faceting
>> with *:* as a base (with perhaps the exception of ignoring deleted
>> docs by using df?)
>> Should these parameters be aligned somehow?
>>
>
> Using the faceting implementation would be good too... since you would
> get all the caching etc.
>
> maybe it can directly use faceting parameters (and implementation) for
> "topTerms" -- if nothing is specified for "facet.field", it will add all
> fields (alternatively, normal faceting could support *, but that seems
> like a bad idea in the general case)
>
> I'll take a look at that and see how it feels...
>
There are a few show stoppers with that idea.... most notably, the
faceting implementation needs a solr field. Much of the motivation for
the LukeRequestHandler is to inspect an index regardless of what solr
thinks about it.
- - - -
How do you imagine the parameters would be aligned?
It could use the same per/field specification:
f.category.facet.limit=5
Perhaps Luke should support:
terms.top=10
and
f.category.terms.top=10
I'm reluctant to go this route because it makes deciding whether we should
calculate any top terms at all difficult (OK, awkward), and I'm not sure it
helps that much...
I'll make a JIRA issue with a simple implementation you all can poke at.
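
For reference, the per-field lookup suggested above (terms.top with an f.category.terms.top override) could be sketched roughly like this. It works on a plain parameter map rather than Solr's request objects; the class PerFieldParamSketch and the method fieldInt are hypothetical names used only here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field parameter lookup: f.<field>.<name> overrides <name>.
class PerFieldParamSketch {

    // Returns the field-specific value if present, else the global value, else dflt.
    static int fieldInt(Map<String, String> params, String field, String name, int dflt) {
        String value = params.get("f." + field + "." + name);
        if (value == null) {
            value = params.get(name);
        }
        return value == null ? dflt : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("terms.top", "10");
        params.put("f.category.terms.top", "5");
        System.out.println(fieldInt(params, "category", "terms.top", 0));
        System.out.println(fieldInt(params, "text", "terms.top", 0));
    }
}
```

With a default of 0, a field computes top terms only when either the global or the per-field value asks for them, which is one way to answer the "should we calculate top terms at all" question per field.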
ryan
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
Yonik Seeley wrote:
> The whole topTerms thing is exactly the same concept as faceting
> with *:* as a base (with perhaps the exception of ignoring deleted
> docs by using df?)
> Should these parameters be aligned somehow?
>
Using the faceting implementation would be good too... since you would
get all the caching etc.
maybe it can directly use faceting parameters (and implementation) for
"topTerms" -- if nothing is specified for "facet.field", it will add all
fields (alternatively, normal faceting could support *, but that seems
like a bad idea in the general case)
I'll take a look at that and see how it feels...
Re: Luke request handler issue
Posted by Yonik Seeley <yo...@apache.org>.
The whole topTerms thing is exactly the same concept as faceting
with *:* as a base (with perhaps the exception of ignoring deleted
docs by using df?)
Should these parameters be aligned somehow?
-Yonik
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
Yonik Seeley wrote:
> On 5/22/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> I think all the features of the Luke request handler should be made
>> optional, except for just feeding back the fields and types in the
>> index which seems a reasonable always-on feature.
>
> +1
>
> The feature list of this handler could grow, and I think having
> explicit ways to turn them on is the right way to go (like
> standard/dismax request handlers have for faceting, highlighting, etc)
>
> -Yonik
>
Sounds good... There are two reasons you would turn on/off options:
speed and display.
Since many of the features are calculated together, we need a parameter
interface that makes sense. For example, turning off "TopTerms" also
turns off "distinct terms" and the histogram... If you want the
histogram, but not the TopTerms, it takes the same time - and even
calculates the TopTerms - but does not return them.
A quick look at the features/hierarchy you may want to turn on/off:
* indexinfo (numDocs, maxDocs, etc)
* terms (has to walk through reader.terms())
  * distinct (simple ++)
  * top (PriorityQueue)
  * histogram (Hash)
* luceneFieldInfo (has to call "fieldName:[* TO *]")
  * flags
  * numDocs
  * key
One option would be to enumerate these features and have a boolean param
for each thing in the hierarchy. If the parent is set, but none of the
children, it assumes true for all of them.
For example: (assuming the default for everything is false)
/luke?return.indexinfo=on&return.terms=true
would return:
  indexinfo,
  terms.distinct,
  terms.top,
  terms.histogram
/luke?return.terms.histogram=true
would return only:
  terms.histogram
(but note that top terms would still be computed)
Perhaps there should also be a:
return.default=true/false
that decides what the default for *all* of the features is.
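
A minimal sketch of how these hierarchical flags might be resolved, under one reading of the rules above: an explicit child value wins, a parent value applies to its children only when no child is set explicitly, and everything else falls back to return.default. Only indexinfo and terms are modeled. The class ReturnFlagsSketch, the method names, and the set of values accepted as "true" are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolver for the proposed return.* parameters.
class ReturnFlagsSketch {

    // Parent feature -> children; an empty child list means the parent is a leaf.
    static final Map<String, List<String>> HIERARCHY = new LinkedHashMap<>();
    static {
        HIERARCHY.put("indexinfo", new ArrayList<String>());
        HIERARCHY.put("terms", Arrays.asList("distinct", "top", "histogram"));
    }

    // Assumption: "true", "on" and "yes" all mean enabled; anything else is false.
    static boolean parseFlag(String v) {
        return v != null && (v.equalsIgnoreCase("true")
                || v.equalsIgnoreCase("on") || v.equalsIgnoreCase("yes"));
    }

    // Returns the names of the features that should be included in the response.
    static List<String> resolve(Map<String, String> params) {
        boolean dflt = parseFlag(params.get("return.default"));
        List<String> enabled = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : HIERARCHY.entrySet()) {
            String parent = entry.getKey();
            List<String> children = entry.getValue();
            String parentValue = params.get("return." + parent);
            if (children.isEmpty()) {
                if (parentValue == null ? dflt : parseFlag(parentValue)) {
                    enabled.add(parent);
                }
                continue;
            }
            boolean anyChildSet = false;
            for (String child : children) {
                if (params.get("return." + parent + "." + child) != null) {
                    anyChildSet = true;
                }
            }
            for (String child : children) {
                String childValue = params.get("return." + parent + "." + child);
                boolean on;
                if (childValue != null) {
                    on = parseFlag(childValue);      // explicit child setting wins
                } else if (parentValue != null && !anyChildSet) {
                    on = parseFlag(parentValue);     // parent set, no child set
                } else {
                    on = dflt;                       // fall back to return.default
                }
                if (on) {
                    enabled.add(parent + "." + child);
                }
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("return.indexinfo", "on");
        params.put("return.terms", "true");
        System.out.println(resolve(params));
    }
}
```

This only decides what is returned; as noted above, asking for the histogram alone would still require computing the top terms internally.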
Thoughts? Is this what you had in mind? Is there a better parameter
name to consider than "return.terms=false"?
ryan
Re: Luke request handler issue
Posted by Yonik Seeley <yo...@apache.org>.
On 5/22/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> I think all the features of the Luke request handler should be made
> optional, except for just feeding back the fields and types in the
> index which seems a reasonable always-on feature.
+1
The feature list of this handler could grow, and I think having
explicit ways to turn them on is the right way to go (like
standard/dismax request handlers have for faceting, highlighting, etc)
-Yonik
Re: Luke request handler issue
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 21, 2007, at 11:00 PM, Ryan McKinley wrote:
>> If Ryan is date-less again tonight, I'm sure he'll have it all
>> fixed up by the time I wake up :) Otherwise, I'll dig in and roll
>> up my sleeves sometime this week and make some adjustments to
>> allow turning off the top terms feature.
>
> Just back from a week of nothing that goes beep... it was great!
Nice!
> Perhaps numTerms=0 would skip all term collecting... also, it may
> be worth caching the top terms, but I don't really want to go there
> just yet.
I think all the features of the Luke request handler should be made
optional, except for just feeding back the fields and types in the
index which seems a reasonable always-on feature.
> Did you get a chance to look at the code? Or should I add the
> numTerms=0 behavior?
I hacked my working copy to turn off all the term counting stuff for
demo purposes (diff posted below). We need to more robustly
overhaul it though.
Erik
$ svn diff
Index: java/org/apache/solr/handler/admin/LukeRequestHandler.java
===================================================================
--- java/org/apache/solr/handler/admin/LukeRequestHandler.java (revision 538798)
+++ java/org/apache/solr/handler/admin/LukeRequestHandler.java (working copy)
@@ -81,7 +81,7 @@
public static final String NUMTERMS = "numTerms";
public static final String DOC_ID = "docId";
public static final String ID = "id";
- public static final int DEFAULT_COUNT = 10;
+ public static final int DEFAULT_COUNT = 0;
@Override
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception
@@ -272,8 +272,11 @@
IndexReader reader = searcher.getReader();
IndexSchema schema = searcher.getSchema();
- // Walk the term enum and keep a priority quey for each map in our set
- Map<String,TopTermQueue> ttinfo = getTopTerms(reader, fields, numTerms, null );
+ Map<String,TopTermQueue> ttinfo = null;
+ if (numTerms > 0) {
+ // Walk the term enum and keep a priority quey for each map in our set
+ ttinfo = getTopTerms(reader, fields, numTerms, null );
+ }
SimpleOrderedMap<Object> finfo = new SimpleOrderedMap<Object>();
Collection<String> fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL);
for (String fieldName : fieldNames) {
@@ -288,8 +291,9 @@
f.add( "type", (ftype==null)?null:ftype.getTypeName() );
f.add( "schema", getFieldFlags( sfield ) );
-
- Query q = qp.parse( fieldName+":[* TO *]" );
+
+ // TODO: this could use a constant scoring range query instead of parsing
+/* Query q = qp.parse( fieldName+":[* TO *]" );
int docCount = searcher.numDocs( q, matchAllDocs );
if( docCount > 0 ) {
// Find a document with this field
@@ -311,16 +315,19 @@
// Find one document so we can get the fieldable
}
f.add( "docs", docCount );
-
- TopTermQueue topTerms = ttinfo.get( fieldName );
- if( topTerms != null ) {
- f.add( "distinct", topTerms.distinctTerms );
-
- // Include top terms
- f.add( "topTerms", topTerms.toNamedList( searcher.getSchema() ) );
+*/
- // Add a histogram
- f.add( "histogram", topTerms.histogram.toNamedList() );
+ if (ttinfo != null) {
+ TopTermQueue topTerms = ttinfo.get( fieldName );
+ if( topTerms != null ) {
+ f.add( "distinct", topTerms.distinctTerms );
+
+ // Include top terms
+ f.add( "topTerms", topTerms.toNamedList( searcher.getSchema() ) );
+
+ // Add a histogram
+ f.add( "histogram", topTerms.histogram.toNamedList() );
+ }
}
// Add the field
@@ -333,12 +340,12 @@
private static SimpleOrderedMap<Object> getIndexInfo( IndexReader reader ) throws IOException
{
// Count the terms
- TermEnum te = reader.terms();
+// TermEnum te = reader.terms();
int numTerms = 0;
- while (te.next()) {
- numTerms++;
- }
-
+// while (te.next()) {
+// numTerms++;
+// }
+
Directory dir = reader.directory();
SimpleOrderedMap<Object> indexInfo = new SimpleOrderedMap<Object>();
indexInfo.add("numDocs", reader.numDocs());
Re: Luke request handler issue
Posted by Ryan McKinley <ry...@gmail.com>.
>
> If Ryan is date-less again tonight, I'm sure he'll have it all fixed up
> by the time I wake up :) Otherwise, I'll dig in and roll up my sleeves
> sometime this week and make some adjustments to allow turning off the
> top terms feature.
>
Just back from a week of nothing that goes beep... it was great! Now my
girlfriend has to study for the bar, so I will have more time ;)
Perhaps numTerms=0 would skip all term collecting... also, it may be
worth caching the top terms, but I don't really want to go there just yet.
Yonik suggested numTerms should be specified per field. This is easy
enough...
Did you get a chance to look at the code? Or should I add the
numTerms=0 behavior?
ryan
Re: Luke request handler issue
Posted by Yonik Seeley <yo...@apache.org>.
On 5/15/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> I've switched Flare to use the Luke request handler simply to
> retrieve the fields in the index.
>
> In the case of a 3.7M document index, it takes a LONG time to execute
> because of the top terms it's generating. I tried setting numTerms=0
> and got an array index out of bounds exception.
I never had a chance to check out the luke handler, but since it's still
experimental, I think all of the "parts" should be optional.
Something that takes as long as generating top terms should also be
able to be specified per-field IMO.
-Yonik