Posted to solr-user@lucene.apache.org by Tracey Jaquith <tr...@archive.org> on 2007/01/11 08:24:14 UTC

listing/enumerating field information

The Internet Archive is getting close to going live with Solr.
I have two remaining classes of problems.

1) across the entire index, enumerate all the unique values for a given 
field.
2) we use unrestricted dynamicField additions from documents.  (that is 
our users are free to add any named field they like to their document's 
data (which is metadata for their item)).  we want to list all the 
unique field names in the index.

Eg:
<doc>
  ...
 <mediatype>audio</mediatype>
</doc>
<doc>
  ...
  <mediatype>movies</mediatype>
  <collection>prelinger</collection>
</doc>

1) would yield a list of audio and movies if the field passed in was 
mediatype
2) would yield a list of  mediatype and collection
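
To make the two requests concrete, here is a minimal sketch of what each one computes. It runs over plain in-memory maps rather than a Solr index; the class and method names are made up for illustration and are not part of Solr or Lucene.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: documents are plain maps, not Solr/Lucene documents.
final class FieldEnumerationSketch {

    // Request 1: every distinct value seen for one field across all documents.
    static SortedSet<String> uniqueValues(List<Map<String, String>> docs, String field) {
        SortedSet<String> values = new TreeSet<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(field);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    // Request 2: every distinct field name used by at least one document.
    static SortedSet<String> uniqueFieldNames(List<Map<String, String>> docs) {
        SortedSet<String> names = new TreeSet<>();
        for (Map<String, String> doc : docs) {
            names.addAll(doc.keySet());
        }
        return names;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("mediatype", "audio"),
            Map.of("mediatype", "movies", "collection", "prelinger"));
        System.out.println(uniqueValues(docs, "mediatype")); // [audio, movies]
        System.out.println(uniqueFieldNames(docs));          // [collection, mediatype]
    }
}
```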


From our prior implementation of a java + lucene search engine, we already
ran into queries that our SE could not handle.  So we nightly build a cache
structure to handle those other queries.  We *could* solve 1) and 2) in
this nightly cache, but ideally we'd like to use Solr if possible.

thanks!
--tracey


-- 
Tracey Jaquith - http://www.archive.org/~tracey

Re: listing/enumerating field information

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 19, 2007, at 9:59 PM, Erik Hatcher wrote:
> I think a standalone request handler to return all the field names  
> in the index would be great.  Anything this thing should return  
> other than a list of field names?  I think this should be  
> standalone because it is separate from a search, and in my use case  
> it would be something solrb would load up once (per connection to  
> Solr perhaps) and use to dynamically show what can be faceted on,  
> sorted on, searched on, etc.  Combine that with a fetch of  
> schema.xml, and a client would have a pretty good picture of what  
> is under the covers.  Maybe the request handler that returns the  
> field names could also combine that with the schema information,  
> flattened a bit.

I just created a very simple StructureRequestHandler:

	<https://issues.apache.org/jira/browse/SOLR-116>

For the request http://localhost:8983/solr/select/?qt=structure&wt=ruby
on the example index, it returns this:

{'responseHeader'=>{'status'=>0,'QTime'=>2},
 'fields'=>{
  'includes'=>'text',
  'cat'=>'text_ws',
  'alphaNameSort'=>'alphaOnlySort',
  'id'=>'string',
  'text'=>'text',
  'manu_exact'=>'string',
  'features'=>'text',
  'price'=>'sfloat',
  'incubationdate_dt'=>'date',
  'timestamp'=>'date',
  'sku'=>'textTight',
  'name'=>'text',
  'nameSort'=>'string',
  'manu'=>'text',
  'weight'=>'sfloat',
  'inStock'=>'boolean',
  'popularity'=>'sint'}}

(the example pasted into the JIRA issue was from an older version of  
the example index; I just refreshed it from the trunk .xml files for  
this message).  In this example, the one field you'd not get by  
reading schema.xml is incubationdate_dt.

Suggestions for improvement?

I'll be happy to document and commit when given the thumbs up.

	Erik


Re: listing/enumerating field information

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 19, 2007, at 11:34 PM, Yonik Seeley wrote:

> On 1/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> So, where are we with these things now?
>>
>>
>> The one delta for my use case is getting ones that start with a
>> certain prefix.  I'll look into adding this when I can.
>>
>
> The code for most of these things is easy... the hardest part is
> coming up with interfaces flexible enough that you don't box yourself
> in later and make things ugly.
>
> For example, how does one specify a prefix constraint while  
> faceting a field:
>
> facet.field=myfield&f.myfield.terms=foo*
>
> But what if you also wanted to facet a different way on the same field
> in the same request?
> When we came up with per-field params for highlighting, it worked well
> because everything was field based.  It no longer really is.
>
> One could use (abuse) the per-field-param capability to do  
> something like:
>
> facet.id=1
> f.1.field=myfield
> f.1.terms=foo*
> facet.id=2
> f.2.field=myfield
> f.2.terms=bar*
>
> I'm not impressed.  I'm leaning toward being practical (punting and
> going with the first form for now... the upshot being that you can't
> facet on the same field multiple ways.)
>
> Thoughts?

The latter example is overengineering this a bit.  My use case is
simply to return terms that match a prefix within some given
constraints (q + fq), and for each of those terms also return the
counts.  Requesting more than one field for faceting isn't currently
needed.  We're talking about a text box wired to some JavaScript that  
returns the terms and counts back only for a single field.

I envisioned a specialized request handler that leverages what it can  
from the existing faceting infrastructure.  Though with your first  
example, the standard handler would be sufficient.  While it makes  
perfect sense to return search results at the same time you request  
facets (except for successive pages, when facets won't change, of  
course), for the suggest behavior I'd set rows=0.

I personally tend towards smaller single purpose handlers rather than  
trying to make the "standard" one do it all.

	Erik



Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> So, where are we with these things now?
>
> On Jan 11, 2007, at 2:24 AM, Tracey Jaquith wrote:
> > 1) across the entire index, enumerate all the unique values for a
> > given field.
>
> &facet=true&facet.field={fieldname} can do this with the standard
> request handler.  Right?
>
> The one delta for my use case is getting ones that start with a
> certain prefix.  I'll look into adding this when I can.
>
> > 2) we use unrestricted dynamicField additions from documents.
> > (that is our users are free to add any named field they like to
> > their document's data (which is metadata for their item)).  we want
> > to list all the unique field names in the index.
>
> I think a standalone request handler to return all the field names in
> the index would be great.  Anything this thing should return other
> than a list of field names?  I think this should be standalone
> because it is separate from a search, and in my use case it would be
> something solrb would load up once (per connection to Solr perhaps)
> and use to dynamically show what can be faceted on, sorted on,
> searched on, etc.  Combine that with a fetch of schema.xml, and a
> client would have a pretty good picture of what is under the covers.
> Maybe the request handler that returns the field names could also
> combine that with the schema information, flattened a bit.
>
> Thoughts?

The code for most of these things is easy... the hardest part is
coming up with interfaces flexible enough that you don't box yourself
in later and make things ugly.

For example, how does one specify a prefix constraint while faceting a field:

facet.field=myfield&f.myfield.terms=foo*

But what if you also wanted to facet a different way on the same field
in the same request?
When we came up with per-field params for highlighting, it worked well
because everything was field based.  It no longer really is.

One could use (abuse) the per-field-param capability to do something like:

facet.id=1
f.1.field=myfield
f.1.terms=foo*
facet.id=2
f.2.field=myfield
f.2.terms=bar*

I'm not impressed.  I'm leaning toward being practical (punting and
going with the first form for now... the upshot being that you can't
facet on the same field multiple ways.)
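
For readers unfamiliar with the per-field convention, a rough sketch of how an "f.<field>.<param>" lookup could work. The fallback to a global "<param>" and the helper name are assumptions for illustration, not Solr's actual code, and "terms" is only the parameter name proposed in this message.

```java
import java.util.Map;

// Illustrative only: request parameters are modelled as a flat String map.
final class PerFieldParamSketch {

    // Prefer "f.<field>.<param>"; fall back to the global "<param>" (assumed behaviour).
    static String fieldParam(Map<String, String> params, String field, String param) {
        String specific = params.get("f." + field + "." + param);
        return specific != null ? specific : params.get(param);
    }

    public static void main(String[] args) {
        Map<String, String> params = Map.of(
            "facet.field", "myfield",
            "f.myfield.terms", "foo*");
        System.out.println(fieldParam(params, "myfield", "terms")); // foo*
        System.out.println(fieldParam(params, "other", "terms"));   // null
    }
}
```

Because the key is built from the field name, one field can carry only one value for "terms", which is the limitation described above.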

Thoughts?

-Yonik

Re: listing/enumerating field information

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
So, where are we with these things now?

On Jan 11, 2007, at 2:24 AM, Tracey Jaquith wrote:
> 1) across the entire index, enumerate all the unique values for a  
> given field.

&facet=true&facet.field={fieldname} can do this with the standard  
request handler.  Right?
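
A minimal sketch of building such a request URL from a client. The base URL is the example one used elsewhere in this thread; the helper name and the choice of rows=0 (to skip the document list and return only facet counts) are assumptions for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: builds the query string, does not send the request.
final class FacetUrlSketch {

    static String facetUrl(String baseUrl, String query, String field) {
        return baseUrl
            + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&rows=0"          // assumption: only the facet counts are wanted
            + "&facet=true"
            + "&facet.field=" + URLEncoder.encode(field, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
            facetUrl("http://localhost:8983/solr/select", "mediatype:audio", "collection"));
    }
}
```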

The one delta for my use case is getting ones that start with a  
certain prefix.  I'll look into adding this when I can.

> 2) we use unrestricted dynamicField additions from documents.   
> (that is our users are free to add any named field they like to  
> their document's data (which is metadata for their item)).  we want  
> to list all the unique field names in the index.

I think a standalone request handler to return all the field names in  
the index would be great.  Anything this thing should return other  
than a list of field names?  I think this should be standalone  
because it is separate from a search, and in my use case it would be  
something solrb would load up once (per connection to Solr perhaps)  
and use to dynamically show what can be faceted on, sorted on,  
searched on, etc.  Combine that with a fetch of schema.xml, and a  
client would have a pretty good picture of what is under the covers.   
Maybe the request handler that returns the field names could also  
combine that with the schema information, flattened a bit.

Thoughts?

	Erik


Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/11/07, Tracey Jaquith <tr...@archive.org> wrote:
>  The Internet Archive is getting close to going live with Solr.
>  I have two remaining classes of problems.
>
>  1) across the entire index, enumerate all the unique values for a given field.
>  2) we use unrestricted dynamicField additions from documents.  (that is our users are free to add any named field they like to their document's data (which is metadata for their item)).  we want to list all the unique field names in the index.

Reasonable requests, they both seem like they would be useful additions to Solr.
I've considered doing (1) in the past, adding the doc frequency of each term.

Relying on the schema for (2) is slightly ambiguous.
Do you want a) all the fields defined by the schema, or b) all the
fields actually in the index (which may exclude some fields in the
schema if not used, but also include any dynamic fields in use).

For 2.b, we could use IndexReader.getFieldNames()

-Yonik

Re: listing/enumerating field information

Posted by Chris Hostetter <ho...@fucit.org>.
: >  Attempting to enumerate
: > all of the values for a field could be dangerous
:
: We do it for faceting :-)  But we don't drag it all into memory at once...

i meant trying to return them all to the user at one time ... even if we
decreased the server side memory usage risk by supporting Iterators in the
OutputWriters, we could still wind up slamming the client with a large
reply (theoretically: an infinite list)

basically i'm just arguing that we design the API to have a built in "limit"
concept, and default it to something manageable (the same way we do for
term based facet counts)

: Adding a start and end (like a range query) is a great idea!

oh yeah ... i hadn't considered an "end" ... just a limit, but it would be
trivial to support both.

: Perhaps adding Iterator or Iterable to the list of supported types in
: TextWriter would be a nice general way to go.

yeah ... Iterable would probably make more sense since it's the more
generic API and would allow people to pass truly "lazy" objects to the
SolrQueryResponse (where the iterator() method does the initialization
work)
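
A rough sketch of that kind of "lazy" object: nothing is opened until the writer calls iterator(). The class names and the comma-joining write() method are hypothetical stand-ins, not Solr's SolrQueryResponse or TextWriter.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: an Iterable that defers its setup work to iterator().
final class LazyTermsSketch implements Iterable<String> {
    private final Supplier<Iterator<String>> source;
    private int opened = 0;

    LazyTermsSketch(Supplier<Iterator<String>> source) {
        this.source = source;
    }

    @Override
    public Iterator<String> iterator() {
        opened++;            // the "initialization work" happens here, not in the constructor
        return source.get();
    }

    int timesOpened() {
        return opened;
    }

    // Stand-in for a response writer: consumes one term at a time.
    static String write(Iterable<String> terms) {
        StringBuilder out = new StringBuilder();
        for (String term : terms) {
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LazyTermsSketch terms = new LazyTermsSketch(() -> List.of("audio", "movies").iterator());
        System.out.println(terms.timesOpened()); // 0
        System.out.println(write(terms));        // audio,movies
        System.out.println(terms.timesOpened()); // 1
    }
}
```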

...that seems like a separate (but related) issue to having an easy way to
access Term/Field stats.


-Hoss


Re: listing/enumerating field information

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 12, 2007, at 8:26 PM, Chris Hostetter wrote:
> Yeah ... what Erik's talking about really sounds like a simple faceting
> issue: supporting prefixes for limiting the list of constraints ... what
> Tracey was talking about seems much more like Luke-esque index info: what
> are all the fields, and what are all the terms in those fields. (and what
> is the docFreq of each of those terms)

The only difference between that last bit and my stuff is that the
faceting in my stuff returns all terms in a field only for the
documents that match some constraints (fq kinda stuff).

> : >  Paging via start/
> : > rows is necessary,
> :
> : Hmmm, really?  Not too hard I guess, but I'd be interested in how you use it.
>
> yeah paginating facet constraints doesn't really make sense to me ...
> unless you're just talking about paginating the main results like we already
> do now.  for the facets themselves facet.limit seems good enough ..
> especially since it already sorts.

Suppose there are 1000 terms that start with "foo"... the user
would like to see them all, but page through them rather than see  
them all at once.  facet.limit could be increased to get more back,  
but I'd like to have the ability to get 50 at a time, in pages.
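
A minimal sketch of the paging described here, over an in-memory sorted term list. The method name and the start/rows parameters mirror the main result paging but are assumptions for illustration; this is not an existing Solr facet option.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: pages through the terms that share a prefix.
final class TermPagingSketch {

    // sortedTerms is assumed to be in index (lexicographic) order already.
    static List<String> page(List<String> sortedTerms, String prefix, int start, int rows) {
        return sortedTerms.stream()
            .filter(term -> term.startsWith(prefix))
            .skip(start)     // terms already shown on earlier pages
            .limit(rows)     // page size, e.g. 50
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("bar", "foo", "food", "fool", "foot", "fox");
        System.out.println(page(terms, "foo", 0, 2)); // [foo, food]
        System.out.println(page(terms, "foo", 2, 2)); // [fool, foot]
    }
}
```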

> I get why pagination is important in Tracey's use case though (where
> she's just dumping all terms)

My stuff dumps terms too, just not all of them, only ones that start
with a prefix, which could still be more than the client wants in one
shot.  Sure, the client could buffer it, but it seems Solr would
be better placed to do this efficiently.

	Erik


Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/12/07, Chris Hostetter <ho...@fucit.org> wrote:
> what
> Tracey was talking about seems much more like Luke-esque index info: what
> are all the fields, and what are all the terms in those fields. (and what
> is the docFreq of each of those terms)

Right... it's very similar in appearance, but a big difference in
resource usage.

-Yonik

Re: listing/enumerating field information

Posted by Chris Hostetter <ho...@fucit.org>.
: Sounds like this stuff could/should be extensions to the current facet.field

Yeah ... what Erik's talking about really sounds like a simple faceting
issue: supporting prefixes for limiting the list of constraints ... what
Tracey was talking about seems much more like Luke-esque index info: what
are all the fields, and what are all the terms in those fields. (and what
is the docFreq of each of those terms)

: facet.field=year&f.year.facet.prefix=186
:   or
: facet.field=year,f.year.facet.terms=186*
:   or
: facet.terms=year:186*
: #the latter two forms could relatively easily allow for anything that
: could be converted
: #to a TermEnum, so all wildcards could be handled.  Just parse with
: the queryparser and
: # check what type of query comes out?

the notion of a bunch of "if (q instanceof PrefixQuery) { ... } else if
(q instanceof WildcardQuery) { ... } ..." doesn't really appeal to me ...
but i could certainly get on board a new "f.*.facet.prefix" param.

: >  Paging via start/
: > rows is necessary,
:
: Hmmm, really?  Not too hard I guess, but I'd be interested in how you use it.

yeah paginating facet constraints doesn't really make sense to me ...
unless you're just talking about paginating the main results like we already
do now.  for the facets themselves facet.limit seems good enough ..
especially since it already sorts.

I get why pagination is important in Tracey's use case though (where
she's just dumping all terms)


-Hoss


Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/12/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> What the user-interface needs is a way to ask Solr for terms that
> begin with a specified prefix, as the user types.

Sounds like this stuff could/should be extensions to the current facet.field
facet.field=year&f.year.facet.prefix=186
  or
facet.field=year,f.year.facet.terms=186*
  or
facet.terms=year:186*
#the latter two forms could relatively easily allow for anything that
could be converted
#to a TermEnum, so all wildcards could be handled.  Just parse with
the queryparser and
# check what type of query comes out?

With the last form, facet.terms=year:* is equivalent to facet.field=year

supporting start/end terms:
facet.field=year&f.year.facet.startterm=1860&f.year.facet.endterm=1899
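
A small sketch of what a start/end term constraint would select, using a sorted map as a stand-in for the field's term dictionary. Treating both bounds as inclusive is an assumption; the proposal above does not say.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: term -> count pairs held in a sorted in-memory map.
final class TermRangeSketch {

    // Returns the terms between startTerm and endTerm (both inclusive, by assumption).
    static NavigableMap<String, Integer> termsBetween(
            NavigableMap<String, Integer> termCounts, String startTerm, String endTerm) {
        return termCounts.subMap(startTerm, true, endTerm, true);
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> years = new TreeMap<>();
        years.put("1859", 3);
        years.put("1860", 5);
        years.put("1875", 2);
        years.put("1899", 1);
        years.put("1900", 7);
        System.out.println(termsBetween(years, "1860", "1899")); // {1860=5, 1875=2, 1899=1}
    }
}
```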


>  Paging via start/
> rows is necessary,

Hmmm, really?  Not too hard I guess, but I'd be interested in how you use it.

> and also sorting by frequency given some specified constraints.
Faceting code does this when given a limit... did you mean anything else?


> As for Hoss's suggestion of a Stats handler - I still hold the
> opinion that all of the admin JSPs really ought to be first class
> request handlers that go through the whole ResponseWriter stuff, so I
> can get all of that great capability in Ruby format instead of XML.
> As it is, to build a Ruby API to Solr and provide access to the
> stats, there has to be two different parsing mechanisms.  I know he
> meant index stats, not Solr admin stats, but it reminded me of the
> XML pain I'm going to feel in solrb to add Solr stats :)

Specific stats are subject to change... will you be treating them generically?

I think the path of least resistance is to temporarily forget about
the admin pages, add an admin handler, and tailor the output to
programmatic parsing.  Decoupling from the admin GUI means it doesn't
become a requirement to port the current stylesheets and get
everything working again.  The admin pages could be moved over later
by someone who has the time, inclination and gui skills to do so.

-Yonik

Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/14/07, Chris Hostetter <ho...@fucit.org> wrote:
> you know what's *really* annoying? .. that my girlfriend lives 3 timezones
> away.

OTOH I'm sure this factors into the extra time you can spend on Solr,
for which I'm sure many around here are glad :-)

-Yonik

Re: listing/enumerating field information

Posted by "J.J. Larrea" <jj...@panix.com>.
Hoss, I'm delighted to have annoyed you, if only *slightly*! ;-)

- J.J.

PS: +1 on Yonik's subsequent comment.

At 8:04 PM -0800 1/14/07, Chris Hostetter wrote:
>:   - Apply the faceting criteria (e.g. facet.zeros, though facet.mincount
>: would have been a more flexible option in all cases)
>
>you know what's *really* annoying? .. that my girlfriend lives 3 timezones
>away.
>
>you know what's *slightly* annoying? .. writing code that seems really
>generic and reusable, and then having someone point out months later that
>a numeric "minimum" argument is a billion times more generic and reusable
>than a boolean argument that means "ignore zero" -- and realizing that
>the numeric argument could have been done in the same amount of code.
>
>	:)
>
>Nice catch!
>
>
>-Hoss


Re: listing/enumerating field information

Posted by Chris Hostetter <ho...@fucit.org>.
:   - Apply the faceting criteria (e.g. facet.zeros, though facet.mincount
: would have been a more flexible option in all cases)

you know what's *really* annoying? .. that my girlfriend lives 3 timezones
away.

you know what's *slightly* annoying? .. writing code that seems really
generic and reusable, and then having someone point out months later that
a numeric "minimum" argument is a billion times more generic and reusable
than a boolean argument that means "ignore zero" -- and realizing that
the numeric argument could have been done in the same amount of code.

	:)

Nice catch!


-Hoss


Re: listing/enumerating field information

Posted by "J.J. Larrea" <jj...@panix.com>.
At 5:06 AM -0500 1/12/07, Erik Hatcher wrote:
>What the user-interface needs is a way to ask Solr for terms that begin with a specified prefix, as the user types.   Paging via start/rows is necessary, and also sorting by frequency given some specified constraints.  I like the start/end term idea also, though I can't think of a scenario in my application where this would be different than having a prefix parameter.  If I want all the 1860's, prefix=186&field=year, for example.

I also have exactly this requirement: Paging through the terms (and getting the document count for each term) optionally limited to those matching a supplied prefix (there can be thousands of terms for a prefix so start/rows is absolutely necessary even when prefixing). Choosing whether terms were sorted by index-order or document-count order would be a plus.

I would love to have this be provided by an extension to the Faceting logic, as suggested by Yonik and Hoss, incorporating the non-query pathway raised by Erik:
  - Assemble the list of term/frequency pairs for a field either by tallying the term references found in a DocList, or by using the term frequency information found in the index (optimization for non-query case)
  - Apply a criterion (RegExp based would obviously be most flexible -- no need for full Lucene query syntax -- but prefix-only might be an optimization that could be applied in the non-query case) to filter the terms, either during assembly or post-facto.
  - Apply the faceting criteria (e.g. facet.zeros, though facet.mincount would have been a more flexible option in all cases)
  - Optionally pass through the BoundedTreeSet/PriorityQueue mechanism to sort by frequency and in that case optionally keep only the top facet.limit terms
  - Cache the results with the query (including a special key for the non-query case) so paging could be done without any requerying, retallying, or resorting
  - Return in the response a subrange of the list
  - Naturally allow the full complement of response encodings
  - (Am I missing anything?)
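
The steps above can be sketched roughly as follows, over in-memory sets of terms rather than a DocList or the index. Caching and response encoding are left out, and every name here is hypothetical; it is one possible reading of the list, not Solr's faceting code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: each matching document is modelled as the set of its terms.
final class FacetPipelineSketch {

    static List<Map.Entry<String, Integer>> facet(
            List<Set<String>> matchingDocs, String prefix, int minCount,
            boolean sortByCount, int start, int rows) {
        // Tally term references, filtering by prefix during assembly.
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> docTerms : matchingDocs) {
            for (String term : docTerms) {
                if (term.startsWith(prefix)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        // Apply the numeric minimum (the "mincount" idea); entries stay in index order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                entries.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        // Optionally sort by descending count; the sort is stable, so ties keep index order.
        if (sortByCount) {
            entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        }
        // Return only the requested subrange.
        int from = Math.min(start, entries.size());
        int to = (int) Math.min((long) from + rows, entries.size());
        return new ArrayList<>(entries.subList(from, to));
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("solr", "solar"), Set.of("solr", "solo"),
            Set.of("solr", "lucene"), Set.of("solar"));
        System.out.println(facet(docs, "sol", 1, true, 0, 2)); // [solr=3, solar=2]
        System.out.println(facet(docs, "sol", 1, true, 2, 2)); // [solo=1]
    }
}
```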

While a commendable endeavor, this is a fair bit of work, and it may take a while before someone (perhaps me even) steps up to the plate, for performance if not functional considerations.  So IMHO it would also be worthwhile to craft a simpler index-only version.

>I would be thrilled if this just magically appeared in Solr's codebase before I have a chance to build it. :)

Well, after my current deadline (next week) passes, this functionality is on my  task list for my next milestone... so I'd be equally elated if I didn't have to write it myself. :-)

And adding 2 cents to the other topic in this thread...

>As for Hoss's suggestion of a Stats handler - I still hold the opinion that all of the admin JSPs really ought to be first class request handlers that go through the whole ResponseWriter stuff, so I can get all of that great capability in Ruby format instead of XML. 

Agreed in principle, though I'm an XML-person.

>As it is, to build a Ruby API to Solr and provide access to the stats, there has to be two different parsing mechanisms.  I know he meant index stats, not Solr admin stats, but it reminded me of the XML pain I'm going to feel in solrb to add Solr stats :)

I am happy to merely be a spectator of the Rubification of SOLR!

Also,

>On Jan 11, 2007, at 3:13 PM, Yonik Seeley wrote:
>>> Attempting to enumerate
>>>all of the values for a field could be dangerous
>>
>>We do it for faceting :-)  But we don't drag it all into memory at once...

Not entirely true: The FieldCache pathway of faceting single-valued fields does just that.  In some cases I've set multivalued=true even when it's not accurate in order to force the cached-filter pathway.

- J.J.

Re: listing/enumerating field information

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Let's take this a step further, like I do with a (messy) custom
request handler in Collex.  For an example, go to
http://www.nines.org/collex and type "sol" into the (slightly misnamed
from our technical perspective) "phrase" text box.  The drop-down shows
all terms beginning with "sol", and _also_ the counts of the  
documents *within the current constraints*.  Add some constraints and  
you'll see the results in the drop-down change.

       Terms are facets too!

My hacked code is here:
<http://patacriticism.svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/FacetRequestHandler.java?revision=483&view=markup>,
starting around line 152.

   152             TermEnum termEnum = reader.terms(new Term(field, prefix));
   153             while (true) {
   154               Term term = termEnum.term();
   155               if (term == null || !term.field().equals(field) || !term.text().startsWith(prefix)) break;
   156
   157               DocSet docSet = searcher.getDocSet(new TermQuery(term));
   158               int size = docSet.intersectionSize(constraintMask);
   159               if (size > 0) map.put(term.text(), size);
   160
   161               if (! termEnum.next()) break;
   162             }

Don't bother critiquing the code, I know it's an unscalable hack :/
As you'll see if you're crazy enough to peruse the rest of that code,  
the whole thing can practically be replaced with the Solr faceting,  
but I've got little custom things like this that make it trickier to  
replace than meets the eye.

Part of my Flare effort is to distill goodies from Collex (at least  
idea-wise, likely not copy/paste-wise).

What the user-interface needs is a way to ask Solr for terms that  
begin with a specified prefix, as the user types.   Paging via start/ 
rows is necessary, and also sorting by frequency given some specified  
constraints.  I like the start/end term idea also, though I can't  
think of a scenario in my application where this would be different  
than having a prefix parameter.  If I want all the 1860's,  
prefix=186&field=year, for example.

I would be thrilled if this just magically appeared in Solr's  
codebase before I have a chance to build it. :)

As for Hoss's suggestion of a Stats handler - I still hold the  
opinion that all of the admin JSPs really ought to be first class  
request handlers that go through the whole ResponseWriter stuff, so I  
can get all of that great capability in Ruby format instead of XML.   
As it is, to build a Ruby API to Solr and provide access to the  
stats, there has to be two different parsing mechanisms.  I know he  
meant index stats, not Solr admin stats, but it reminded me of the  
XML pain I'm going to feel in solrb to add Solr stats :)

	Erik


On Jan 11, 2007, at 3:13 PM, Yonik Seeley wrote:

> On 1/11/07, Chris Hostetter <ho...@fucit.org> wrote:
>> Writing a more generic "Stats" request handler that does what you're
>> describing certainly seems like a good idea.
>
> Hmmm, I hadn't thought of it as a separate handler, but as long as
> these types of requests aren't related to a base query, and not needed
> along with every query, I guess that could make sense.
>
>>  Attempting to enumerate
>> all of the values for a field could be dangerous
>
> We do it for faceting :-)  But we don't drag it all into memory at
> once...
>
>> but an API where the
>> client specifies a starting term and a number of terms and we use the
>> TermEnum.seek() would be fairly straight forward.
>
> Adding a start and end (like a range query) is a great idea!
> Additionally, I think adding support to incrementally write all the
> terms to the response might be important... loading them all into
> memory doesn't seem like a great idea.
>
> Perhaps adding Iterator or Iterable to the list of supported types in
> TextWriter would be a nice general way to go.
>
> -Yonik


Re: listing/enumerating field information

Posted by Yonik Seeley <yo...@apache.org>.
On 1/11/07, Chris Hostetter <ho...@fucit.org> wrote:
> Writing a more generic "Stats" request handler that does what you're
> describing certainly seems like a good idea.

Hmmm, I hadn't thought of it as a separate handler, but as long as
these types of requests aren't related to a base query, and not needed
along with every query, I guess that could make sense.

>  Attempting to enumerate
> all of the values for a field could be dangerous

We do it for faceting :-)  But we don't drag it all into memory at once...

> but an API where the
> client specifies a starting term and a number of terms and we use the
> TermEnum.seek() would be fairly straight forward.

Adding a start and end (like a range query) is a great idea!
Additionally, I think adding support to incrementally write all the
terms to the response might be important... loading them all into
memory doesn't seem like a great idea.

Perhaps adding Iterator or Iterable to the list of supported types in
TextWriter would be a nice general way to go.

-Yonik

Re: listing/enumerating field information

Posted by Chris Hostetter <ho...@fucit.org>.
: Code-searching for relevant lucene classes led me to try adding
:    <requestHandler name="test" class="solr.tst.TestRequestHandler"/>
: to my solrconfig.xml

holy cow, i forgot that thing even existed! ... as you can see by
skimming the code it's a hodge podge of misc crap that was used early on as
a simple way to test that things were working.

Writing a more generic "Stats" request handler that does what you're
describing certainly seems like a good idea.  Attempting to enumerate
all of the values for a field could be dangerous but an API where the
client specifies a starting term and a number of terms and we use the
TermEnum.seek() would be fairly straight forward.
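
A rough sketch of that API shape: the caller gives a starting term and a count, and gets back at most that many terms. A TreeSet stands in for the index's sorted term dictionary here, and tailSet() plays the role of positioning the term enumeration; the names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: a bounded window over a sorted set of terms.
final class TermWindowSketch {

    // Returns at most `count` terms, beginning at the first term >= startTerm.
    static List<String> termsFrom(NavigableSet<String> terms, String startTerm, int count) {
        List<String> window = new ArrayList<>();
        for (String term : terms.tailSet(startTerm, true)) {
            if (window.size() >= count) {
                break;   // the built-in bound keeps the reply small
            }
            window.add(term);
        }
        return window;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
            new TreeSet<>(List.of("audio", "movies", "software", "texts"));
        System.out.println(termsFrom(terms, "m", 2)); // [movies, software]
    }
}
```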



-Hoss


Re: listing/enumerating field information

Posted by Tracey Jaquith <tr...@archive.org>.
interesting! 

Code-searching for relevant lucene classes led me to try adding
   <requestHandler name="test" class="solr.tst.TestRequestHandler"/>
to my solrconfig.xml

This allowed me to try this request...
   http://localhost:8983/solr/select?rows=0&qt=test&q=fields
which I think gets me (2) below.

--tracey


Tracey Jaquith wrote:
>
> The Internet Archive is getting close to going live with Solr.
> I have two remaining classes of problems.
>
> 1) across the entire index, enumerate all the unique values for a 
> given field.
> 2) we use unrestricted dynamicField additions from documents.  (that 
> is our users are free to add any named field they like to their 
> document's data (which is metadata for their item)).  we want to list 
> all the unique field names in the index.
>
> Eg:
> <doc>
>   ...
>  <mediatype>audio</mediatype>
> </doc>
> <doc>
>   ...
>   <mediatype>movies</mediatype>
>   <collection>prelinger</collection>
> </doc>
>
> 1) would yield a list of audio and movies if the field passed in was 
> mediatype
> 2) would yield a list of  mediatype and collection
>
>
> From our prior implementation of a java + lucene search engine, we already
> ran into queries that our SE could not handle.  So we nightly build a
> cache
> structure to handle those other queries.  We *could* solve 1) and 2) in
> this nightly cache, but ideally we'd like to use Solr if possible.
>
> thanks!
> --tracey
>
>
> -- 
> Tracey Jaquith - http://www.archive.org/~tracey

-- 
Tracey Jaquith - http://www.archive.org/~tracey