Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2006/05/17 03:19:51 UTC

solr-suggestion - terms that "start with"...

User story: We have a lot of people's names in our data ("agents" that  
in some way contributed to a 19th century work).  We're refactoring  
our user interface to have a better navigation of these names, such  
that someone can just start typing and immediately (google-suggest  
style) see terms and their document frequency within a set of  
filters.  Someone types "yo", pauses, and "Yonik Seeley (37)"  
appears.  Also it would appear if someone typed "see".

Falling back on my Lucene know-how, I've gotten Solr to respond with  
almost what I need using this code:

       TreeMap map = new TreeMap();
       String prefix = req.getParam("prefix");

       try {
         TermEnum enumerator = reader.terms(new Term(facet, prefix));

         do {
           Term term = enumerator.term();
            if (term != null && term.field().equals(facet) && term.text().startsWith(prefix)) {
             DocSet docSet = searcher.getDocSet(new TermQuery(term));
             BitSet bits = docSet.getBits();
             bits.and(constraintMask);
             map.put(term.text(), bits.cardinality());
           } else {
             break;
           }
         }
         while (enumerator.next());
       } catch (IOException e) {
         rsp.setException(e);
         numErrors++;
         return;
       }

       rsp.add(facet, map);

I'm going on gut feeling that Solr provides some handy benefits for  
me in this regard.  For quick-and-dirty's sake I used DocSet.getBits()
and did things the way I know how in order to AND it with an  
existing constraintMask BitSet (built earlier in my custom request  
handler based on constraint parameters passed in).

The thing I'm missing is retrieving the stored field value and using  
that instead of term.text() in the data sent back to the client.  In  
the example mentioned above, I currently get back "yonik (37)" if  
"yo" was sent in as a prefix.  But I want the full stored field name,  
not the analyzed tokens.

Advice on how to implement what I'm after using Solr's infrastructure  
(or just Lucene's) is welcome.

Thanks,
	Erik


Re: solr-suggestion - terms that "start with"...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 19, 2006, at 4:53 PM, Chris Hostetter wrote:
> : it has is case sensitivity.  I could lowercase everything, but then
> : the terms the user sees will be in all lowercase and that simply
> : won't do for my scholarly audience :)
>
> picky, picky users.

Yeah, if it wasn't for them, I'd have it easy :)

> : It seems like what I really need is simply a separate index (or
> : rather a partition of the main Solr one) where a Document represents
> : an "agent", and do a PrefixQuery or TermEnum and get all unique  
> agents.
>
> i've let it roll around in my head for a few days, and i think that's
> exactly what i would do if it were me ... in fact, what you  
> describe is
> pretty much exactly what i do for product categories, except that i  
> think
> i store more metadata about each category than you would store  
> about your
> "agents".  what you really need is a way to search for agents by  
> term or
> term prefix, get a list of matching agents, and then use each agent  
> as a
> facet for your "works" .. I do the same thing, except my term  
> queries are
> on the unique id for the category, and my "prefix" queries are for the
> null prefix (ie: look at all categories) .. then once i have a  
> category, i
> have other data that helps me with further facets (ie: for digital
> cameras, "resolution" is a good facet).

In fact, I've implemented this locally as a custom SolrCache that  
holds a RAMDirectory.  I TermEnum the agents in the main index on
warm() and index all the agents into the RAMDirectory.  It is working well.
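
A minimal sketch of what that warm-time step might look like against the
Lucene 1.9-era API; the class name and the "agentRAW" source field are
illustrative assumptions, not Erik's actual code:

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.RAMDirectory;

  // Hypothetical sketch: enumerate the untokenized agent terms of the main
  // index at warm time and index each unique value as its own tiny Document
  // in a RAMDirectory, so suggestions become ordinary queries against it.
  public class AgentSuggestIndex {
    private final RAMDirectory dir = new RAMDirectory();

    public void warm(IndexReader mainReader) throws IOException {
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      TermEnum terms = mainReader.terms(new Term("agentRAW", ""));
      try {
        do {
          Term t = terms.term();
          if (t == null || !"agentRAW".equals(t.field())) break;
          Document doc = new Document();
          // tokenized so any word of the name is matchable by prefix,
          // stored so the original capitalization comes back for display
          doc.add(new Field("agent", t.text(), Field.Store.YES, Field.Index.TOKENIZED));
          writer.addDocument(doc);
        } while (terms.next());
      } finally {
        terms.close();
        writer.close();
      }
    }

    public RAMDirectory getDirectory() {
      return dir;
    }
  }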

> i could imagine the same extension eventually unfolding for your  
> agents
> ... i don't know much about literary works, but if we transition it  
> to art
> in general, you might have information for one artist about different
> "labels" that apply to the art he produced in his life (sculpture,
> painting, cubist, impressionist, modern, "blue period", etc..) and  
> once
> your user has selected a specific artist, you could use the list of  
> labels
> from a stored field of the artist's metadata doc to decide which  
> facets to
> offer the user in refining further.

We have metadata out the wazoo for this stuff.  We have "genres"  
which is a categorization of the type of work like "Painting",  
"Poetry", etc.  We have agents classified into roles.  The same  
person could be the author of one work, and a figure in a painting of  
another work, and the editor of another.  So even within agent the  
user interface will display the breakdown of each agent by the  
various roles.  *whew*

> : Maybe I need to build some sort of term -> agent cache during  
> warming
> : that makes this a no brainer?
>
> that's another way to go ... but if you make one doc per agent,  
> then this
> is just a subset of the filter cache .... i personally love the filter
> cache :)

I opted for the RAMDirectory so I can leverage Lucene scoring for the  
ordering of agents, rather than only alphabetical and frequency options.
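
For the lookup side, a hedged sketch of querying that RAM index with a
PrefixQuery.  It assumes the RAMDirectory built in the warm-time sketch
above, where StandardAnalyzer lowercased the indexed tokens, so a
lowercased prefix matches while the stored value keeps its original
capitalization; the class and method names are made up for illustration:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.PrefixQuery;
  import org.apache.lucene.store.RAMDirectory;

  public class AgentSuggester {
    // Hypothetical lookup against the warmed RAM index: agent names for
    // what the user has typed so far, ordered by Lucene score rather
    // than only alphabetically.
    public static List suggest(RAMDirectory agentDir, String userPrefix, int max)
        throws IOException {
      IndexSearcher searcher = new IndexSearcher(agentDir);
      try {
        Hits hits = searcher.search(
            new PrefixQuery(new Term("agent", userPrefix.toLowerCase())));
        List names = new ArrayList();
        for (int i = 0; i < hits.length() && i < max; i++) {
          names.add(hits.doc(i).get("agent"));   // full stored name for display
        }
        return names;
      } finally {
        searcher.close();
      }
    }
  }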

	Erik


Re: solr-suggestion - terms that "start with"...

Posted by Chris Hostetter <ho...@fucit.org>.
: it has is case sensitivity.  I could lowercase everything, but then
: the terms the user sees will be in all lowercase and that simply
: won't do for my scholarly audience :)

picky, picky users.

: It seems like what I really need is simply a separate index (or
: rather a partition of the main Solr one) where a Document represents
: an "agent", and do a PrefixQuery or TermEnum and get all unique agents.

i've let it roll around in my head for a few days, and i think that's
exactly what i would do if it were me ... in fact, what you describe is
pretty much exactly what i do for product categories, except that i think
i store more metadata about each category than you would store about your
"agents".  what you really need is a way to search for agents by term or
term prefix, get a list of matching agents, and then use each agent as a
facet for your "works" .. I do the same thing, except my term queries are
on the unique id for the category, and my "prefix" queries are for the
null prefix (ie: look at all categories) .. then once i have a category, i
have other data that helps me with further facets (ie: for digital
cameras, "resolution" is a good facet).

i could imagine the same extension eventually unfolding for your agents
... i don't know much about literary works, but if we transition it to art
in general, you might have information for one artist about different
"labels" that apply to the art he produced in his life (sculpture,
painting, cubist, impressionist, modern, "blue period", etc..) and once
your user has selected a specific artist, you could use the list of labels
from a stored field of the artist's metadata doc to decide which facets to
offer the user in refining further.

: Maybe I need to build some sort of term -> agent cache during warming
: that makes this a no brainer?

that's another way to go ... but if you make one doc per agent, then this
is just a subset of the filter cache .... i personally love the filter
cache :)



-Hoss


Re: solr-suggestion - terms that "start with"...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 17, 2006, at 2:18 PM, Chris Hostetter wrote:
> Off the top of my head, i can't think of any cool way Solr can help  
> you
> with this.  My best bet on how to solve the problem in general  
> would be an
> analyzer that doesn't do any tokenizing, but does create multiple  
> tokens
> at the same position after "rotating" all of the words, ie for the
> inputtext...
>
> 	"Dante Gabriel Rossetti"
>
> create the following tokens, all at the same position...
>
> 	"Dante Gabriel Rossetti"
> 	"Gabriel Rossetti, Dante"
> 	"Rossetti, Dante Gabriel"
>
> ...and then keep using TermEnum.

That's a pretty clever solution, actually.  However, the one negative  
it has is case sensitivity.  I could lowercase everything, but then  
the terms the user sees will be in all lowercase and that simply  
won't do for my scholarly audience :)

I suppose the general way to do this accurately, if in a rough fashion,  
would be to tokenize (StandardAnalyzer would be sufficient) the agent  
field, then during TermEnum'ing look up all the Documents for that  
term, grab the "agent" field, and display that - except the fiddly  
bit about agent being multivalued.

It seems like what I really need is simply a separate index (or  
rather a partition of the main Solr one) where a Document represents  
an "agent", and do a PrefixQuery or TermEnum and get all unique agents.

But partitioning the Solr index into various document types at this  
point seems overkill - though it is an area I'd like to explore.

Maybe I need to build some sort of term -> agent cache during warming  
that makes this a no brainer?

Thanks for all the excellent feedback.

And, I was successful in producing an auto-suggest lookup using the  
non-tokenized, case-sensitive terms and performance was quite fine -  
again this is a Ruby on Rails front-end hitting Solr, with the only  
caching occurring on the Solr side of things currently.

	Erik


Re: solr-suggestion - terms that "start with"...

Posted by Chris Hostetter <ho...@fucit.org>.
: That is currently how I have it set up.  The agent field is not
: tokenized.  However, I need it to be.  Here's a concrete example.
: "Dante Gabriel Rossetti" is one of the agents in our system.  Users
: should be able to find him by typing either "d", "g", or "r" (case
: insensitive) and they'd see "Dante Gabriel Rossetti (42)" in the
: suggest popup where 42 is the number of documents he's involved in
: given the constraints.

Ohhhh... i was totally missing your point. i thought you had a tokenized
field for searching and you just needed a non tokenized field for facets.

Hmmmmm....

Off the top of my head, i can't think of any cool way Solr can help you
with this.  My best bet on how to solve the problem in general would be an
analyzer that doesn't do any tokenizing, but does create multiple tokens
at the same position after "rotating" all of the words, ie for the
inputtext...

	"Dante Gabriel Rossetti"

create the following tokens, all at the same position...

	"Dante Gabriel Rossetti"
	"Gabriel Rossetti, Dante"
	"Rossetti, Dante Gabriel"

...and then keep using TermEnum.
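
A hedged sketch of what such a rotating analyzer could look like on the
Lucene 1.9-era TokenStream API; the class name and helper methods are
invented for illustration, and it keeps the original case, which is the
limitation Erik raises in his reply:

  import java.io.IOException;
  import java.io.Reader;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;

  // Hypothetical analyzer sketch: emits the whole field value plus one
  // "rotation" per word, all stacked at the same position, so a TermEnum
  // prefix walk can match on any word while returning a readable name.
  public class RotatingNameAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
      final String value = readFully(reader);   // e.g. "Dante Gabriel Rossetti"
      final Iterator rotations = rotations(value).iterator();
      return new TokenStream() {
        private boolean first = true;
        public Token next() {
          if (!rotations.hasNext()) return null;
          Token t = new Token((String) rotations.next(), 0, value.length());
          if (!first) t.setPositionIncrement(0);   // stack at the same position
          first = false;
          return t;
        }
      };
    }

    // "Dante Gabriel Rossetti" -> also "Gabriel Rossetti, Dante"
    //                             and  "Rossetti, Dante Gabriel"
    static List rotations(String value) {
      String[] words = value.trim().split("\\s+");
      List out = new ArrayList();
      out.add(value.trim());
      for (int i = 1; i < words.length; i++) {
        StringBuffer sb = new StringBuffer();
        for (int j = i; j < words.length; j++) {
          if (j > i) sb.append(' ');
          sb.append(words[j]);
        }
        sb.append(", ");
        for (int j = 0; j < i; j++) {
          if (j > 0) sb.append(' ');
          sb.append(words[j]);
        }
        out.add(sb.toString());
      }
      return out;
    }

    private static String readFully(Reader reader) {
      StringBuffer sb = new StringBuffer();
      try {
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
      return sb.toString();
    }
  }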

: Here's a concrete multivalued example... a work has two agents "Otis
: Hatcher" and "Erik Gospodnetic".  The user types "g" and "Erik
: Gospodnetic (2)" pops up, or types "o" and "Otis Hatcher (1)" pops

that should still work with an analyzer like i described, correct? you just
get the part of the agent's name the user was type-completing first .. not
sure if that's a deal breaker.



-Hoss


Re: solr-suggestion - terms that "start with"...

Posted by Yonik Seeley <ys...@gmail.com>.
On 5/17/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> how does searcher.numDocs() compare to using docSet.intersectionSize(constraintDocSet)?

No current performance difference... numDocs() is slightly more
abstracted so that something like an intersection-cache could be added
in the future if needed, or alternate ways of taking intersections
could be done (non-cached simple term queries could use the same
strategy as ConjunctionScorer, etc).
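
A tiny illustrative fragment of that equivalence, reusing the searcher,
term, and constraintDocSet from Erik's handler earlier in the thread (not
a standalone method):

  // today these two compute the same count; numDocs() just leaves room
  // for Solr to optimize or cache the intersection later
  int viaNumDocs   = searcher.numDocs(new TermQuery(term), constraintDocSet);
  int viaIntersect = searcher.getDocSet(new TermQuery(term))
                             .intersectionSize(constraintDocSet);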

It must have been late when I came up with the name "numDocs()" ;-)
And I should have probably generalized it to
numDocs(Collection<Query>, Collection<DocSet>) or something like a
ChainedDocSet...

-Yonik

Re: solr-suggestion - terms that "start with"...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 16, 2006, at 10:47 PM, Chris Hostetter wrote:
> : I've just improved the code to be a better DocSet citizen and it now
> : does this:
> :
> : 	      BitDocSet constraintDocSet = new BitDocSet(constraintMask);
> :                ...
> :                map.put(term.text(), docSet.intersectionSize(constraintDocSet));
>
> where does your constraintMask come from? ... if it's a BitSet  
> you
> are building up by executing a bunch of queries, getting their  
> DocSets,
> asking those DocSets for their bits, and then unioning/intersecting  
> them
> then that's probably the best place where there's likely to be benefit
> from Solr that you aren't taking advantage of already (except that  
> i seem
> to recall you wanting to do things that DocSets don't currently  
> support:
> like invert .. so maybe this is the best way)

Yeah, I'm building up constraintMask after refactoring to use  
Solr's caching - so it's still BitSets within my FacetCache, but  
these are all pre-loaded during warming.  I'll eventually refactor it  
further to use DocSets, but for now the speed and memory usage are  
all more than acceptable (we have hundreds, not thousands or  
millions, of facet values).

> Off the cuff: the one thing i would do differently if it were me, is...
>
>   BitDocSet constraintDocSet = new BitDocSet(constraintMask);
>   ...
>   if (term != null && term.field().equals(facet) && term.text().startsWith(prefix)) {
>      map.put(term.text(), searcher.numDocs(new TermQuery(term),
>                                            constraintDocSet));
>   } else {
>   ...
>
> ...there's no performance gain, but it makes your code a little  
> cleaner.

how does searcher.numDocs() compare to using docSet.intersectionSize(constraintDocSet)?

> As for issue of how you get the values based on your prefix, i  
> would keep
> using a TermEnum, but build it on a field that isn't tokenized.

That is currently how I have it set up.  The agent field is not  
tokenized.  However, I need it to be.  Here's a concrete example.   
"Dante Gabriel Rossetti" is one of the agents in our system.  Users  
should be able to find him by typing either "d", "g", or "r" (case  
insensitive) and they'd see "Dante Gabriel Rossetti (42)" in the  
suggest popup where 42 is the number of documents he's involved in  
given the constraints.

> : Oh, one other wrinkle to getting the stored field value is that the
> : agent field is multi-valued, so several people could collaborate and
> : have their individual names associated with a work.  So there are
>
> this won't be a problem with the multiValued="true" option ... it does
> what you expect regardless of whether the field is text, string, integer,
> tokenized/non-tokenized.
>
> (well, it does what *I* expect ... if you expect something and it  
> doesn't
> do that -- let us know)

Here's a concrete multivalued example... a work has two agents "Otis  
Hatcher" and "Erik Gospodnetic".  The user types "g" and "Erik  
Gospodnetic (2)" pops up, or types "o" and "Otis Hatcher (1)" pops  
up.  So still not quite there - looks like I'll have to walk TermDocs  
and get the stored agent fields, but even then that's not refined  
enough as I wouldn't know which agent field in the array of stored  
values was the match.

	Erik


Re: solr-suggestion - terms that "start with"...

Posted by Chris Hostetter <ho...@fucit.org>.
: I've just improved the code to be a better DocSet citizen and it now
: does this:
:
: 	      BitDocSet constraintDocSet = new BitDocSet(constraintMask);
:                ...
:                map.put(term.text(), docSet.intersectionSize(constraintDocSet));

where does your constraintMask come from? ... if it's a BitSet you
are building up by executing a bunch of queries, getting their DocSets,
asking those DocSets for their bits, and then unioning/intersecting them
then that's probably the best place where there's likely to be benefit
from Solr that you aren't taking advantage of already (except that i seem
to recall you wanting to do things that DocSets don't currently support:
like invert .. so maybe this is the best way)

Off the cuff: the one thing i would do differently if it were me, is...

  BitDocSet constraintDocSet = new BitDocSet(constraintMask);
  ...
  if (term != null && term.field().equals(facet) && term.text().startsWith(prefix)) {
     map.put(term.text(), searcher.numDocs(new TermQuery(term),
                                           constraintDocSet));
  } else {
  ...

...there's no performance gain, but it makes your code a little cleaner.


As for the issue of how you get the values based on your prefix, i would keep
using a TermEnum, but build it on a field that isn't tokenized.  with a
copyField this becomes really easy, if this is your current "agent"
field...

  <fieldtype name="text" class="solr.TextField" ...
  <field name="agent" type="text" indexed="true" stored="true"
                      multiValued="true"/>

..then assuming you already have...
  <fieldtype name="string" class="solr.StrField" />
...add...
  <field name="agentRAW" type="string" indexed="true" stored="false"
                         omitNorms="true" multiValued="true" />
...and farther down...
  <copyField source="agent" dest="agentRAW"/>

...and then make your TermEnum on the "agentRAW" field.
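
Putting the pieces together, a hedged sketch of how the handler loop could
look once it enumerates the untokenized copyField.  The variable names
(reader, searcher, constraintDocSet, prefix, rsp) are assumed from Erik's
custom request handler quoted at the top of the thread, and the "agents"
response key is illustrative:

  TreeMap map = new TreeMap();
  try {
    TermEnum agents = reader.terms(new Term("agentRAW", prefix));
    try {
      do {
        Term term = agents.term();
        if (term == null || !"agentRAW".equals(term.field())
            || !term.text().startsWith(prefix)) {
          break;
        }
        // term.text() is the whole untokenized value, original case intact
        map.put(term.text(), searcher.numDocs(new TermQuery(term), constraintDocSet));
      } while (agents.next());
    } finally {
      agents.close();
    }
  } catch (IOException e) {
    rsp.setException(e);
    return;
  }
  rsp.add("agents", map);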

: Oh, one other wrinkle to getting the stored field value is that the
: agent field is multi-valued, so several people could collaborate and
: have their individual names associated with a work.  So there are

this won't be a problem with the multiValued="true" option ... it does
what you expect regardless of whether the field is text, string, integer,
tokenized/non-tokenized.

(well, it does what *I* expect ... if you expect something and it doesn't
do that -- let us know)


-Hoss


Re: solr-suggestion - terms that "start with"...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 16, 2006, at 9:19 PM, Erik Hatcher wrote:

> User story: We have a lot of people's names in our data ("agents"  
> that in some way contributed to a 19th century work).  We're  
> refactoring our user interface to have a better navigation of these  
> names, such that someone can just start typing and immediately  
> (google-suggest style) see terms and their document frequency  
> within a set of filters.  Someone types "yo", pauses, and "Yonik  
> Seeley (37)" appears.  Also it would appear if someone typed "see".
>
> Falling back on my Lucene know-how, I've gotten Solr to respond  
> with almost what I need using this code:
>
>       TreeMap map = new TreeMap();
>       String prefix = req.getParam("prefix");
>
>       try {
>         TermEnum enumerator = reader.terms(new Term(facet, prefix));
>
>         do {
>           Term term = enumerator.term();
>           if (term != null && term.field().equals(facet) &&  
> term.text().startsWith(prefix)) {
>             DocSet docSet = searcher.getDocSet(new TermQuery(term));
>             BitSet bits = docSet.getBits();
>             bits.and(constraintMask);
>             map.put(term.text(), bits.cardinality());
>           } else {
>             break;
>           }
>         }
>         while (enumerator.next());
>       } catch (IOException e) {
>         rsp.setException(e);
>         numErrors++;
>         return;
>       }
>
>       rsp.add(facet, map);
>
> I'm going on gut feeling that Solr provides some handy benefits for  
> me in this regard.  For quick-and-dirty's sake I used DocSet.getBits()
> and did things the way I know how in order to AND it with an  
> existing constraintMask BitSet (built earlier in my custom request  
> handler based on constraint parameters passed in).

I've just improved the code to be a better DocSet citizen and it now  
does this:

	      BitDocSet constraintDocSet = new BitDocSet(constraintMask);
               ...
               map.put(term.text(), docSet.intersectionSize(constraintDocSet));

Oh, one other wrinkle to getting the stored field value is that the  
agent field is multi-valued, so several people could collaborate and  
have their individual names associated with a work.  So there are  
multiple Lucene stored field values for the "agent" field.  I'm  
guessing that the best way to do this sort of thing is to index just  
these fields into a separate set of documents and query only those.   
Thoughts?

Thanks,
	Erik