Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2006/05/17 03:19:51 UTC
solr-suggestion - terms that "start with"...
User story: We have a lot of people's names in our data ("agents" that
in some way contributed to a 19th century work). We're refactoring
our user interface to have better navigation of these names, such
that someone can just start typing and immediately (google-suggest
style) see terms and their document frequency within a set of
filters. Someone types "yo", pauses, and "Yonik Seeley (37)"
appears. It would also appear if someone typed "see".
Falling back on my Lucene know-how, I've gotten Solr to respond with
almost what I need using this code:
TreeMap map = new TreeMap();
String prefix = req.getParam("prefix");

try {
  TermEnum enumerator = reader.terms(new Term(facet, prefix));
  do {
    Term term = enumerator.term();
    if (term != null && term.field().equals(facet)
        && term.text().startsWith(prefix)) {
      DocSet docSet = searcher.getDocSet(new TermQuery(term));
      BitSet bits = docSet.getBits();
      bits.and(constraintMask);
      map.put(term.text(), bits.cardinality());
    } else {
      break;
    }
  } while (enumerator.next());
} catch (IOException e) {
  rsp.setException(e);
  numErrors++;
  return;
}

rsp.add(facet, map);
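The early break in that loop works because TermEnum walks terms in sorted order. The same prefix-walk can be sketched self-contained over a plain sorted map (plain Java only; the term counts are made up and there are no Lucene or Solr types here):

```java
import java.util.Map;
import java.util.TreeMap;

public class PrefixWalk {
    // Walk a sorted map of term -> docFreq, collecting entries that
    // start with the prefix, and stop at the first non-match: the keys
    // are sorted, so nothing after it can match (the same reason the
    // TermEnum loop above can break early).
    static TreeMap<String, Integer> termsStartingWith(
            TreeMap<String, Integer> terms, String prefix) {
        TreeMap<String, Integer> result = new TreeMap<>();
        // tailMap jumps straight to the first key >= prefix,
        // analogous to reader.terms(new Term(facet, prefix))
        for (Map.Entry<String, Integer> e : terms.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> terms = new TreeMap<>();
        terms.put("seeley", 37);
        terms.put("smith", 12);
        terms.put("yonik", 37);
        System.out.println(termsStartingWith(terms, "yo")); // {yonik=37}
    }
}
```

The tailMap call plays the role of seeking the TermEnum to the prefix; the break plays the role of stopping once the enumerated terms no longer share it.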
I'm going on gut feeling that Solr provides some handy benefits for
me in this regard. For quick-and-dirty's sake I used DocSet.getBits()
and did things the way I know how in order to AND it with an
existing constraintMask BitSet (built earlier in my custom request
handler based on constraint parameters passed in).
The thing I'm missing is retrieving the stored field value and using
that instead of term.text() in the data sent back to the client. In
the example mentioned above, I currently get back "yonik (37)" if
"yo" was sent in as a prefix. But I want the full stored field name,
not the analyzed tokens.
Advice on how to implement what I'm after using Solr's infrastructure
(or just Lucene's) is welcome.
Thanks,
Erik
Re: solr-suggestion - terms that "start with"...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 19, 2006, at 4:53 PM, Chris Hostetter wrote:
> : it has is case sensitivity. I could lowercase everything, but then
> : the terms the user sees will be in all lowercase and that simply
> : won't do for my scholarly audience :)
>
> picky, picky users.
Yeah, if it wasn't for them, I'd have it easy :)
> : It seems like what I really need is simply a separate index (or
> : rather a partition of the main Solr one) where a Document represents
> : an "agent", and do a PrefixQuery or TermEnum and get all unique
> : agents.
>
> i've let it roll around in my head for a few days, and i think that's
> exactly what i would do if it were me ... in fact, what you describe is
> pretty much exactly what i do for product categories, except that i think
> i store more metadata about each category than you would store about your
> "agents". what you really need is a way to search for agents by term or
> term prefix, get a list of matching agents, and then use each agent as a
> facet for your "works" .. I do the same thing, except my term queries are
> on the unique id for the category, and my "prefix" queries are for the
> null prefix (ie: look at all categories) .. then once i have a category, i
> have other data that helps me with further facets (ie: for digital
> cameras, "resolution" is a good facet).
In fact, I've implemented this locally as a custom SolrCache that
holds a RAMDirectory. I TermEnum the agents in the main index on
warm() and index all the agents into the RAMDirectory. It is working well.
> i could imagine the same extension eventually unfolding for your agents
> ... i don't know much about literary works, but if we transition it to art
> in general, you might have information for one artist about different
> "labels" that apply to the art he produced in his life (sculpture,
> painting, cubist, impressionist, modern, "blue period", etc..) and once
> your user has selected a specific artist, you could use the list of labels
> from a stored field of the artist's metadata doc to decide which facets to
> offer the user in refining further.
We have metadata out the wazoo for this stuff. We have "genres"
which is a categorization of the type of work like "Painting",
"Poetry", etc. We have agents classified into roles. The same
person could be the author of one work, and a figure in a painting of
another work, and the editor of another. So even within an agent the
user interface will display the breakdown of each agent by the
various roles. *whew*
> : Maybe I need to build some sort of term -> agent cache during
> : warming that makes this a no brainer?
>
> that's another way to go ... but if you make one doc per agent, then
> this is just a subset of the filter cache .... i personally love the
> filter cache :)
I opted for the RAMDirectory so I can leverage Lucene scoring for the
ordering of agents, rather than only alphabetical and frequency options.
Erik
Re: solr-suggestion - terms that "start with"...
Posted by Chris Hostetter <ho...@fucit.org>.
: it has is case sensitivity. I could lowercase everything, but then
: the terms the user sees will be in all lowercase and that simply
: won't do for my scholarly audience :)
picky, picky users.
: It seems like what I really need is simply a separate index (or
: rather a partition of the main Solr one) where a Document represents
: an "agent", and do a PrefixQuery or TermEnum and get all unique agents.
i've let it roll around in my head for a few days, and i think that's
exactly what i would do if it were me ... in fact, what you describe is
pretty much exactly what i do for product categories, except that i think
i store more metadata about each category than you would store about your
"agents". what you really need is a way to search for agents by term or
term prefix, get a list of matching agents, and then use each agent as a
facet for your "works" .. I do the same thing, except my term queries are
on the unique id for the category, and my "prefix" queries are for the
null prefix (ie: look at all categories) .. then once i have a category, i
have other data that helps me with further facets (ie: for digital
cameras, "resolution" is a good facet).
i could imagine the same extension eventually unfolding for your agents
... i don't know much about literary works, but if we transition it to art
in general, you might have information for one artist about different
"labels" that apply to the art he produced in his life (sculpture,
painting, cubist, impressionist, modern, "blue period", etc..) and once
your user has selected a specific artist, you could use the list of labels
from a stored field of the artist's metadata doc to decide which facets to
offer the user in refining further.
: Maybe I need to build some sort of term -> agent cache during warming
: that makes this a no brainer?
that's another way to go ... but if you make one doc per agent, then this
is just a subset of the filter cache .... i personally love the filter
cache :)
-Hoss
Re: solr-suggestion - terms that "start with"...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 17, 2006, at 2:18 PM, Chris Hostetter wrote:
> Off the top of my head, i can't think of any cool way Solr can help you
> with this. My best bet on how to solve the problem in general would be an
> analyzer that doesn't do any tokenizing, but does create multiple tokens
> at the same position after "rotating" all of the words, ie for the
> input text...
>
> "Dante Gabriel Rossetti"
>
> create the following tokens, all at the same position...
>
> "Dante Gabriel Rossetti"
> "Gabriel Rossetti, Dante"
> "Rossetti, Dante Gabriel"
>
> ...and then keep using TermEnum.
That's a pretty clever solution, actually. However, the one negative
it has is case sensitivity. I could lowercase everything, but then
the terms the user sees will be in all lowercase and that simply
won't do for my scholarly audience :)
I suppose the general way to do this, roughly, would be to tokenize
(StandardAnalyzer would be sufficient) the agent field, then while
TermEnum'ing look up all the Documents for each term, grab the "agent"
field, and display that - except for the fiddly bit about agent being
multivalued.
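That warm-time lookup idea can be sketched without any Lucene types: a hypothetical helper that maps lowercased tokens back to the full stored names, so the displayed term keeps its original case:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TokenToAgent {
    // Hypothetical warm-time structure: each analyzed (lowercased) token
    // maps back to the full stored agent names it came from, so typing
    // "rossetti" can display "Dante Gabriel Rossetti" with its original
    // case. A TreeSet keeps the display names in sorted order.
    static Map<String, Set<String>> buildIndex(Iterable<String> agents) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String agent : agents) {
            for (String token : agent.toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(agent);
            }
        }
        return index;
    }
}
```

The multivalued wrinkle shows up here too: one token can map to several agents, which is why the values are sets rather than single names.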
It seems like what I really need is simply a separate index (or
rather a partition of the main Solr one) where a Document represents
an "agent", and do a PrefixQuery or TermEnum and get all unique agents.
But partitioning the Solr index into various document types at this
point seems overkill - though it is an area I'd like to explore.
Maybe I need to build some sort of term -> agent cache during warming
that makes this a no brainer?
Thanks for all the excellent feedback.
And, I was successful in producing an auto-suggest lookup using the
non-tokenized case sensitive terms and performance was quite fine -
again this is a Ruby on Rails front-end hitting Solr, with the only
caching occurring on the Solr side of things currently.
Erik
Re: solr-suggestion - terms that "start with"...
Posted by Chris Hostetter <ho...@fucit.org>.
: That is currently how I have it set up. The agent field is not
: tokenized. However, I need it to be. Here's a concrete example.
: "Dante Gabriel Rossetti" is one of the agents in our system. Users
: should be able to find him by typing either "d", "g", or "r" (case
: insensitive) and they'd see "Dante Gabriel Rossetti (42)" in the
: suggest popup where 42 is the number of documents he's involved in
: given the constraints.
Ohhhh... i was totally missing your point. i thought you had a tokenized
field for searching and you just needed a non tokenized field for facets.
Hmmmmm....
Off the top of my head, i can't think of any cool way Solr can help you
with this. My best bet on how to solve the problem in general would be an
analyzer that doesn't do any tokenizing, but does create multiple tokens
at the same position after "rotating" all of the words, ie for the
input text...
"Dante Gabriel Rossetti"
create the following tokens, all at the same position...
"Dante Gabriel Rossetti"
"Gabriel Rossetti, Dante"
"Rossetti, Dante Gabriel"
...and then keep using TermEnum.
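The rotation rule itself (leaving aside the Analyzer plumbing) can be sketched in plain Java; NameRotations is a hypothetical helper, not a Solr or Lucene class:

```java
import java.util.ArrayList;
import java.util.List;

public class NameRotations {
    // Produce one "rotated" variant per word of the name, so a prefix
    // search against any variant matches any word. The variant starting
    // at word i lists words i..n-1, then a comma, then words 0..i-1.
    static List<String> rotations(String name) {
        String[] words = name.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < words.length; j++) {
                if (j > i) sb.append(' ');
                sb.append(words[j]);
            }
            if (i > 0) {
                sb.append(',');
                for (int j = 0; j < i; j++) {
                    sb.append(' ').append(words[j]);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

rotations("Dante Gabriel Rossetti") yields the three variants quoted above. In an actual Analyzer each variant after the first would be emitted as a token with a position increment of 0, putting them all at the same position.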
: Here's a concrete multivalued example... a work has two agents "Otis
: Hatcher" and "Erik Gospodnetic". The user types "g" and "Erik
: Gospodnetic (2)" pops up, or types "o" and "Otis Hatcher (1)" pops
that should still work with an analyzer like i described, correct? you
just get the part of the agent's name the user was type-completing
first .. not sure if that's a deal breaker.
-Hoss
Re: solr-suggestion - terms that "start with"...
Posted by Yonik Seeley <ys...@gmail.com>.
On 5/17/06, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> how does searcher.numDocs() compare to using
> docSet.intersectionSize(constraintDocSet)?
No current performance difference... numDocs() is slightly more
abstracted so that something like an intersection-cache could be added
in the future if needed, or alternate ways of taking intersections
could be done (non-cached simple term queries could use the same
strategy as ConjunctionScorer, etc).
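The count being discussed - docs matching a term, restricted to a constraint set - boils down to an AND plus a popcount. A sketch with plain java.util.BitSet standing in for Solr's DocSets:

```java
import java.util.BitSet;

public class IntersectionSize {
    // Count the docs present in both sets: a plain-BitSet analogue of
    // docSet.intersectionSize(constraintDocSet) or
    // searcher.numDocs(query, filter). Clone first so neither input is
    // mutated - the earlier getBits()/and() version ANDed directly into
    // whatever BitSet getBits() returned, which is risky if that BitSet
    // is shared with a cache.
    static int intersectionSize(BitSet termDocs, BitSet constraint) {
        BitSet copy = (BitSet) termDocs.clone();
        copy.and(constraint);
        return copy.cardinality();
    }
}
```

Whether Solr's intersectionSize actually copies or iterates differently is an implementation detail; the point of the sketch is just the contract both calls share.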
It must have been late when I came up with the name "numDocs()" ;-)
And I should have probably generalized it to
numDocs(Collection<Query>, Collection<DocSet>) or something like a
ChainedDocSet...
-Yonik
Re: solr-suggestion - terms that "start with"...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 16, 2006, at 10:47 PM, Chris Hostetter wrote:
> : I've just improved the code to be a better DocSet citizen and it now
> : does this:
> :
> : BitDocSet constraintDocSet = new BitDocSet(constraintMask);
> : ...
> : map.put(term.text(), docSet.intersectionSize(constraintDocSet));
>
> how are you building constraintMask? ... if it's a BitSet you
> are building up by executing a bunch of queries, getting their DocSets,
> asking those DocSets for their bits, and then unioning/intersecting
> them, then that's probably the best place where there's likely to be
> benefit from Solr that you aren't taking advantage of already (except
> that i seem to recall you wanting to do things that DocSets don't
> currently support: like invert .. so maybe this is the best way)
Yeah, I'm building up constraintMask using the refactoring to use
Solr's caching - so it's still BitSets within my FacetCache, but
these are all pre-loaded during warming. I'll eventually refactor it
further to use DocSets, but for now the speed and memory usage are
all more than acceptable (we have hundreds, not thousands or
millions, of facet values).
> Off the cuff: the one thing i would do differently if it were me, is...
>
>   BitDocSet constraintDocSet = new BitDocSet(constraintMask);
>   ...
>   if (term != null && term.field().equals(facet)
>       && term.text().startsWith(prefix)) {
>     map.put(term.text(), searcher.numDocs(new TermQuery(term),
>                                           constraintDocSet));
>   } else {
>   ...
>
> ...there's no performance gain, but it makes your code a little
> cleaner.
how does searcher.numDocs() compare to using
docSet.intersectionSize(constraintDocSet)?
> As for the issue of how you get the values based on your prefix, i
> would keep using a TermEnum, but build it on a field that isn't
> tokenized.
That is currently how I have it set up. The agent field is not
tokenized. However, I need it to be. Here's a concrete example.
"Dante Gabriel Rossetti" is one of the agents in our system. Users
should be able to find him by typing either "d", "g", or "r" (case
insensitive) and they'd see "Dante Gabriel Rossetti (42)" in the
suggest popup where 42 is the number of documents he's involved in
given the constraints.
> : Oh, one other wrinkle to getting the stored field value is that the
> : agent field is multi-valued, so several people could collaborate and
> : have their individual names associated with a work. So there are
>
> this won't be a problem with the multiValued="true" option ... it does
> what you expect regardless of whether the field is text, string,
> integer, tokenized/non-tokenized.
>
> (well, it does what *I* expect ... if you expect something and it
> doesn't do that -- let us know)
Here's a concrete multivalued example... a work has two agents "Otis
Hatcher" and "Erik Gospodnetic". The user types "g" and "Erik
Gospodnetic (2)" pops up, or types "o" and "Otis Hatcher (1)" pops
up. So still not quite there - looks like I'll have to walk TermDocs
and get the stored agent fields, but even then that's not refined
enough, as I wouldn't know which agent field in the array of stored
values was the match.
Erik
Re: solr-suggestion - terms that "start with"...
Posted by Chris Hostetter <ho...@fucit.org>.
: I've just improved the code to be a better DocSet citizen and it now
: does this:
:
: BitDocSet constraintDocSet = new BitDocSet(constraintMask);
: ...
: map.put(term.text(), docSet.intersectionSize(constraintDocSet));
how are you building constraintMask? ... if it's a BitSet you
are building up by executing a bunch of queries, getting their DocSets,
asking those DocSets for their bits, and then unioning/intersecting them,
then that's probably the best place where there's likely to be benefit
from Solr that you aren't taking advantage of already (except that i seem
to recall you wanting to do things that DocSets don't currently support:
like invert .. so maybe this is the best way)
Off the cuff: the one thing i would do differently if it were me, is...

  BitDocSet constraintDocSet = new BitDocSet(constraintMask);
  ...
  if (term != null && term.field().equals(facet)
      && term.text().startsWith(prefix)) {
    map.put(term.text(), searcher.numDocs(new TermQuery(term),
                                          constraintDocSet));
  } else {
  ...

...there's no performance gain, but it makes your code a little cleaner.
As for the issue of how you get the values based on your prefix, i would
keep using a TermEnum, but build it on a field that isn't tokenized.
with a copyField this becomes really easy. if this is your current
"agent" field...

  <fieldtype name="text" class="solr.TextField" ...
  <field name="agent" type="text" indexed="true" stored="true"
         multiValued="true"/>

..then assuming you already have...

  <fieldtype name="string" class="solr.StrField" />

...add...

  <field name="agentRAW" type="string" indexed="true" stored="false"
         omitNorms="true" multiValued="true" />

...and farther down...

  <copyField source="agent" dest="agentRAW"/>

...and then make your TermEnum on the "agentRAW" field.
: Oh, one other wrinkle to getting the stored field value is that the
: agent field is multi-valued, so several people could collaborate and
: have their individual names associated with a work. So there are
this won't be a problem with the multiValued="true" option ... it does
what you expect regardless of wether the field is text,string,integer,
tokenized/non-tokenized.
(well, it does what *I* expect ... if you exepct something and it doesn't
do that -- let us know)
-Hoss
Re: solr-suggestion - terms that "start with"...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 16, 2006, at 9:19 PM, Erik Hatcher wrote:
> User story: We have a lot of people's names in our data ("agents"
> that in some way contributed to a 19th century work). We're
> refactoring our user interface to have better navigation of these
> names, such that someone can just start typing and immediately
> (google-suggest style) see terms and their document frequency
> within a set of filters. Someone types "yo", pauses, and "Yonik
> Seeley (37)" appears. It would also appear if someone typed "see".
>
> Falling back on my Lucene know-how, I've gotten Solr to respond
> with almost what I need using this code:
>
> TreeMap map = new TreeMap();
> String prefix = req.getParam("prefix");
>
> try {
>   TermEnum enumerator = reader.terms(new Term(facet, prefix));
>   do {
>     Term term = enumerator.term();
>     if (term != null && term.field().equals(facet)
>         && term.text().startsWith(prefix)) {
>       DocSet docSet = searcher.getDocSet(new TermQuery(term));
>       BitSet bits = docSet.getBits();
>       bits.and(constraintMask);
>       map.put(term.text(), bits.cardinality());
>     } else {
>       break;
>     }
>   } while (enumerator.next());
> } catch (IOException e) {
>   rsp.setException(e);
>   numErrors++;
>   return;
> }
>
> rsp.add(facet, map);
>
> I'm going on gut feeling that Solr provides some handy benefits for
> me in this regard. For quick-and-dirty's sake I used DocSet.getBits()
> and did things the way I know how in order to AND it with an
> existing constraintMask BitSet (built earlier in my custom request
> handler based on constraint parameters passed in).
I've just improved the code to be a better DocSet citizen and it now
does this:

  BitDocSet constraintDocSet = new BitDocSet(constraintMask);
  ...
  map.put(term.text(), docSet.intersectionSize(constraintDocSet));
Oh, one other wrinkle to getting the stored field value is that the
agent field is multi-valued, so several people could collaborate and
have their individual names associated with a work. So there are
multiple Lucene stored field values for the "agent" field. I'm
guessing that the best way to do this sort of thing is to index just
these fields into a separate set of documents and query only those.
Thoughts?
Thanks,
Erik