Posted to solr-user@lucene.apache.org by lee carroll <le...@googlemail.com> on 2010/12/02 09:55:54 UTC
SOLR Thesaurus
Hi List,
Coming to the end of a prototype evaluation of SOLR (all very good etc. etc.)
Getting to the point of looking at bells and whistles. Does SOLR have a
thesaurus? Can't find any reference
to one in the docs or on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesaurus)
I mean something like:
PT: xxxx
BT: xxx,xxxx,xxxx
NT: xxx,xxxx,xxxx
RT: xxx,xxxx,xxxx
Scope Note: xxxxxx,xxxx
Like I say, bells and whistles
cheers Lee
RE: SOLR Thesaurus
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Lee,
Can you describe your thesaurus format (it's not exactly self-descriptive) and how you would like it to be applied?
I gather you're referring to a thesaurus feature in another product (or product class)? Maybe if you describe that it would help too.
Steve
Re: SOLR Thesaurus
Posted by Jonathan Rochkind <ro...@jhu.edu>.
No, it doesn't. And it's not entirely clear what (if any) simple way
there is to use Solr to expose hierarchically related documents in a
way that preserves and usefully allows navigation of the relationships.
At least in general, for sophisticated stuff.
Re: SOLR Thesaurus
Posted by lee carroll <le...@googlemail.com>.
Two Peters (or rather a stupid English bloke who can't work out how to type
fancy accents :-)
Sorry Péter (took me 10 minutes to work out I could cut and paste), my reply
was to the clustering post by Peter Sturge. Clustering sounds great, but
being able to define a thesaurus scheme exactly would be good too.
Re: SOLR Thesaurus
Posted by Péter Király <ki...@gmail.com>.
Hi Lee,
In my vision the user could decide which relationship types
he would like to attach to his search, and the application would call
his attention to other possibilities. So no heuristic
method would be applied, because e.g. broader terms would cause lots of
misleading results.
Péter
Re: SOLR Thesaurus
Posted by lee carroll <le...@googlemail.com>.
Hi Peter,
That's way too clever for me :-)
Discovering thesaurus relationships would be fantastic, but it's not clear
what heuristics you would need to use to discover broader, narrower, related
documents etc. Although I might be doing clustering down, I'm sceptical
about the accuracy.
cheers Lee c
Re: SOLR Thesaurus
Posted by Peter Sturge <pe...@gmail.com>.
Hi Lee,
Perhaps Solr's clustering component might be helpful for your use case?
http://wiki.apache.org/solr/ClusteringComponent
Re: SOLR Thesaurus
Posted by Chris Hostetter <ho...@fucit.org>.
: The question asked, in good faith, was does solr support or extend to
: implementing a thesaurus. It looks like it does not which is fine. It does
Well, my point was that "thesaurus" is not a feature description. It's a
data structure, and depending on your goals, the existing SynonymFilter
may be perfectly usable out of the box.
: Use case 1: improve facets
:
: Motivation
: Unstructured lists of labels in facets offer very poor user experience.
: Similar to tag clouds users find them arbitrary, with out focus and often
: overwhelming. Labels in facets which are grouped in meaningful ways relevant
: to the user increase engagement, perceived relevance and user satisfaction.
SynonymFilter could definitely be used to help in this situation -- if you
create a synonyms.txt file mapping all of the terms in your thesaurus to
your Preferred Term you could then use SynonymFilter at index time to get a
clean list of facet constraints (if you want a simple list of only the
Preferred Terms).
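As a rough illustration of that mapping, here is a minimal Python sketch that writes rules in the explicit-mapping form of synonyms.txt ("a, b => c"). The THESAURUS dictionary, its layout, and the decision to fold narrower terms into the preferred term are assumptions made for the example; the terms are borrowed from the ski example in this thread.

```python
# Hypothetical in-memory thesaurus: preferred term -> terms that should
# collapse to it. The layout is an assumption made for illustration.
THESAURUS = {
    "skiing": ["ski", "down hill skiing", "telemark"],
}


def to_synonym_rules(thesaurus):
    """Build one explicit-mapping rule ("a, b => c") per preferred term."""
    rules = []
    for preferred, variants in sorted(thesaurus.items()):
        # Drop duplicates but keep the original order of the variants.
        unique = list(dict.fromkeys(variants))
        rules.append(", ".join(unique) + " => " + preferred)
    return rules


if __name__ == "__main__":
    # Redirect this output into a synonyms.txt file.
    print("\n".join(to_synonym_rules(THESAURUS)))
```

With rules like these applied at index time, every variant is indexed as its preferred term, so the facet field only ever contains preferred terms.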
Alternately...
: Solution
: A thesaurus of term relationships could be used to group facet labels
:
: Implementation
: (er completely out of my depth at this point)
: Thesaurus relationships defined in a simple text file
: term, bt=>term,term nt=> term, term rt=>term, term, pt=>term
: if a search specifies a facet to be returned the field terms are identified
: by reading the thesaurus into groups, broader terms, narrower terms, related
: terms etc
: These groups are returned as part of the response for the UI to display
: faceted labels as broader, narrower, related terms etc
...what you're describing is a hierarchical faceting model. With a
properly structured synonyms.txt used at indexing time and the
"hierarchy" trick I describe on slide #32-25 of this presentation...
http://people.apache.org/~hossman/apachecon2010/facets/
...that should also be possible.
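The slides are not reproduced here, so the following is only a guess at the kind of scheme meant: a common way to fake hierarchy in a flat facet field is to index one depth-prefixed path token per level and then drill down with the facet.prefix parameter. The function names and the Sports > Mountain Sports > Skiing chain are assumptions for illustration.

```python
def hierarchy_tokens(path):
    """Turn ["Sports", "Mountain Sports", "Skiing"] into depth-prefixed tokens."""
    return [
        "{}/{}".format(depth, "/".join(path[: depth + 1]))
        for depth in range(len(path))
    ]


def child_prefix(token):
    """Return the prefix matching tokens one level below the given token."""
    depth, _, rest = token.partition("/")
    return "{}/{}/".format(int(depth) + 1, rest)


if __name__ == "__main__":
    tokens = hierarchy_tokens(["Sports", "Mountain Sports", "Skiing"])
    print(tokens)                   # values to index in the facet field
    print(child_prefix(tokens[0]))  # value to pass as facet.prefix
```

Each document would carry all of its tokens in a multivalued facet field. Faceting with the prefix "0/" then lists the top level, and the prefix returned by child_prefix lists only the children of the selected node.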
: Implementation
: (again completely out of depth here)
: Allow terms in the index to be identified as bt , nt, .. terms of the search
: term. Allow query parser to boost terms differentially based on these
: thesaurus relationships
See my earlier reply to Péter Király; what you are describing is only
slightly more complicated than what I describe there ... this is
definitely something that would require a custom QParser, but the heavy
lifting could still be done by SynonymFilter (in the case you describe,
you'd just need to split your thesaurus up into distinct mapping files for
BT, NT, etc., and then have one SynonymFilter for each, and apply the
appropriate boost to the queries generated from them).
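To make the per-relation boosting idea concrete, here is a small Python sketch. It is not a QParser and does not use SynonymFilter: it swaps in plain client-side string building to produce a Lucene-syntax query, purely to show the shape of the result. The MAPPINGS and BOOSTS tables, the field name "text", and the boost values are all assumptions for illustration.

```python
# Hypothetical per-relation mappings, i.e. the thesaurus split into one
# lookup per relationship type (terms from the ski example in this thread).
MAPPINGS = {
    "pt": {"ski": ["skiing"]},
    "nt": {"ski": ["telemark", "cross country"]},
    "bt": {"ski": ["mountain sports"]},
}

# Assumed boosts: illustrative numbers, not recommendations.
BOOSTS = {"pt": 1.0, "nt": 0.7, "bt": 0.3}


def quote(term):
    """Wrap multi-word terms in quotes so they are treated as phrases."""
    return '"{}"'.format(term) if " " in term else term


def expand(term, field="text"):
    """Return a Lucene-syntax query ORing the term with boosted relatives."""
    clauses = ["{}:{}".format(field, quote(term))]
    for relation in ("pt", "nt", "bt"):
        for related in MAPPINGS[relation].get(term, []):
            clauses.append(
                "{}:{}^{}".format(field, quote(related), BOOSTS[relation])
            )
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand("ski"))
```

Keeping one mapping per relation is what allows a different boost per relation; a single merged synonym list would lose that distinction.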
: Again though just to repeat this is hardly a killer for us. We've looked at
: solr for a project; created a proto type; generated tons of questions, had
: them answered in the main by the docs, some on this list and been amazed at
: the fantastic results solr has given us. In fact with a combination of
: keepwords and synonyms we have got a pretty nice simple set of facet labels
: anyway (my motivation for the original question), so our corpus at the
: moment does not really need a thesaurus! :-)
Glad to hear it -- just didn't want you to think that something wasn't
available just because you couldn't find a feature with a specific name --
what you get "out of the box" can be used in a lot of interesting ways if
you think "out of the box".
-Hoss
Re: SOLR Thesaurus
Posted by lee carroll <le...@googlemail.com>.
Hi Chris,
It's all a bit early in the morning for this mind :-)
The question asked, in good faith, was does solr support or extend to
implementing a thesaurus. It looks like it does not which is fine. It does
support synonyms and synonym rings which is again fine. The ski example was
an illustration in response to a follow up question for more explanation on
what a thesaurus is.
An attempt at an answer to "why a thesaurus?" is below.
Use case 1: improve facets
Motivation
Unstructured lists of labels in facets offer a very poor user experience.
Similar to tag clouds, users find them arbitrary, without focus and often
overwhelming. Labels in facets which are grouped in meaningful ways relevant
to the user increase engagement, perceived relevance and user satisfaction.
Solution
A thesaurus of term relationships could be used to group facet labels
Implementation
(er, completely out of my depth at this point)
Thesaurus relationships defined in a simple text file:
term, bt=>term,term nt=> term, term rt=>term, term, pt=>term
If a search specifies a facet to be returned, the field terms are sorted
into groups (broader terms, narrower terms, related terms etc.) by reading
the thesaurus.
These groups are returned as part of the response for the UI to display
facet labels as broader, narrower, related terms etc.
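A minimal Python sketch of that grouping step, assuming the thesaurus entry for the search term has already been loaded and the facet counts are already back from Solr. The ENTRY layout, the function name, and the sample counts are made up for illustration; only the terms come from the ski example.

```python
# Hypothetical thesaurus entry for one term, using the ski example from
# this thread; the dictionary layout is an assumption.
ENTRY = {
    "bt": ["mountain sports", "sports"],
    "nt": ["down hill skiing", "telemark", "cross country"],
    "rt": ["snow boarding", "winter holidays"],
}


def group_facets(facet_counts, entry):
    """Bucket (label, count) facet pairs by their relationship to the term."""
    relation_of = {
        label: relation for relation, labels in entry.items() for label in labels
    }
    groups = {relation: [] for relation in entry}
    groups["other"] = []  # labels the thesaurus says nothing about
    for label, count in facet_counts:
        groups[relation_of.get(label, "other")].append((label, count))
    return groups


if __name__ == "__main__":
    # Made-up facet counts, only to show the shape of the result.
    counts = [("telemark", 12), ("sports", 40), ("hotels", 7)]
    print(group_facets(counts, ENTRY))
```

The UI can then render one block per relationship type instead of a single flat list of labels.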
Use case 2: Increase synonym search precision
Motivation
Synonym rings do not allow differences between synonyms to be identified. Rarely
are synonyms exactly equivalent. This leads to a decrease in search
precision.
Solution
Boost queries based on search term thesaurus relationships
Implementation
(again, completely out of my depth here)
Allow terms in the index to be identified as BT, NT, ... terms of the search
term. Allow the query parser to boost terms differentially based on these
thesaurus relationships.
As for the X and Y stuff I'm not sure; like I say, it's quite early in the
morning for me. I'm sure there may well be a different way of achieving the
above (but note it is more than a hierarchy). However, librarians have
been doing this for 50 years now.
Again though, just to repeat, this is hardly a killer for us. We've looked at
Solr for a project; created a prototype; generated tons of questions, had
them answered in the main by the docs, some on this list, and been amazed at
the fantastic results Solr has given us. In fact, with a combination of
keepwords and synonyms we have got a pretty nice simple set of facet labels
anyway (my motivation for the original question), so our corpus at the
moment does not really need a thesaurus! :-)
Thanks Lee
On 9 December 2010 23:38, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : A term can have a Preferred Term (PT), many Broader Terms (BT), many
> : Narrower Terms (NT), Related Terms (RT) etc.
> ...
> : User supplied term is, say: Ski
> :
> : Preferred term: Skiing
> : Broader terms could be: Ski and Snow Boarding, Mountain Sports, Sports
> : Narrower terms: down hill skiing, telemark, cross country
> : Related terms: boarding, snow boarding, winter holidays
>
> I'm still lost.
>
> You've described a black box with some sample input ("Ski") and some
> corresponding sample output (PT=..., BT=..., NT=..., RT=....) -- but you
> haven't explained what you want to do with that black box. Assuming such a
> black box existed in Solr, what are you expecting/hoping to do with it?
> How would such a black box modify Solr's user experience? What is your
> goal?
>
> Smells like an XY Problem...
> http://people.apache.org/~hossman/#xyproblem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue. Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
> -Hoss
>
Re: SOLR Thesaurus
Posted by "l.dellipaoli" <l....@reply.it>.
Hi!
I'm facing a similar problem.
I'm dealing with Thesauri on an Oracle RDBMS and I'm trying to integrate
Solr in order to speed up search operations.
Did you succeed in this integration?
Thanks,
Laura
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: SOLR Thesaurus
Posted by Péter Király <ki...@gmail.com>.
Hi Chris,
thanks for your description. I should think about this a little bit
more, then I will ask about some details. The main problem is that synonyms
are one kind of relation, and a thesaurus may contain 6-10 kinds of
relations. And it depends on the user which types of relations
he would like to use in a similar fashion to synonyms.
Péter
2010/12/10 Chris Hostetter <ho...@fucit.org>:
Re: SOLR Thesaurus
Posted by Chris Hostetter <ho...@fucit.org>.
: My imaginative use case:
: - the user enters a term and maybe he turns on a flag to get not just
: the term, but all terms which are somehow related to it (usually the
: synonyms and narrower terms).
: - Solr first finds the queried term(s) in the thesaurus, then finds the
: related terms, modifies and issues the query
: e.g. query is fruits, and it becomes (fruit OR apple OR banana OR ...)
:
: This use case is different from the synonym handler, which - as far as
: I know - modifies the index, and injects synonyms at the position of
: the original word. My use case supposes that we maintain the thesaurus as
: a different "database" (maybe another Solr index).
The use case you describe *could* be solved using the SynonymFilter -- you
can configure it to be used at query time (for query expansion) *or* you
can configure it to be used at index time (for reduction or expansion).
Just express your thesaurus in the synonyms.txt format and configure it in
your schema.xml.
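As a rough illustration of expressing a thesaurus in the synonyms.txt format, the sketch below flattens one entry into an explicit `=>` mapping line. The `THESAURUS` dictionary, its relation keys and the helper name `to_synonym_rules` are hypothetical; the sample terms are borrowed from the Ski example elsewhere in this thread. It only shows the shape of the file, not a complete export.

```python
# Hypothetical in-memory thesaurus entry; in practice this would be exported
# from whatever system maintains the thesaurus.
THESAURUS = {
    "ski": {
        "PT": ["skiing"],
        "NT": ["downhill skiing", "telemark", "cross country"],
    },
}

def to_synonym_rules(thesaurus, relations=("PT", "NT")):
    """Return synonyms.txt lines mapping each term to itself plus its related terms."""
    rules = []
    for term, entry in sorted(thesaurus.items()):
        targets = [term]
        for rel in relations:
            targets.extend(entry.get(rel, []))
        # "a => a, b, c" is an explicit mapping: a is replaced by the right-hand side.
        rules.append(f"{term} => {', '.join(targets)}")
    return rules

rules = to_synonym_rules(THESAURUS)
print("\n".join(rules))
```

Which relation types are written out is a choice made here through the `relations` argument; broader and related terms are left out on the assumption that they would widen the query too much.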
The two gotchas to watch out for with this kind of approach are multiword
synonyms and the way Lucene's QueryParser treats whitespace as a
metacharacter.
In general, if you're going to do this kind of major query expansion, you
probably want to use something like the "FieldQParser", which doesn't treat
whitespace as special, so user input like...
United States
...makes it to the analyzer as one chunk of text, and can be looked up as
is in your thesaurus.
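The whitespace point can be seen in a few lines of plain Python. This is not the Lucene query parser, only a sketch of the effect: once the input has been split on whitespace, a multiword thesaurus entry can no longer be found. The `RELATED` table and both function names are made up for the illustration.

```python
# Hypothetical thesaurus keyed by the full, lower-cased term.
RELATED = {"united states": ["usa", "united states of america"]}

def lookup_per_token(user_input):
    """Split on whitespace first, then look up each token on its own."""
    hits = []
    for token in user_input.lower().split():
        hits.extend(RELATED.get(token, []))
    return hits

def lookup_whole_input(user_input):
    """Treat the whole input as one chunk of text and look it up as is."""
    return RELATED.get(" ".join(user_input.lower().split()), [])

print(lookup_per_token("United States"))    # neither "united" nor "states" has an entry
print(lookup_whole_input("United States"))  # the multiword entry is found
```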
The multiword synonym issue is more complicated - I don't have the energy
to fully explain it right now, but for query time expansion it can be a
real pain in the ass. One workaround is to index shingle-esque terms
instead of the individual words in your synonyms, but that defeats the
point of your goal of having an external thesaurus that can be modified
independently of the index.
My suggestion would be to write a simple little ThesaurusQParser that can
use an instance of the SynonymFilter directly to preprocess the input
text to get a list of all the Related Terms, and then delegate to another
QParser to generate an appropriate Query for each of them (typically a
PhraseQuery), which your ThesaurusQParser would then combine into a giant
BooleanQuery (except you may want to consider a DisjunctionMaxQuery
instead because of the scoring factors).
ThesaurusQParser would require very little code, because SynonymFilter
would be doing all the hard work.
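The flow of such a parser can be sketched in plain Python. This is not Solr or Lucene code and no such parser ships with Solr as far as this thread establishes; every name below is hypothetical. It expands the input through a lookup table, turns each term into a quoted phrase clause, and contrasts the two ways of combining clause scores: a sum, as a boolean OR of clauses does, versus the maximum plus a tie-breaker, as a disjunction-max query does.

```python
def expand_terms(user_input, related):
    """Return the input term followed by its related terms, without duplicates."""
    key = user_input.strip().lower()
    terms = [key]
    for term in related.get(key, []):
        if term not in terms:
            terms.append(term)
    return terms

def as_phrase_clauses(terms):
    """Quote every term so that multiword terms are matched as phrases."""
    return ['"' + t.replace('"', '\\"') + '"' for t in terms]

def boolean_style_score(clause_scores):
    """Sum of matching clauses: documents matching many variants score higher."""
    return sum(clause_scores)

def dismax_style_score(clause_scores, tie=0.0):
    """Best clause wins; the others contribute only through the tie-breaker."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)

RELATED = {"fruit": ["apple", "banana"]}  # hypothetical thesaurus data
clauses = as_phrase_clauses(expand_terms("Fruit", RELATED))
query = " OR ".join(clauses)
print(query)
```

With the sum, a document that happens to mention several expansions of the same concept outranks one that matches the user's own term strongly; taking the maximum avoids that, which is presumably the scoring concern behind the DisjunctionMaxQuery remark.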
-Hoss
Re: SOLR Thesaurus
Posted by Péter Király <ki...@gmail.com>.
I will also try to define the problem.
In the library world there are general and specialized thesauri,
which reveal the relations between concepts. The relations have types,
as Lee described: Preferred Term (PT), Broader Terms (BT), Narrower
Terms (NT), Related Terms (RT) and others. Some of these thesauri
cover lots of concepts, e.g. the Hungarian Common Thesaurus has more
than 60 000 concepts.
From a search perspective it would be good if this knowledge could be
used in search.
My imagined use case:
- the user enters a term and maybe turns on a flag to get not just
the term, but all terms that are somehow related to it (usually the
synonyms and narrower terms).
- Solr first finds the queried term(s) in the thesaurus, then finds the
related terms, then modifies and issues the query,
e.g. the query is fruits, and it becomes (fruit OR apple OR banana OR ...)
This use case is different from the synonym handler, which - as far as
I know - modifies the index, and injects synonyms at the position of
the original word. My use case supposes that we maintain the thesaurus as
a separate "database" (maybe another Solr index).
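A minimal sketch of this use case in plain Python, with the thesaurus held outside the index as a dictionary. The data, the function names and the decision to follow the chosen relations transitively are all assumptions made for the illustration; the relation codes are the ones used in this thread.

```python
# Hypothetical thesaurus kept outside the index; keys are relation codes.
THESAURUS = {
    "fruit": {"NT": ["apple", "banana"], "RT": ["vegetable"]},
    "apple": {"NT": ["granny smith"]},
}

def related_terms(term, relations, thesaurus, seen=None):
    """Collect terms reachable from `term` through the chosen relation types."""
    seen = [term] if seen is None else seen
    for rel in relations:
        for other in thesaurus.get(term, {}).get(rel, []):
            if other not in seen:
                seen.append(other)
                related_terms(other, relations, thesaurus, seen)
    return seen

def expand_query(term, relations=(), thesaurus=THESAURUS):
    """Return the term alone, or an OR group when expansion is switched on."""
    terms = related_terms(term, relations, thesaurus) if relations else [term]
    quoted = ['"%s"' % t if " " in t else t for t in terms]
    return quoted[0] if len(quoted) == 1 else "(" + " OR ".join(quoted) + ")"

print(expand_query("fruit"))                     # flag off: query unchanged
print(expand_query("fruit", relations=("NT",)))  # flag on: narrower terms added
```

The `relations` argument stands in for the user's flag: an empty tuple leaves the query alone, and each relation type can be switched on separately.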
My Solr knowledge is not deep enough to decide whether this use case
could be achieved with a combination of existing patches or contributed
modules.
If someone were to start such a project, I would happily contribute.
Péter
http://eXtensibleCatalog.org
Re: SOLR Thesaurus
Posted by Chris Hostetter <ho...@fucit.org>.
: a term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
: Terms (NT), Related Terms (RT) etc
...
: User supplied Term is say : Ski
:
: Preferred term: Skiing
: Broader terms could be : Ski and Snow Boarding, Mountain Sports, Sports
: Narrower terms: down hill skiing, telemark, cross country
: Related terms: boarding, snow boarding, winter holidays
I'm still lost.
You've described a black box with some sample input ("Ski") and some
corresponding sample output (PT=..., BT=..., NT=..., RT=....) -- but you
haven't explained what you want to do with that black box. Assuming such a
black box existed in Solr, what are you expecting/hoping to do with it?
How would such a black box modify Solr's user experience? What is your
goal?
Smells like an XY Problem...
http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue. Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341
-Hoss
Re: SOLR Thesaurus
Posted by lee carroll <le...@googlemail.com>.
Hi
Steven, yes sorry, I should have been more plain.
A term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
Terms (NT), Related Terms (RT) etc.
So
User supplied term is, say: Ski
Preferred term: Skiing
Broader terms could be: Ski and Snow Boarding, Mountain Sports, Sports
Narrower terms: down hill skiing, telemark, cross country
Related terms: boarding, snow boarding, winter holidays
Michael,
yes exactly, SKOS, although maybe without the overweening ambition to take
over the world.
By the sounds of it though, out of the box you get a simple (but pretty
effective) synonym list and ring. Anything more we'd need to write
ourselves, i.e. your thesaurus filter plus a change to the response, as
it would be good to suggest broader terms, narrower terms etc. to the UI.
No plugins out there?
On 2 December 2010 16:16, Michael Zach <za...@punkt.at> wrote:
Re: SOLR Thesaurus
Posted by Michael Zach <za...@punkt.at>.
Hello Lee,
these bells sound like "SKOS" ;o)
AFAIK Solr does not support thesauri, just plain flat synonym lists.
One could implement a thesaurus filter and put it at the end of the analyzer chain of Solr.
The filter would then do a thesaurus lookup for each token it receives and possibly
* expand the query
or
* kind of "stem" document tokens to some preferred variants according to the thesaurus
Maybe even take term relations from the thesaurus into account and boost queries or doc fields at index time.
Maybe have a look at http://poolparty.punkt.at/, a full-featured SKOS thesaurus management server.
It also provides web services which could feed such a Solr filter.
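The two behaviours of such a filter can be sketched in plain Python. This is not a Lucene TokenFilter and says nothing about how one would be wired into an analyzer chain. The `PREFERRED` map and both function names are hypothetical; a real filter would load its data from the thesaurus server.

```python
# Hypothetical map from non-preferred variants to the preferred term (PT).
PREFERRED = {"ski": "skiing", "skis": "skiing", "snowboard": "snow boarding"}

def normalize_tokens(tokens, preferred=PREFERRED):
    """Index-time use: replace each token by its preferred term when one is known."""
    return [preferred.get(tok.lower(), tok.lower()) for tok in tokens]

def expand_tokens(tokens, preferred=PREFERRED):
    """Query-time use: keep each token and add its preferred term as an alternative."""
    expanded = []
    for tok in tokens:
        tok = tok.lower()
        alternatives = [tok]
        if tok in preferred and preferred[tok] != tok:
            alternatives.append(preferred[tok])
        expanded.append(alternatives)
    return expanded

print(normalize_tokens(["Ski", "holidays"]))
print(expand_tokens(["Ski", "holidays"]))
```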
Kind regards
Michael