Posted to solr-user@lucene.apache.org by lee carroll <le...@googlemail.com> on 2010/12/02 09:55:54 UTC

SOLR Thesaurus

Hi List,

Coming to the end of a prototype evaluation of SOLR (all very good etc. etc.).
Getting to the point of looking at bells and whistles. Does SOLR have a
thesaurus? Can't find any reference
to one in the docs or on the wiki etc. (apart from a few mail threads which
describe synonyms.txt as a thesaurus).

I mean something like:

PT: xxxx
BT: xxx,xxxx,xxxx
NT: xxx,xxxx,xxxx
RT: xxx,xxx,xxx
Scope Note: xxxxxx,xxxx

Like I say, bells and whistles.

cheers Lee

RE: SOLR Thesaurus

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Lee,

Can you describe your thesaurus format (it's not exactly self-descriptive) and how you would like it to be applied?

I gather you're referring to a thesaurus feature in another product (or product class)?  Maybe if you describe that it would help too.

Steve


Re: SOLR Thesaurus

Posted by Jonathan Rochkind <ro...@jhu.edu>.
No, it doesn't.  And it's not entirely clear what (if any) simple way
there is to use Solr to expose hierarchically related documents in a
way that preserves and usefully allows navigation of the relationships.
At least in general, for sophisticated stuff.


Re: SOLR Thesaurus

Posted by lee carroll <le...@googlemail.com>.
Two Peters (or rather a stupid English bloke who can't work out how to type
fancy accents :-)

Sorry Péter (took me 10 minutes to work out I could cut and paste), my reply
was to the clustering post by Peter Sturge. Clustering sounds great, but
being able to define a thesaurus scheme exactly would be good too.




Re: SOLR Thesaurus

Posted by Péter Király <ki...@gmail.com>.
Hi Lee,

according to my vision the user could decide which relationship types
he would like to attach to his search, and the application would call
his attention to other possibilities. So there would be no heuristic
method applied, because e.g. broader terms would cause lots of
misleading results.

Péter


Re: SOLR Thesaurus

Posted by lee carroll <le...@googlemail.com>.
Hi Peter,

That's way too clever for me :-)
Discovering thesaurus relationships would be fantastic, but it's not clear
what heuristics you would need to use to discover broader, narrower and related
documents etc. Although I might be doing the clustering down, I'm sceptical
about the accuracy.

cheers Lee c


Re: SOLR Thesaurus

Posted by Peter Sturge <pe...@gmail.com>.
Hi Lee,

Perhaps Solr's clustering component might be helpful for your use case?
http://wiki.apache.org/solr/ClusteringComponent





Re: SOLR Thesaurus

Posted by Chris Hostetter <ho...@fucit.org>.
: The question asked, in good faith, was does solr support or extend to
: implementing a thesaurus. It looks like it does not which is fine. It does

Well, my point was that "thesaurus" is not a feature description.  It's a
data structure, and depending on your goals, the existing SynonymFilter
may be perfectly usable out of the box.

: Use case 1: improve facets
: 
: Motivation
: Unstructured lists of labels in facets offer a very poor user experience.
: Similar to tag clouds, users find them arbitrary, without focus and often
: overwhelming. Labels in facets which are grouped in meaningful ways relevant
: to the user increase engagement, perceived relevance and user satisfaction.

SynonymFilter could definitely be used to help in this situation -- if you
create a synonyms.txt file mapping all of the terms in your thesaurus to
your Preferred Term, you could then use SynonymFilter at index time to get a
clean list of facet constraints (if you want a simple list of only the
Preferred Terms).
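
A minimal sketch of that mapping step, assuming the thesaurus is already available as a simple dict. It only generates lines in the "a, b => c" explicit-mapping style that synonyms.txt accepts; the `thesaurus` dict, its sample terms, and the helper name `to_synonym_rules` are made up for illustration and are not part of Solr.

```python
def to_synonym_rules(thesaurus):
    """Map every non-preferred term onto its preferred term.

    `thesaurus` is a hypothetical dict: preferred term -> list of variants.
    Returns lines in the "variant1, variant2 => preferred" style.
    """
    rules = []
    for preferred, variants in sorted(thesaurus.items()):
        # Entries without variants need no mapping rule.
        if not variants:
            continue
        rules.append(f"{', '.join(variants)} => {preferred}")
    return rules


# Sample data (assumed, loosely based on the ski example in this thread).
thesaurus = {
    "skiing": ["ski", "skis"],
    "snowboarding": ["boarding", "snow boarding"],
}

for line in to_synonym_rules(thesaurus):
    print(line)
```

The printed lines could be saved as a synonyms.txt and referenced from the analyzer of the facet field; multiword variants such as "snow boarding" depend on how that analyzer tokenizes the input.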

Alternately...

: Solution
: A thesaurus of term relationships could be used to group facet labels
: 
: Implementation
: (er, completely out of my depth at this point)
: Thesaurus relationships are defined in a simple text file:
: term, bt=>term,term nt=>term,term rt=>term,term pt=>term
: If a search specifies a facet to be returned, the field terms are sorted
: into groups (broader terms, narrower terms, related terms etc.) by reading
: the thesaurus.
: These groups are returned as part of the response for the UI to display
: facet labels as broader, narrower, related terms etc.

...what you're describing is a hierarchical faceting model.  With a
properly structured synonyms.txt used at indexing time and the
"hierarchy" trick I describe on slide #32-25 of this presentation...

http://people.apache.org/~hossman/apachecon2010/facets/

...that should also be possible.
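
One common way to flatten a hierarchy into a single facet field is to index a depth-prefixed path token for every level. Whether this is exactly the trick on the slides is an assumption; the sketch below only illustrates the general idea, and the function name and token layout are hypothetical.

```python
def hierarchy_tokens(path):
    """Return one facet token per level of a broader-to-narrower path.

    Each token is "<depth>/<joined path up to that level>", so all terms
    directly below a given node share a common prefix.
    """
    tokens = []
    for depth in range(len(path)):
        tokens.append(f"{depth}/" + "/".join(path[: depth + 1]))
    return tokens


# Broader-to-narrower chain taken from the ski example in this thread.
path = ["Sports", "Mountain Sports", "Skiing"]
print(hierarchy_tokens(path))
```

A UI could then list the children of "Sports" by asking only for facet values that start with "1/Sports/", for example via a prefix restriction such as Solr's facet.prefix parameter.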


: Implementation
: (again, completely out of my depth here)
: Allow terms in the index to be identified as BT, NT, ... terms of the search
: term. Allow the query parser to boost terms differentially based on these
: thesaurus relationships.

See my earlier reply to Péter Király; what you are describing is only
slightly more complicated than what I describe there.  This is
definitely something that would require a custom QParser, but the heavy
lifting could still be done by SynonymFilter: in the case you describe,
you'd just need to split your thesaurus up into distinct mapping files for
BT, NT, etc., have one SynonymFilter for each, and apply the
appropriate boost to the queries generated from them.
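
To make the "different boost per relationship" idea concrete, here is a rough client-side sketch. It is not the custom QParser described above: it simply builds a query string using the `^` boost and quoted-phrase syntax of the Lucene query parser. The relation data, the boost values, and the function names are all invented for the example.

```python
def quote(term):
    """Quote multiword terms so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def boosted_query(term, relations, boosts):
    """Build an OR query where each related term carries its relation's boost.

    `relations`: hypothetical dict, relation code -> list of terms.
    `boosts`: hypothetical dict, relation code -> boost factor.
    """
    clauses = [quote(term)]
    for code in sorted(relations):
        for related in relations[code]:
            clauses.append(f"{quote(related)}^{boosts[code]}")
    return " OR ".join(clauses)


# Assumed sample data: narrower terms count for more than broader ones.
relations = {"NT": ["telemark", "cross country"], "BT": ["mountain sports"]}
boosts = {"NT": 0.8, "BT": 0.3}
print(boosted_query("skiing", relations, boosts))
```

Keeping one mapping per relationship type, as suggested above, is what makes it possible to attach a separate boost to each group of expanded terms.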

: Again though, just to repeat, this is hardly a killer for us. We've looked at
: Solr for a project; created a prototype; generated tons of questions, had
: them answered in the main by the docs and some on this list; and been amazed at
: the fantastic results Solr has given us. In fact, with a combination of
: keepwords and synonyms we have got a pretty nice, simple set of facet labels
: anyway (my motivation for the original question), so our corpus at the
: moment does not really need a thesaurus! :-)

Glad to hear it -- just didn't want you to think that something wasn't
available just because you couldn't find a feature with a specific name --
what you get "out of the box" can be used in a lot of interesting ways if
you think "out of the box".

-Hoss

Re: SOLR Thesaurus

Posted by lee carroll <le...@googlemail.com>.
Hi Chris,

It's all a bit early in the morning for this mind :-)

The question asked, in good faith, was: does Solr support or extend to
implementing a thesaurus? It looks like it does not, which is fine. It does
support synonyms and synonym rings, which is again fine. The ski example was
an illustration in response to a follow-up question for more explanation of
what a thesaurus is.

An attempt at an answer to "why a thesaurus?" is below.

Use case 1: improve facets

Motivation
Unstructured lists of labels in facets offer a very poor user experience.
Similar to tag clouds, users find them arbitrary, without focus and often
overwhelming. Labels in facets which are grouped in meaningful ways relevant
to the user increase engagement, perceived relevance and user satisfaction.

Solution
A thesaurus of term relationships could be used to group facet labels

Implementation
(er, completely out of my depth at this point)
Thesaurus relationships are defined in a simple text file:
term, bt=>term,term nt=>term,term rt=>term,term pt=>term
If a search specifies a facet to be returned, the field terms are sorted
into groups (broader terms, narrower terms, related terms etc.) by reading
the thesaurus.
These groups are returned as part of the response for the UI to display
facet labels as broader, narrower, related terms etc.
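
The grouping step might look roughly like the sketch below, done here on the client after the facet labels come back. The `entry` structure, the sample labels, and the function name are assumptions for illustration; nothing in it is an existing Solr feature.

```python
def group_facet_labels(labels, entry):
    """Bucket facet labels by their relationship to one thesaurus entry.

    `entry`: hypothetical dict with keys "BT", "NT", "RT" -> lists of terms.
    Labels that are not in the entry end up under "other".
    """
    groups = {"BT": [], "NT": [], "RT": [], "other": []}
    # Case-insensitive lookup from term to relation code.
    lookup = {
        term.lower(): code
        for code in ("BT", "NT", "RT")
        for term in entry.get(code, [])
    }
    for label in labels:
        groups[lookup.get(label.lower(), "other")].append(label)
    return groups


# Thesaurus entry for "Skiing", using terms from the ski example.
entry = {
    "BT": ["Mountain Sports", "Sports"],
    "NT": ["telemark", "cross country"],
    "RT": ["snow boarding", "winter holidays"],
}
labels = ["Sports", "telemark", "winter holidays", "Golf"]
print(group_facet_labels(labels, entry))
```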

Use case 2: Increase synonym search precision

Motivation
Synonym rings do not allow differences between synonyms to be identified.
Synonyms are rarely exactly equivalent. This leads to a decrease in search
precision.

Solution
Boost queries based on search term thesaurus relationships

Implementation
(again, completely out of my depth here)
Allow terms in the index to be identified as BT, NT, ... terms of the search
term. Allow the query parser to boost terms differentially based on these
thesaurus relationships.



As for the X and Y stuff, I'm not sure; like I say, it's quite early in the
morning for me. I'm sure there may well be a different way of achieving the
above (but note it is more than a hierarchy). However, librarians have
been doing this for 50 years now.

Again though, just to repeat, this is hardly a killer for us. We've looked at
Solr for a project; created a prototype; generated tons of questions, had
them answered in the main by the docs and some on this list; and been amazed at
the fantastic results Solr has given us. In fact, with a combination of
keepwords and synonyms we have got a pretty nice, simple set of facet labels
anyway (my motivation for the original question), so our corpus at the
moment does not really need a thesaurus! :-)

Thanks Lee


On 9 December 2010 23:38, Chris Hostetter <ho...@fucit.org> wrote:

>
>
> : a term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
> : Terms (NT), Related Terms (RT) etc
>         ...
> : User supplied Term is say : Ski
> :
> : Preferred term: Skiing
> : Broader terms could be : Ski and Snow Boarding, Mountain Sports, Sports
> : Narrower terms: down hill skiing, telemark, cross country
> : Related terms: boarding, snow boarding, winter holidays
>
> I'm still lost.
>
> You've described a black box with some sample input ("Ski") and some
> corresponding sample output (PT=..., BT=..., NT=..., RT=....) -- but you
> haven't explained what you want to do with that black box.  Assuming such a
> black box existed in Solr, what are you expecting/hoping to do with it?
> How would such a black box modify Solr's user experience?  What is your
> goal?
>
> Smells like an XY Problem...
> http://people.apache.org/~hossman/#xyproblem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
> -Hoss
>

Re: SOLR Thesaurus

Posted by "l.dellipaoli" <l....@reply.it>.
Hi!
I'm facing a similar problem. 
I'm dealing with Thesauri on an Oracle RDBMS and I'm trying to integrate
Solr in order to speed up search operations.

Did you succeed in this integration?

Thanks, 
Laura




Re: SOLR Thesaurus

Posted by Péter Király <ki...@gmail.com>.
Hi Chris,

thanks for your description. I should think about this a little bit
more, then I will ask about some details. The main problem is that synonyms
are one kind of relation, whereas a thesaurus may contain 6-10 kinds of
relations. And it depends on the user which types of relations
he would like to use in a similar fashion to synonyms.

Péter

In reply to Chris Hostetter's message of 2010/12/10.

Re: SOLR Thesaurus

Posted by Chris Hostetter <ho...@fucit.org>.
: My imagined use case:
: - the user enters a term and maybe turns on a flag to get not just
: the term, but all terms that are somehow related to it (usually the
: synonyms and narrower terms).
: - Solr first finds the queried term(s) in the thesaurus, then finds the
: related terms, and modifies and issues the query,
: e.g. the query is fruits, and it becomes (fruit OR apple OR banana OR ...)
: 
: This use case is different from the synonym handler, which - as far as
: I know - modifies the index and injects synonyms at the position of
: the original word. My use case supposes that we maintain the thesaurus as
: a separate "database" (maybe another Solr index).

the use case you describe *could* be solved using the SynonymFilter -- you 
can configure it to be used at query time (for query expansion) *or* you 
can configure it to be used at index time (for reduction or expansion)

just express your thesaurus in the synonyms.txt format and configure it in 
your schema.xml
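
To make "express your thesaurus in the synonyms.txt format" a bit more concrete, here is a minimal Python sketch. It assumes the thesaurus is already available as a plain dict; the `THESAURUS` structure and the `to_synonym_lines` name are hypothetical and not part of Solr. It writes one explicit mapping line (`term => expansion, expansion`) per term, expanding to the term itself plus the chosen relation types.

```python
# Hypothetical in-memory thesaurus: term -> relation type -> related terms.
THESAURUS = {
    "fruit": {"NT": ["apple", "banana"], "RT": ["orchard"]},
    "skiing": {"NT": ["telemark"], "BT": ["sports"]},
}


def to_synonym_lines(thesaurus, relations=("NT",)):
    """Render each term as a 'lhs => rhs1, rhs2' explicit mapping line."""
    lines = []
    for term in sorted(thesaurus):
        expansions = [term]  # keep the original term in its own expansion
        for rel in relations:
            expansions.extend(thesaurus[term].get(rel, []))
        if len(expansions) > 1:  # skip terms with nothing to expand to
            lines.append(f"{term} => {', '.join(expansions)}")
    return lines


if __name__ == "__main__":
    # Print the lines that would go into a synonyms.txt file.
    print("\n".join(to_synonym_lines(THESAURUS)))
```

Passing a different `relations` tuple, e.g. `("NT", "RT")`, would produce a second file for a second field type, which is one possible way to keep relation types apart.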

The two gotchas to watch out for with this kind of approach are multiword 
synonyms and the way Lucene's QueryParser treats whitespace as a 
metacharacter.

In general, if you're going to do this kind of major query expansion, you 
probably want to use something like the "FieldQParser", which doesn't treat 
whitespace as special, so user input like...
	United States
...makes it to the analyzer as one chunk of text, and can be looked up as 
is in your thesaurus.
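
The whitespace gotcha can be illustrated without Solr at all. The following Python sketch uses a hypothetical `RELATED` dict and two made-up lookup functions: one splits the input on whitespace first (roughly what happens when the query parser treats whitespace as a metacharacter), the other hands the whole input over as one chunk.

```python
# Hypothetical thesaurus keyed by lowercased term; entries are illustrative.
RELATED = {
    "united states": ["usa", "america"],
    "united": ["joined"],
}


def lookup_per_word(user_input):
    """Mimic a parser that splits on whitespace before any lookup."""
    found = {}
    for word in user_input.lower().split():
        if word in RELATED:
            found[word] = RELATED[word]
    return found


def lookup_whole(user_input):
    """Look the input up as one chunk, with inner whitespace normalised."""
    key = " ".join(user_input.lower().split())
    return {key: RELATED[key]} if key in RELATED else {}


if __name__ == "__main__":
    print(lookup_per_word("United States"))  # only the single word "united" is found
    print(lookup_whole("United States"))     # the multiword entry is found
```

With per-word lookup the multiword entry is never seen, and the input is expanded with terms for "united" that have nothing to do with the country.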

The multiword synonym issue is more complicated - I don't have the energy 
to fully explain it right now, but for query time expansion it can be a 
real pain in the ass.  One workaround is to index shingle-esque terms 
instead of the individual words in your synonyms, but that defeats the 
point of your goal of having an external thesaurus that can be modified 
independently of the index.

My suggestion would be to write a simple little ThesaurusQParser that can 
use an instance of the SynonymFilter directly to preprocess the input 
text to get a list of all the Related Terms, and then delegate to another 
QParser to generate an appropriate Query for each of them (typically a 
PhraseQuery), which your ThesaurusQParser would then combine into a giant 
BooleanQuery (except you may want to consider a DisjunctionMaxQuery 
instead because of the scoring factors).

ThesaurusQParser would require very little code, because SynonymFilter 
would be doing all the hard work.
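
The shape of that suggestion can be sketched in Python, with the caveat that a real ThesaurusQParser would be a Java QParser plugin and none of the names below exist in Solr or Lucene. `expand_to_query` builds one phrase clause per term and ORs them together. The two scoring functions only approximate the difference between the query types: a BooleanQuery roughly adds up the scores of the matching clauses, while a DisjunctionMaxQuery takes the best clause plus a tie-breaker fraction of the rest.

```python
def phrase(term):
    """Quote a term as a phrase, escaping embedded backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def expand_to_query(user_term, related_terms, field="text"):
    """Build one phrase clause per term and OR them together."""
    seen, clauses = set(), []
    for term in [user_term, *related_terms]:
        key = term.lower()
        if key not in seen:  # avoid duplicate clauses
            seen.add(key)
            clauses.append(f"{field}:{phrase(term)}")
    return "(" + " OR ".join(clauses) + ")"


def boolean_score(clause_scores):
    """Sum of matching clauses: matching many variants raises the score."""
    return sum(clause_scores)


def dismax_score(clause_scores, tie=0.0):
    """Best clause plus a tie-breaker fraction of the remaining clauses."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


if __name__ == "__main__":
    print(expand_to_query("ski", ["skiing", "down hill skiing"]))
    print(boolean_score([1.0, 0.5]), dismax_score([1.0, 0.5], tie=0.1))
```

The point of the comparison: with summed scores, a document that happens to contain several related terms outranks one that matches the user's own term very well, which is usually not what a thesaurus expansion is meant to do.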


-Hoss

Re: SOLR Thesaurus

Posted by Péter Király <ki...@gmail.com>.
I will also try to define the problem.

In the library world there are general and specialized thesauri,
which reveal the relations between concepts. The relations have types,
as Lee described: Preferred Term (PT), Broader Terms (BT), Narrower
Terms (NT), Related Terms (RT) and others. Some of these thesauri
cover lots of concepts, e.g. the Hungarian Common Thesaurus has more
than 60 000 concepts.

From a searching perspective it would be good if you could use this
knowledge in search.

My imagined use case:
- the user enters a term and maybe turns on a flag to get not just
the term, but all terms that are somehow related to it (usually the
synonyms and narrower terms).
- Solr first finds the queried term(s) in the thesaurus, then finds the
related terms, and modifies and issues the query,
e.g. the query is fruits, and it becomes (fruit OR apple OR banana OR ...)
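
A minimal Python sketch of these two steps, under stated assumptions: the thesaurus lives outside the index as a plain dict, the relation labels (`SYN`, `NT`, `RT`) and all function names are hypothetical, and the result is only a query string that would still have to be sent to Solr.

```python
# Hypothetical external thesaurus: concept -> relation type -> terms.
THESAURUS = {
    "fruit": {"SYN": ["fruits"], "NT": ["apple", "banana"], "RT": ["orchard"]},
}
# Entry-term index: maps any known surface form to its concept key.
ENTRY_TERMS = {"fruit": "fruit", "fruits": "fruit"}


def related_terms(term, expand=True, relations=("SYN", "NT")):
    """Return the terms to search for, honouring the user's expansion flag."""
    key = ENTRY_TERMS.get(term.lower())
    if not expand or key is None:
        return [term]  # flag off or unknown term: search as entered
    terms = [key]
    for rel in relations:
        terms.extend(THESAURUS[key].get(rel, []))
    return list(dict.fromkeys(terms))  # drop duplicates, keep order


def build_query(term, expand=True, relations=("SYN", "NT")):
    """Join the terms with OR, quoting multiword terms as phrases."""
    parts = [f'"{t}"' if " " in t else t
             for t in related_terms(term, expand, relations)]
    return "(" + " OR ".join(parts) + ")"


if __name__ == "__main__":
    print(build_query("fruits"))                    # synonyms and narrower terms
    print(build_query("fruits", expand=False))      # flag turned off
    print(build_query("fruits", relations=("RT",))) # related terms only
```

The `relations` argument is where a user interface could let the user pick which kinds of relations should behave like synonyms.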

This use case is different from the synonym handler, which - as far as
I know - modifies the index and injects synonyms at the position of
the original word. My use case supposes that we maintain the thesaurus as
a separate "database" (maybe another Solr index).

My Solr knowledge is not deep enough to decide whether this use case
could be achieved with a combination of existing patches or contributed
modules.

If someone were to start such a project, I would happily contribute.

Péter
http://eXtensibleCatalog.org


In reply to Chris Hostetter's message of 2010/12/10.

Re: SOLR Thesaurus

Posted by Chris Hostetter <ho...@fucit.org>.

: a term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
: Terms (NT), Related Terms (RT) etc
	...
: User supplied term is, say: Ski
: 
: Preferred term: Skiing
: Broader terms could be: Ski and Snow Boarding, Mountain Sports, Sports
: Narrower terms: down hill skiing, telemark, cross country
: Related terms: boarding, snow boarding, winter holidays

I'm still lost.  

You've described a black box with some sample input ("Ski") and some 
corresponding sample output (PT=..., BT=..., NT=..., RT=....) -- but you 
haven't explained what you want to do with that black box.  Assuming such a 
black box existed in Solr, what are you expecting/hoping to do with it?  
How would such a black box modify Solr's user experience?  What is your 
goal?

Smells like an XY Problem...
http://people.apache.org/~hossman/#xyproblem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss

Re: SOLR Thesaurus

Posted by lee carroll <le...@googlemail.com>.
Hi

Steven, yes, sorry, I should have been plainer.

A term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
Terms (NT), Related Terms (RT) etc.

So

User supplied term is, say: Ski

Preferred term: Skiing
Broader terms could be: Ski and Snow Boarding, Mountain Sports, Sports
Narrower terms: down hill skiing, telemark, cross country
Related terms: boarding, snow boarding, winter holidays
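
For illustration, one way such an entry could be held in code. This is a rough Python sketch that parses the PT/BT/NT/RT record layout from the original question; the `ThesaurusEntry` class and `parse_entry` function are made-up names, and labels the sketch does not model (such as Scope Note) are skipped.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str
    preferred: str = ""
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


# Record labels mapped to attribute names on ThesaurusEntry.
_LABELS = {"PT": "preferred", "BT": "broader", "NT": "narrower", "RT": "related"}


def parse_entry(term, record):
    """Parse 'LABEL: a, b, c' lines into a ThesaurusEntry."""
    entry = ThesaurusEntry(term=term)
    for line in record.strip().splitlines():
        label, _, value = line.partition(":")
        attr = _LABELS.get(label.strip().upper())
        if attr is None:
            continue  # ignore labels this sketch does not model
        values = [v.strip() for v in value.split(",") if v.strip()]
        if attr == "preferred":
            entry.preferred = values[0] if values else ""
        else:
            getattr(entry, attr).extend(values)
    return entry


SKI_RECORD = """
PT: Skiing
BT: Ski and Snow Boarding, Mountain Sports, Sports
NT: down hill skiing, telemark, cross country
RT: boarding, snow boarding, winter holidays
"""

if __name__ == "__main__":
    print(parse_entry("Ski", SKI_RECORD))
```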

Michael,

yes exactly, SKOS, although maybe without the overweening ambition to take
over the world.

By the sounds of it, though, out of the box you get a simple (but pretty
effective) synonym list and ring. Anything more we'd need to write
ourselves, i.e. your thesaurus filter plus a change to the response, as
it would be good to suggest broader terms, narrower terms etc. to the UI.

No plugins out there?

In reply to Michael Zach's message of 2 December 2010 16:16.

Re: SOLR Thesaurus

Posted by Michael Zach <za...@punkt.at>.
Hello Lee,

these bells sound like "SKOS" ;o)

AFAIK Solr does not support thesauri, just plain flat synonym lists.

One could implement a thesaurus filter and put it at the end of the analyzer chain of Solr.

The filter would then do a thesaurus lookup for each token it receives and possibly 
* expand the query 
or
* kind of "stem" document tokens to some preferred variants according to the thesaurus
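
Both options can be sketched as plain token transformations. This is Python for illustration only: a real filter would be a Lucene TokenFilter in Java, and the `PREFERRED` mapping and function names below are hypothetical.

```python
# Hypothetical mapping from variant tokens to the preferred term.
PREFERRED = {"ski": "skiing", "skis": "skiing", "skiing": "skiing"}


def normalize_tokens(tokens, preferred=PREFERRED):
    """Index side: replace each token by its preferred variant, if any."""
    for token in tokens:
        yield preferred.get(token.lower(), token)


def expand_tokens(tokens, narrower):
    """Query side: emit each token followed by its narrower terms."""
    for token in tokens:
        yield token
        yield from narrower.get(token.lower(), [])


if __name__ == "__main__":
    print(list(normalize_tokens(["Ski", "holiday"])))
    print(list(expand_tokens(["skiing"], {"skiing": ["telemark"]})))
```

Because both functions work token by token, multiword thesaurus terms would need extra handling that this sketch leaves out.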

Maybe even take term relations from the thesaurus into account and boost queries or doc fields at index time.

Maybe have a look at http://poolparty.punkt.at/, a full-featured SKOS thesaurus management server.
It also provides web services which could feed such a Solr filter.

Kind regards
Michael


In reply to lee carroll's message of Thursday, 2 December 2010 09:55:54.