Posted to solr-user@lucene.apache.org by Jaco <jd...@gmail.com> on 2009/04/09 20:57:43 UTC

Dictionary lookup possibilities

Hello,

I'm struggling with some ideas, maybe somebody can help me with past
experiences or tips. I have loaded a dictionary into a Solr index, using
stemming and some stopwords in the analysis part of the schema. Each record
holds a term from the dictionary, which can consist of multiple words. For
some data analysis work, I want to send pieces of text (sentences actually)
to Solr to retrieve all possible dictionary terms that could occur. Ideally,
I want to construct a query that only returns those Solr records for which
all individual words in that record are matched.

For instance, my dictionary holds the following terms:
1 - a b c d
2 - c d e
3 - a b
4 - a e f g h

If I put the sentence [a b c d f g h] in as a query, I want to receive
dictionary items 1 (matching all words a b c d) and 3 (matching words a b)
as matches.

I have been puzzling about how to do this. The only way I found so far was
to construct an OR query with all words of the sentence in it. In this case,
that would result in all dictionary items being returned. This would then
require some code to go over the search results and analyse each of them
(e.g. by using the highlight function) to kick out 'false' matches, but I am
looking for a more efficient way.
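
A minimal sketch of the post-filtering step described above, in plain Python rather than Solr. It assumes the OR query has already returned candidate records and keeps only those whose words all occur in the sentence. The function name `fully_matched` is hypothetical, the data is the example from this message, and stemming, stopwords, word order and adjacency are all ignored here.

```python
# Example dictionary from the message: record id -> term (possibly multi-word).
DICTIONARY = {
    1: "a b c d",
    2: "c d e",
    3: "a b",
    4: "a e f g h",
}


def fully_matched(sentence, dictionary):
    """Return ids of records whose every word appears somewhere in the sentence."""
    sentence_words = set(sentence.split())
    return sorted(
        entry_id
        for entry_id, term in dictionary.items()
        # Subset test: all words of the record must be present in the sentence.
        if set(term.split()) <= sentence_words
    )


print(fully_matched("a b c d f g h", DICTIONARY))  # prints [1, 3]
```

Because this is a set comparison, a record such as "b a" would also match the sentence; requiring order or adjacency needs a phrase-style check instead.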

Is there a way to do this with Solr functionality, or do I need to start
looking into the Lucene API ..?

Any help would be much appreciated as usual!

Thanks, bye,

Jaco.

Re: Dictionary lookup possibilities

Posted by Jaco <jd...@gmail.com>.
Hi,

Thanks for the suggestions! It looks like the MemoryIndex is worth having a
detailed look at, so that's what I'll start on.

Thanks again, bye,

Jaco.


2009/4/17 Steven A Rowe <sa...@syr.edu>

> Hi Jaco,
>
> On 4/9/2009 at 2:58 PM, Jaco wrote:
> > I'm struggling with some ideas, maybe somebody can help me with past
> > experiences or tips. I have loaded a dictionary into a Solr index,
> > using stemming and some stopwords in the analysis part of the schema.
> > Each record holds a term from the dictionary, which can consist of
> > multiple words. For some data analysis work, I want to send pieces
> > of text (sentences actually) to Solr to retrieve all possible
> > dictionary terms that could occur. Ideally, I want to construct a
> > query that only returns those Solr records for which all individual
> > words in that record are matched.
> >
> > For instance, my dictionary holds the following terms:
> > 1 - a b c d
> > 2 - c d e
> > 3 - a b
> > 4 - a e f g h
> >
> > If I put the sentence [a b c d f g h] in as a query, I want to receive
> > dictionary items 1 (matching all words a b c d) and 3 (matching words a
> > b) as matches.
> >
> > I have been puzzling about how to do this. The only way I found so far
> > was to construct an OR query with all words of the sentence in it. In
> > this case, that would result in all dictionary items being returned.
> > This would then require some code to go over the search results and
> > analyse each of them (e.g. by using the highlight function) to kick
> > out 'false' matches, but I am looking for a more efficient way.
> >
> > Is there a way to do this with Solr functionality, or do I need to
> > start looking into the Lucene API ..?
>
> Your problem could be modeled as a set of standing queries, where your
> dictionary entries are the *queries* (with all words required, maybe using a
> PhraseQuery or a SpanNearQuery), and the sentence is the document.
>
> Solr may not be usable in this context (extremely high volume queries),
> depending on your throughput requirements, but Lucene's MemoryIndex was
> designed for this kind of thing:
>
> <http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html>
>
> Steve
>
>

RE: Dictionary lookup possibilities

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Jaco,

On 4/9/2009 at 2:58 PM, Jaco wrote:
> I'm struggling with some ideas, maybe somebody can help me with past
> experiences or tips. I have loaded a dictionary into a Solr index,
> using stemming and some stopwords in the analysis part of the schema.
> Each record holds a term from the dictionary, which can consist of
> multiple words. For some data analysis work, I want to send pieces
> of text (sentences actually) to Solr to retrieve all possible
> dictionary terms that could occur. Ideally, I want to construct a
> query that only returns those Solr records for which all individual
> words in that record are matched.
> 
> For instance, my dictionary holds the following terms:
> 1 - a b c d
> 2 - c d e
> 3 - a b
> 4 - a e f g h
> 
> If I put the sentence [a b c d f g h] in as a query, I want to receive
> dictionary items 1 (matching all words a b c d) and 3 (matching words a
> b) as matches.
> 
> I have been puzzling about how to do this. The only way I found so far
> was to construct an OR query with all words of the sentence in it. In
> this case, that would result in all dictionary items being returned.
> This would then require some code to go over the search results and
> analyse each of them (e.g. by using the highlight function) to kick
> out 'false' matches, but I am looking for a more efficient way.
> 
> Is there a way to do this with Solr functionality, or do I need to
> start looking into the Lucene API ..?

Your problem could be modeled as a set of standing queries, where your dictionary entries are the *queries* (with all words required, maybe using a PhraseQuery or a SpanNearQuery), and the sentence is the document.

Solr may not be usable in this context (extremely high volume queries), depending on your throughput requirements, but Lucene's MemoryIndex was designed for this kind of thing:

<http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html>
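
The inversion described above can be sketched roughly as follows. This is a plain-Python stand-in for the idea, not Lucene's MemoryIndex API: each dictionary entry becomes a standing phrase query, and every incoming sentence is treated as a one-off document that all queries are run against. The names `compile_queries`, `contains_phrase` and `match_sentence` are hypothetical, and analysis (stemming, stopwords) is left out.

```python
def compile_queries(dictionary):
    """Turn each dictionary entry into a standing phrase query (a token tuple)."""
    return {entry_id: tuple(term.split()) for entry_id, term in dictionary.items()}


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous, in-order run inside `tokens`."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == phrase for i in range(len(tokens) - n + 1))


def match_sentence(sentence, queries):
    """Run every standing query against one sentence (the 'document')."""
    tokens = sentence.split()
    return sorted(
        query_id for query_id, phrase in queries.items()
        if contains_phrase(tokens, phrase)
    )


# Queries are built once and reused for every sentence.
QUERIES = compile_queries({1: "a b c d", 2: "c d e", 3: "a b", 4: "a e f g h"})
print(match_sentence("a b c d f g h", QUERIES))  # prints [1, 3]
```

Unlike a bag-of-words check, this requires the words of an entry to appear adjacent and in order, which is closer to a PhraseQuery than to a SpanNearQuery with slop.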

Steve


Re: Dictionary lookup possibilities

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Apr 17, 2009 at 3:37 AM, Chris Hostetter
<ho...@fucit.org> wrote:

>
>  this is a pretty hard problem in general ... in my mind i call it the
> "longest matching sub-phrase" problem, but i have no idea if it has a real
> name.
>
> the only solution i know of using Lucene is to construct a phrase query
> for each of the sub phrases, giving a bigger query boost to the "longer"
> phrases ... but it might be possible to design a custom query impl for
> solving this problem.
>

There was an issue opened for something similar, but there is no patch yet.

https://issues.apache.org/jira/browse/SOLR-633

-- 
Regards,
Shalin Shekhar Mangar.

Re: Dictionary lookup possibilities

Posted by Chris Hostetter <ho...@fucit.org>.
: For instance, my dictionary holds the following terms:
: 1 - a b c d
: 2 - c d e
: 3 - a b
: 4 - a e f g h
: 
: If I put the sentence [a b c d f g h] in as a query, I want to receive
: dictionary items 1 (matching all words a b c d) and 3 (matching words a b)
: as matches.

this is a pretty hard problem in general ... in my mind i call it the 
"longest matching sub-phrase" problem, but i have no idea if it has a real 
name.

the only solution i know of using Lucene is to construct a phrase query 
for each of the sub phrases, giving a bigger query boost to the "longer" 
phrases ... but it might be possible to design a custom query impl for 
solving this problem.
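
One way to picture the sub-phrase approach, as an in-memory Python sketch rather than actual Lucene phrase queries: enumerate every contiguous sub-phrase of the sentence, look each one up among the dictionary terms, and give longer matches a larger boost. The function name `ranked_subphrase_matches` and the choice of "boost = number of words" are assumptions made for illustration; terms are assumed to be separated by single spaces.

```python
def ranked_subphrase_matches(sentence, dictionary):
    """Look up every contiguous sub-phrase; longer matches get a larger boost."""
    phrase_to_id = {term: entry_id for entry_id, term in dictionary.items()}
    tokens = sentence.split()
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            phrase = " ".join(tokens[start:end])
            if phrase in phrase_to_id:
                boost = end - start  # assumed boost: number of words in the phrase
                hits.append((boost, phrase_to_id[phrase], phrase))
    # Highest boost first; ties broken by record id.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


DICTIONARY = {1: "a b c d", 2: "c d e", 3: "a b", 4: "a e f g h"}
for boost, entry_id, phrase in ranked_subphrase_matches("a b c d f g h", DICTIONARY):
    print(boost, entry_id, phrase)
```

A sentence of n words has n(n+1)/2 contiguous sub-phrases, so the number of lookups grows quadratically with sentence length.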

(i've never had an important enough use case to dedicate a significant 
amount of time to figuring it out)





-Hoss