Posted to solr-user@lucene.apache.org by Steven A Rowe <sa...@syr.edu> on 2010/11/05 23:18:46 UTC

RE: How do I this in Solr?

Hi Varun,

On 10/26/2010 at 11:26 PM, Varun Gupta wrote:
> I will try to implement the two filters suggested by Steven and see how
> the performance matches up.

Have you made any progress?

I was thinking about your use case, and it occurred to me that you could get what you want by reversing the problem, using Lucene's MemoryIndex <http://lucene.apache.org/java/3_0_2/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html>.  (As far as I can tell, this functionality -- i.e. standing queries a.k.a. routing a.k.a. filtering -- is not present in Solr.)

You can load your query (as a document) into a MemoryIndex, and then use each of your documents to query against it, something like (untested!):

	// Each stored document becomes an AND query over its own words.
	Map<String,Query> documents = new HashMap<String,Query>();
	Analyzer analyzer = new WhitespaceAnalyzer();
	QueryParser parser = new QueryParser("content", analyzer);
	parser.setDefaultOperator(QueryParser.Operator.AND);
	documents.put("ID001", parser.parse("nokia n95"));
	documents.put("ID002", parser.parse("GPS"));
	documents.put("ID003", parser.parse("android"));
	documents.put("ID004", parser.parse("samsung"));
	documents.put("ID005", parser.parse("samsung android"));
	documents.put("ID006", parser.parse("nokia android"));
	documents.put("ID007", parser.parse("mobile with GPS"));

	// The incoming search text is indexed as the only document in the MemoryIndex.
	MemoryIndex index = new MemoryIndex();
	index.addField("content", "samsung android GPS", analyzer);

	for (Map.Entry<String,Query> entry : documents.entrySet()) {
	  Query query = entry.getValue();
	  // search() returns a relevance score; anything above 0.0f is a match.
	  if (index.search(query) > 0.0f) {
	    String docId = entry.getKey();
	    // Do something with the hits here ...
	  }
	}

In the above example, the documents "samsung", "GPS", "android" and "samsung android" would be hits, and the other documents would not be, just as you wanted.
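
To make the matching rule itself concrete (a stored document is a hit only when every one of its words occurs in the search text), here is a minimal plain-Java sketch that needs no Lucene at all. The class and method names are made up for illustration, and splitting on whitespace only mimics what WhitespaceAnalyzer does; this is not how MemoryIndex works internally.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class SubsetMatcher {

    // Split on runs of whitespace and keep case as-is.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // IDs of the documents whose every word occurs in the search text.
    static List<String> matchingIds(Map<String, String> documents, String searchText) {
        Set<String> searchTokens = tokens(searchText);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            if (searchTokens.containsAll(tokens(entry.getValue()))) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("ID001", "nokia n95");
        documents.put("ID002", "GPS");
        documents.put("ID003", "android");
        documents.put("ID004", "samsung");
        documents.put("ID005", "samsung android");
        documents.put("ID006", "nokia android");
        documents.put("ID007", "mobile with GPS");
        // prints [ID002, ID003, ID004, ID005]
        System.out.println(matchingIds(documents, "samsung android GPS"));
    }
}
```

Like the loop above, this visits every stored document once per search, so it is only meant to show the expected results, not to be fast.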

MemoryIndex is designed to be very fast for this kind of usage, so even hundreds of thousands of documents should be feasible.
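
If looping over every stored query for each search turns out to be too slow, one possible alternative (my assumption, not something MemoryIndex or Solr provides) is to keep a map from each term to the documents containing it, count how many of a document's distinct terms the search text covers, and accept the document when that count equals its total number of distinct terms. A rough sketch, with a hypothetical class name and assuming each document ID is added only once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration; not part of Lucene or Solr.
final class TermCountIndex {
    // term -> IDs of the documents containing that term
    private final Map<String, List<String>> postings = new HashMap<>();
    // document ID -> number of distinct terms in that document
    private final Map<String, Integer> termCounts = new HashMap<>();

    // Assumes each docId is added at most once.
    void add(String docId, String text) {
        Set<String> terms = distinctTerms(text);
        termCounts.put(docId, terms.size());
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
    }

    // IDs of the documents all of whose distinct terms occur in the search text.
    Set<String> search(String searchText) {
        Map<String, Integer> matched = new HashMap<>();
        for (String term : distinctTerms(searchText)) {
            List<String> docIds = postings.getOrDefault(term, Collections.emptyList());
            for (String docId : docIds) {
                matched.merge(docId, 1, Integer::sum);
            }
        }
        Set<String> hits = new HashSet<>();
        for (Map.Entry<String, Integer> entry : matched.entrySet()) {
            // A hit only if every distinct term of the document was seen.
            if (entry.getValue().equals(termCounts.get(entry.getKey()))) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    // Whitespace tokenization, case preserved.
    private static Set<String> distinctTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

With this layout a search only touches documents that share at least one term with the search text, rather than every stored document.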

Steve

> -----Original Message-----
> From: Varun Gupta [mailto:varun.vgupta@gmail.com]
> Sent: Tuesday, October 26, 2010 11:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How do I this in Solr?
> 
> Thanks everybody for the inputs.
> 
> Looks like Steven's solution is the closest one but will lead to
> performance issues when the query string has many terms.
> 
> I will try to implement the two filters suggested by Steven and see how
> the performance matches up.
> 
> --
> Thanks
> Varun Gupta
> 
> 
> On Wed, Oct 27, 2010 at 8:04 AM, scott chu (朱炎詹)
> <sc...@udngroup.com> wrote:
> 
> > I think you have to write a "yet exact match" handler yourself (I say
> > "yet" because it's not quite the exact match we normally know). Steve's
> > answer is quite near your request. You can do further work based on his
> > solution.
> >
> > As the last step, I'd suggest you strip all blanks from the query string
> > and from each query result, and return only those results that have the
> > same string length as the query string.
> >
> > For example, given:
> > *query string = "Samsung with GPS"
> > *query results:
> > result 1 = "Samsung has lots of mobile with GPS"
> > result 2 = "with GPS Samsung"
> > result 3 = "GPS mobile with vendors, such as Sony, Samsung"
> >
> > they become:
> > *query string = "SamsungwithGPS" (length = 14)
> > *query results:
> > result 1 = "SamsunghaslotsofmobilewithGPS" (length = 29)
> > result 2 = "withGPSSamsung" (length = 14)
> > result 3 = "GPSmobilewithvendors,suchasSony,Samsung" (length = 39)
> >
> > so result 2 matches your request.
> >
> > In this way, you can avoid a lot of work dealing with case sensitivity
> > and word reordering. Furthermore, you can do refined work, such as
> > removing whitespace characters, etc.
> >
> > Scott @ Taiwan
> >
> >
> > ----- Original Message -----
> > From: "Varun Gupta" <va...@gmail.com>
> > To: <so...@lucene.apache.org>
> > Sent: Tuesday, October 26, 2010 9:07 PM
> > Subject: How do I this in Solr?
> >
> >> Hi,
> >>
> >> I have a lot of small documents (each containing 1 to 15 words) indexed
> >> in Solr. For a search query, I want the search results to contain only
> >> those documents that satisfy this criterion: "All of the words of the
> >> search result document are present in the search query".
> >>
> >> For example:
> >> If I have the following documents indexed: "nokia n95", "GPS",
> >> "android", "samsung", "samsung android", "nokia android",
> >> "mobile with GPS"
> >>
> >> If I search with the text "samsung android GPS", the search results
> >> should contain only "samsung", "GPS", "android" and "samsung android".
> >>
> >> Is there a way to do this in Solr?
> >>
> >> --
> >> Thanks
> >> Varun Gupta

Re: How do I this in Solr?

Posted by Varun Gupta <va...@gmail.com>.
I haven't been able to work on it because of some other commitments. The
MemoryIndex approach seems promising. The only thing I will have to check is
the memory requirement, as I have close to 2 million documents.

Will let you know if I can make it work.

Thanks a lot!

--
Varun Gupta
