Posted to java-user@lucene.apache.org by Ruslan Sivak <rs...@istandfor.com> on 2007/02/28 00:25:15 UTC

optimizing single document searches

I am using Lucene in a somewhat unusual way: instead of searching all 
the documents for a specific query, I am searching a single document for 
many specific queries.

On a single document of 10k characters, doing about 40k searches takes 
about 5 seconds.  This is not bad, but I was wondering if I can somehow 
speed this up.  It also takes about 5 seconds to generate the 
searchTerms (which is fine, since I will do it once and cache it). 

I'm not sure what information would be needed, but my queries look 
something like this:

"Brooklyn NY"

I am currently using SpanNearQuery with a slop of 0 and inOrder of 
false.  Is there perhaps another type of Query I can use to speed things 
up?  TermQuery doesn't work since I have multiple terms, and PhraseQuery 
seems to take around the same time, and is not compatible with 
SpanNearQuery (I later merge this query with another in a SpanNearQuery). 
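
For reference, the "Brooklyn NY" query above built by hand looks roughly like 
this (assuming the indexed field is called "body" and the analyzer lower-cases 
terms; constructor signatures are from the Lucene 2.x span API):

    // slop 0 and inOrder false: the two terms must be adjacent, in either order
    SpanTermQuery brooklyn = new SpanTermQuery(new Term("body", "brooklyn"));
    SpanTermQuery ny = new SpanTermQuery(new Term("body", "ny"));
    SpanNearQuery query = new SpanNearQuery(new SpanQuery[] { brooklyn, ny }, 0, false);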

I can live without merging this into the SpanNearQuery, as long as I can 
find something that can do the 40k searches faster. 

Russ



Re: optimizing single document searches

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 28 February 2007 01:01, Russ wrote:
> I will definitely check it out tomorrow.
> 
> I also forgot to mention that I am not interested in the hits themselves, 
> only whether or not there was a hit.  Is there something I can use that's 
> optimized for this scenario, or should I look into rewriting the search 
> method of the IndexSearcher?  Currently I just check hits.size().

For a single document: get the Scorer from the Query via Weight.
Then check the return value of Scorer.next(); it will indicate whether
the only doc matches the query.
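
A minimal sketch of that, assuming the Lucene 2.x Weight/Scorer method names
(query.weight(searcher), weight.scorer(reader), scorer.next()), which may
differ in other releases:

    // searcher/reader are open over the single-document index
    Weight weight = query.weight(searcher);
    Scorer scorer = weight.scorer(searcher.getIndexReader());
    // next() advances to the first matching doc; with only one document,
    // a return value of true means that document matches the query
    boolean matches = scorer.next();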

Regards,
Paul Elschot.


> 
> Russ
> Sent wirelessly via BlackBerry from T-Mobile.  
> 
> -----Original Message-----
> From: "Erick Erickson" <er...@gmail.com>
> Date: Tue, 27 Feb 2007 18:49:45 
> To:java-user@lucene.apache.org
> Subject: Re: optimizing single document searches
> 
> Which is very, very cool. I wound up using it for hit counting and it
> works like a charm....
> 
> On 2/27/07, karl wettin <ka...@gmail.com> wrote:
> >
> >
> > On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
> >
> > > On a single document of 10k characters, doing about 40k searches
> > > takes about 5 seconds.  This is not bad, but I was wondering if I
> > > can somehow speed this up.
> >
> > Your corpus contains only one document? Try contrib/memory, an index
> > optimized for that scenario.
> >
> > --
> > karl
> >
> >
> >
> 
> 
> 
> 



Re: optimizing single document searches

Posted by Russ <rs...@istandfor.com>.
I will definitely check it out tomorrow.

I also forgot to mention that I am not interested in the hits themselves, only whether or not there was a hit.  Is there something I can use that's optimized for this scenario, or should I look into rewriting the search method of the IndexSearcher?  Currently I just check hits.size().

Russ
Sent wirelessly via BlackBerry from T-Mobile.  

-----Original Message-----
From: "Erick Erickson" <er...@gmail.com>
Date: Tue, 27 Feb 2007 18:49:45 
To:java-user@lucene.apache.org
Subject: Re: optimizing single document searches

Which is very, very cool. I wound up using it for hit counting and it
works like a charm....

On 2/27/07, karl wettin <ka...@gmail.com> wrote:
>
>
> On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
>
> > On a single document of 10k characters, doing about 40k searches
> > takes about 5 seconds.  This is not bad, but I was wondering if I
> > can somehow speed this up.
>
> Your corpus contains only one document? Try contrib/memory, an index
> optimized for that scenario.
>
> --
> karl
>
>
>


Re: optimizing single document searches

Posted by Erick Erickson <er...@gmail.com>.
Which is very, very cool. I wound up using it for hit counting and it
works like a charm....

On 2/27/07, karl wettin <ka...@gmail.com> wrote:
>
>
> On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
>
> > On a single document of 10k characters, doing about 40k searches
> > takes about 5 seconds.  This is not bad, but I was wondering if I
> > can somehow speed this up.
>
> Your corpus contains only one document? Try contrib/memory, an index
> optimized for that scenario.
>
> --
> karl
>
>
>

Re: optimizing single document searches

Posted by Ruslan Sivak <rs...@istandfor.com>.
karl wettin wrote:
>
> On 28 Feb 2007, at 00:49, Russ wrote:
>
>> Thanks, I will try it tomorrow... Is it significantly different from 
>> using a standard index on a ramdir?
>>
>
> A bit different.
>
> You can also try LUCENE-550. It has about the same speed as 
> contrib/memory but can handle multiple documents and use reader, 
> writer and searcher as any other index.
>
> --karl
>
Karl,

Thank you.  I tried the contrib/memory and it's awesome.  Got my search 
time down to 300ms from 5 seconds. 

I'm still having some performance issues with the setup.  I can probably 
live with them, as I'll be caching these terms, but maybe I can optimize 
it somehow.  It currently takes about 3.5 seconds to set up.  I am 
basically creating 40k SpanNearQueries.  Here is my method that creates 
them.  Is there anything I can improve?

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

private static Analyzer analyzer = new StandardAnalyzer();

public static SpanNearQuery createSpanNearQuery(String string, int slop,
                                                boolean inOrder)
{
    // Tokenize the input and wrap each token in a SpanTermQuery on the
    // "body" field, keeping at most 10 terms per query.
    List terms = new ArrayList();
    TokenStream tokenizer = analyzer.tokenStream("body", new StringReader(string));
    Token token = null;
    do {
        try {
            token = tokenizer.next();
        } catch (IOException e) {
            e.printStackTrace();
            token = null;  // stop tokenizing on error instead of looping forever
        }
        if (token != null) {
            terms.add(new SpanTermQuery(new Term("body", token.termText())));
        }
    } while (token != null && terms.size() < 10);

    SpanTermQuery[] termsArray =
        (SpanTermQuery[]) terms.toArray(new SpanTermQuery[terms.size()]);
    return new SpanNearQuery(termsArray, slop, inOrder);
}
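
For the "Brooklyn NY" query from the first message (slop 0, inOrder false), 
the call would presumably be:

    SpanNearQuery query = createSpanNearQuery("Brooklyn NY", 0, false);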


Russ



Re: optimizing single document searches

Posted by karl wettin <ka...@gmail.com>.
On 28 Feb 2007, at 00:49, Russ wrote:

> Thanks, I will try it tomorrow... Is it significantly different  
> from using a standard index on a ramdir?
>

A bit different.

You can also try LUCENE-550. It has about the same speed as 
contrib/memory but can handle multiple documents and use reader, writer and 
searcher as any other index.

-- 
karl

> Russ
> Sent wirelessly via BlackBerry from T-Mobile.
>
> -----Original Message-----
> From: karl wettin <ka...@gmail.com>
> Date: Wed, 28 Feb 2007 00:37:55
> To:java-user@lucene.apache.org
> Subject: Re: optimizing single document searches
>
>
> On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
>
>> On a single document of 10k characters, doing about 40k searches
>> takes about 5 seconds.  This is not bad, but I was wondering if I
>> can somehow speed this up.
>
> Your corpus contains only one document? Try contrib/memory, an index
> optimized for that scenario.
>
> -- 
> karl
>
>




Re: optimizing single document searches

Posted by Russ <rs...@istandfor.com>.
Thanks, I will try it tomorrow... Is it significantly different from using a standard index on a ramdir?

Russ
Sent wirelessly via BlackBerry from T-Mobile.  

-----Original Message-----
From: karl wettin <ka...@gmail.com>
Date: Wed, 28 Feb 2007 00:37:55 
To:java-user@lucene.apache.org
Subject: Re: optimizing single document searches


On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:

> On a single document of 10k characters, doing about 40k searches  
> takes about 5 seconds.  This is not bad, but I was wondering if I  
> can somehow speed this up.

Your corpus contains only one document? Try contrib/memory, an index  
optimized for that scenario.

-- 
karl



Re: optimizing single document searches

Posted by karl wettin <ka...@gmail.com>.
On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:

> On a single document of 10k characters, doing about 40k searches  
> takes about 5 seconds.  This is not bad, but I was wondering if I  
> can somehow speed this up.

Your corpus contains only one document? Try contrib/memory, an index  
optimized for that scenario.
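
A minimal sketch of contrib/memory (org.apache.lucene.index.memory.MemoryIndex),
assuming its addField/search methods as documented for Lucene 2.x; the field
name "body" and the documentText variable are just placeholders:

    MemoryIndex index = new MemoryIndex();
    // index the single document's text under the "body" field
    index.addField("body", documentText, new StandardAnalyzer());
    // search() returns a score; anything greater than 0.0f means the document matches
    boolean matches = index.search(query) > 0.0f;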

-- 
karl
