Posted to java-user@lucene.apache.org by Ruslan Sivak <rs...@istandfor.com> on 2007/02/28 00:25:15 UTC
optimizing single document searches
I am using Lucene in a slightly unusual way: instead of searching all
the documents for a specific query, I am searching a single document for
many specific queries.
On a single document of 10k characters, doing about 40k searches takes
about 5 seconds. This is not bad, but I was wondering if I can somehow
speed this up. It also takes about 5 seconds to generate the
searchTerms (which is fine, since I will do it once and cache it).
I'm not sure what information would be needed, but my queries look
something like this:
"Brooklyn NY"
I am currently using SpanNearQuery with a slop of 0 and inOrder of
false. Is there perhaps another type of Query I can use to speed things
up? TermQuery doesn't work since I have multiple terms, and PhraseQuery
seems to take around the same time, and is not compatible with
SpanNearQuery (I later merge this query with another in a SpanNearQuery).
I can live without merging this into the SpanNearQuery, as long as I can
find something that can do the 40k searches faster.
Russ
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: optimizing single document searches
Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 28 February 2007 01:01, Russ wrote:
> I will definitely check it out tomorrow.
>
> I also forgot to mention that I am not interested in the hits themselves,
> only whether or not there was a hit. Is there something I can use that's
> optimized for this scenario, or should I look into rewriting the search
> method of the IndexSearcher? Currently I just check hits.size().
For a single document: get the Scorer from the Query via its Weight.
Then check the return value of Scorer.next(); it indicates whether
the only doc matches the query.
Regards,
Paul Elschot.
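The advice above can be pictured with a small plain-Java model. This is not Lucene's Scorer: the class SingleDocMatcher and its exact-phrase matching are hypothetical stand-ins, chosen only to show why one call to next() is enough when the index holds exactly one document. A scorer-like object steps through matching documents; with a single candidate, the first next() is true only if that document matches, and nothing has to be collected or ranked.

```java
import java.util.Collections;
import java.util.List;

/** Toy stand-in for a scorer over an index that holds exactly one document. */
final class SingleDocMatcher {
    private final List<String> docTokens;
    private final List<String> phrase;
    private boolean consumed = false;

    SingleDocMatcher(List<String> docTokens, List<String> phrase) {
        this.docTokens = docTokens;
        this.phrase = phrase;
    }

    /**
     * Advances to the next matching document. There is only one candidate,
     * so the first call answers "is there a hit?" and later calls return false.
     */
    boolean next() {
        if (consumed) {
            return false;
        }
        consumed = true;
        // indexOfSubList returns the first occurrence of the exact, ordered
        // phrase, or -1 when the phrase does not occur at all.
        return Collections.indexOfSubList(docTokens, phrase) >= 0;
    }
}
```

Under these assumptions, a caller builds one matcher per query and reads a single boolean from next(), instead of building a hit list and asking for its size.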
Re: optimizing single document searches
Posted by Russ <rs...@istandfor.com>.
I will definitely check it out tomorrow.
I also forgot to mention that I am not interested in the hits themselves, only whether or not there was a hit. Is there something I can use that's optimized for this scenario, or should I look into rewriting the search method of the IndexSearcher? Currently I just check hits.size().
Russ
Re: optimizing single document searches
Posted by Erick Erickson <er...@gmail.com>.
Which is very, very cool. I wound up using it for hit counting and it
works like a charm....
On 2/27/07, karl wettin <ka...@gmail.com> wrote:
> On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
>
> > On a single document of 10k characters, doing about 40k searches
> > takes about 5 seconds. This is not bad, but I was wondering if I
> > can somehow speed this up.
>
> Your corpus contains only one document? Try contrib/memory, an index
> optimized for that scenario.
>
> --
> karl
Re: optimizing single document searches
Posted by Ruslan Sivak <rs...@istandfor.com>.
karl wettin wrote:
>
> On 28 Feb 2007, at 00:49, Russ wrote:
>
>> Thanks, I will try it tomorrow... Is it significantly different from
>> using a standard index on a RAMDirectory?
>>
>
> A bit different.
>
> You can also try LUCENE-550. It has about the same speed as
> contrib/memory but can handle multiple documents and works with a
> reader, writer and searcher like any other index.
>
> --karl
>
Karl,
Thank you. I tried the contrib/memory and it's awesome. Got my search
time down to 300ms from 5 seconds.
I'm still having some performance issues with the setup. I can probably
live with them, as I'll be caching these terms, but maybe I can optimize
it somehow. It currently takes about 3.5 seconds to set up, since I am
basically creating 40k SpanNearQueries. Here is the method that creates
them. Is there anything I can improve?
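import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Written against the Lucene 2.x API used in this thread
// (TokenStream.next(), Token.termText()). The enclosing class name
// comes from the Lucene.analyzer reference in the original snippet.
public class Lucene {
    private static final Analyzer analyzer = new StandardAnalyzer();

    // Builds one SpanNearQuery from at most 10 analyzed tokens of the input.
    public static SpanNearQuery createSpanNearQuery(String string, int slop,
            boolean inOrder)
    {
        List<SpanQuery> terms = new ArrayList<SpanQuery>();
        TokenStream tokenizer = analyzer.tokenStream("body",
                new StringReader(string));
        try {
            Token token;
            // Stop at the end of the stream or once 10 terms are collected.
            while (terms.size() < 10 && (token = tokenizer.next()) != null) {
                terms.add(new SpanTermQuery(
                        new Term("body", token.termText())));
            }
        } catch (IOException e) {
            // On a read error, keep the terms collected so far.
            e.printStackTrace();
        }
        return new SpanNearQuery(
                terms.toArray(new SpanQuery[terms.size()]), slop, inOrder);
    }
}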
Russ
Re: optimizing single document searches
Posted by karl wettin <ka...@gmail.com>.
On 28 Feb 2007, at 00:49, Russ wrote:
> Thanks, I will try it tomorrow... Is it significantly different
> from using a standard index on a RAMDirectory?
>
A bit different.
You can also try LUCENE-550. It has about the same speed as
contrib/memory but can handle multiple documents and works with a
reader, writer and searcher like any other index.
--
karl
Re: optimizing single document searches
Posted by Russ <rs...@istandfor.com>.
Thanks, I will try it tomorrow... Is it significantly different from using a standard index on a RAMDirectory?
Russ
Re: optimizing single document searches
Posted by karl wettin <ka...@gmail.com>.
On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
> On a single document of 10k characters, doing about 40k searches
> takes about 5 seconds. This is not bad, but I was wondering if I
> can somehow speed this up.
Your corpus contains only one document? Try contrib/memory, an index
optimized for that scenario.
--
karl
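To give a feel for why an index specialised for one document can answer many queries quickly, here is a toy plain-Java model. It is not the contrib/memory implementation, and the names TinyMemoryIndex and matchesAdjacent, as well as the lowercase split-on-punctuation tokenization, are assumptions made for illustration. The assumed idea: tokenize the document once, keep a map from each term to its positions, then answer each query with a few map lookups. The query modelled here is the two-term, slop 0, unordered case from the original post.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy single-document positional index. Illustrative only; not Lucene code. */
final class TinyMemoryIndex {
    private final Map<String, Set<Integer>> positions = new HashMap<>();

    /** Tokenizes the text once and records every position of every term. */
    TinyMemoryIndex(String text) {
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            positions.computeIfAbsent(token, k -> new HashSet<>()).add(pos++);
        }
    }

    /** True if the two terms occur at neighbouring positions, in either order. */
    boolean matchesAdjacent(String a, String b) {
        Set<Integer> pa = positions.get(a);
        Set<Integer> pb = positions.get(b);
        if (pa == null || pb == null) {
            return false; // a term missing from the document rules the query out
        }
        for (int p : pa) {
            if (pb.contains(p - 1) || pb.contains(p + 1)) {
                return true; // stop at the first hit; only yes/no is needed
            }
        }
        return false;
    }
}
```

In this sketch the document is built once, for example `new TinyMemoryIndex(documentText)`, and each of the 40k queries then costs two hash lookups plus a scan over the positions of one term.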