You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by François Eric <fr...@jarca.ca> on 2009/12/18 15:08:54 UTC

Query joining 2 indexes

Hello,

I have a performance problem and would need expert advice on how to go 
about fixing it:

I currently have 2 indexes: Daily and Hourly.  The Daily index contains 
about 1,000,000 documents and my Hourly index approximately: 24,000,000 
documents.  My Daily index contains many fields and some of them are IDs 
to my Hourly Index.
What I want to do is fetch data in one request (if possible).
Right now I do it in many requests:
1- Get the matching Daily documents (say it returns 500 documents)
2- For each of these documents, locate the Hourly Index Id and fetch it.

Therefore I make 501 requests to lucene.  This causes some performance 
issues I guess because of the overhead to  making a request to Lucene.

Is it possible to do this in 1 request?  I'm thinking no because I'm not 
sure what the result set would be but maybe I'm missing something.

If not I guess it would be possible to build a query with my 500 hourly 
ids and make a OR between them to make it in 2 requests....but then I 
have to find the matching documents.  Will this overflow if I have 50000 
ids in my query?

Anyway, I just want advice on how one would address this situation.

Thank you very much,

François

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Query joining 2 indexes

Posted by frer <fr...@polymtl.ca>.

Hey Erick,

Ok that's what I thought.  Unfortunately doing that will pretty much take
the same time as what I was doing (at least I think it will)

What I ended up doing is fetching them all with a big query.  I obviously
needed to change the MaxClauseCount but in my case this is an acceptable
solution... unless you (or anyone else) has a better idea.

Here is my resulting code:

            QueryParser parser = new QueryParser("ID", new
StandardAnalyzer());
            StringBuffer idQueryBuffer = new StringBuffer();
            Iterator<String> i = hourlyIds.iterator();
            while(i.hasNext()) {
                String id = i.next();
                idQueryBuffer.append(id).append(" ");
            }
            Query query = parser.parse(idQueryBuffer.toString());

            BooleanQuery booleanQuery = new BooleanQuery();
            booleanQuery.add(query, BooleanClause.Occur.MUST);
            
            TopDocs countDocs = searcher.search(booleanQuery, null, 1);


Thanks a lot for your help Erick and happy holidays to everyone,

François



Erick Erickson wrote:
> 
> From your original post...
> 
> <<<My Daily index contains many fields and some of them are IDs to my
> Hourly
> Index>>>
> 
> 
> So it's just the IDs to your hourly index contained in each Daily
> document.
> Problem here is that this ID is probably NOT the Lucene ID, so my original
> idea needs some refinement. Assuming that your IDs to the hourly index
> are contained in your daily document, you should be able to use TermDocs
> to get the Hourly documents efficiently.
> 
> Note to self: Engage brain.
> 
> FWIW
> Erick
> 
> On Fri, Dec 18, 2009 at 11:42 AM, frer <fr...@polymtl.ca> wrote:
> 
>>
>> Thanks for your answer,
>>
>> I didn't think that using:          Document doc irHourly.doc(<id in
>> hit>);
>> would be much faster than using the searcher. I will try that.
>>
>> I have one question though: what is the <id in hit> you reffer to.  Since
>> I
>> have searched in the daily index, what is the corresponding hit on the
>> hourly index if i haven't searched in the hourly index yet?
>>
>> Thanks for your help,
>>
>> Francois
>>
>> PS: Concerning your idea to merge the hourly and daily index, I did think
>> about that but the quantity of info I have is already way too large in
>> both
>> indexes (about 100 fields in each) to repeat that information so many
>> times.
>> My Index is already of 10GB.
>>
>>
>>
>> Erick Erickson wrote:
>> >
>> > Well, making a large OR clause is definitely more efficient than making
>> > N different requests, but you would have to search the results. It
>> doesn't
>> > sound very performant.
>> >
>> > Could you go to 50,000 ids? yes, but you have to fiddle with
>> > setMaxClauseCount
>> > because Lucene defaults to a max of 1,024.
>> >
>> > There's no way I know of to emulate a DB join. Usually, whenever I try
>> > the answer is to flatten my data but I don't see a good way to do that
>> > in this case.
>> >
>> > Hmmm, what do you think would happen if you opened an IndexReader
>> > into your Hourly index straight from your Daily query? Something like:
>> >
>> > IndexReader irHourly = <open index reader to Hourly>.
>> > for (each hit in Daily) {
>> >     for (each id in this Hourly doc) {
>> >          Document doc irHourly.doc(<id in hit>);
>> >          addToResponse(doc); // your method here.....
>> >     }
>> > }
>> >
>> >
>> > You *probably* want to keep the Hourly reader open between requests,
>> but
>> > since the above isn't searching, you *might* be able to open it each
>> time.
>> > I'd
>> > go for keeping it open between requests if at all possible.
>> >
>> > And here's a wild and crazy idea. Remember that Lucene documents don't
>> > require that *any* fields be in common. It might make your management
>> > easier if you *combined* the indexes. Crudely, prefix each Daily field
>> > with
>> > "D_" and each hourly field with "H_" and put 'em all in the same index.
>> > I'm
>> > not claiming that's a good solution in your case, but I thought I'd
>> > mention
>> > it
>> > as a possibility. You still can't do joins on them though.....
>> >
>> > Erick
>> >
>> >
>> >
>> > On Fri, Dec 18, 2009 at 9:08 AM, François Eric
>> > <fr...@jarca.ca>wrote:
>> >
>> >> Hello,
>> >>
>> >> I have a performance problem and would need expert advice on how to go
>> >> about fixing it:
>> >>
>> >> I currently have 2 indexes: Daily and Hourly.  The Daily index
>> contains
>> >> about 1,000,000 documents and my Hourly index approximately:
>> 24,000,000
>> >> documents.  My Daily index contains many fields and some of them are
>> IDs
>> >> to
>> >> my Hourly Index.
>> >> What I want to do is fetch data in one request (if possible).
>> >> Right now I do it in many requests:
>> >> 1- Get the matching Daily documents (say it returns 500 documents)
>> >> 2- For each of these documents, locate the Hourly Index Id and fetch
>> it.
>> >>
>> >> Therefore I make 501 requests to lucene.  This causes some performance
>> >> issues I guess because of the overhead to  making a request to Lucene.
>> >>
>> >> Is it possible to do this in 1 request?  I'm thinking no because I'm
>> not
>> >> sure what the result set would be but maybe I'm missing something.
>> >>
>> >> If not I guess it would be possible to build a query with my 500
>> hourly
>> >> ids
>> >> and make a OR between them to make it in 2 requests....but then I have
>> to
>> >> find the matching documents.  Will this overflow if I have 50000 ids
>> in
>> >> my
>> >> query?
>> >>
>> >> Anyway, I just want advice on how one would address this situation.
>> >>
>> >> Thank you very much,
>> >>
>> >> François
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Query-joining-2-indexes-tp26843980p26846129.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://old.nabble.com/Query-joining-2-indexes-tp26843980p26848434.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Query joining 2 indexes

Posted by Erick Erickson <er...@gmail.com>.

>From your original post...

<<<My Daily index contains many fields and some of them are IDs to my Hourly
Index>>>


So it's just the IDs to your hourly index contained in each Daily document.
Problem here is that this ID is probably NOT the Lucene ID, so my original
idea needs some refinement. Assuming that your IDs to the hourly index
are contained in your daily document, you should be able to use TermDocs
to get the Hourly documents efficiently.

Note to self: Engage brain.

FWIW
Erick

On Fri, Dec 18, 2009 at 11:42 AM, frer <fr...@polymtl.ca> wrote:

>
> Thanks for your answer,
>
> I didn't think that using:          Document doc irHourly.doc(<id in hit>);
> would be much faster than using the searcher. I will try that.
>
> I have one question though: what is the <id in hit> you reffer to.  Since I
> have searched in the daily index, what is the corresponding hit on the
> hourly index if i haven't searched in the hourly index yet?
>
> Thanks for your help,
>
> Francois
>
> PS: Concerning your idea to merge the hourly and daily index, I did think
> about that but the quantity of info I have is already way too large in both
> indexes (about 100 fields in each) to repeat that information so many
> times.
> My Index is already of 10GB.
>
>
>
> Erick Erickson wrote:
> >
> > Well, making a large OR clause is definitely more efficient than making
> > N different requests, but you would have to search the results. It
> doesn't
> > sound very performant.
> >
> > Could you go to 50,000 ids? yes, but you have to fiddle with
> > setMaxClauseCount
> > because Lucene defaults to a max of 1,024.
> >
> > There's no way I know of to emulate a DB join. Usually, whenever I try
> > the answer is to flatten my data but I don't see a good way to do that
> > in this case.
> >
> > Hmmm, what do you think would happen if you opened an IndexReader
> > into your Hourly index straight from your Daily query? Something like:
> >
> > IndexReader irHourly = <open index reader to Hourly>.
> > for (each hit in Daily) {
> >     for (each id in this Hourly doc) {
> >          Document doc irHourly.doc(<id in hit>);
> >          addToResponse(doc); // your method here.....
> >     }
> > }
> >
> >
> > You *probably* want to keep the Hourly reader open between requests, but
> > since the above isn't searching, you *might* be able to open it each
> time.
> > I'd
> > go for keeping it open between requests if at all possible.
> >
> > And here's a wild and crazy idea. Remember that Lucene documents don't
> > require that *any* fields be in common. It might make your management
> > easier if you *combined* the indexes. Crudely, prefix each Daily field
> > with
> > "D_" and each hourly field with "H_" and put 'em all in the same index.
> > I'm
> > not claiming that's a good solution in your case, but I thought I'd
> > mention
> > it
> > as a possibility. You still can't do joins on them though.....
> >
> > Erick
> >
> >
> >
> > On Fri, Dec 18, 2009 at 9:08 AM, François Eric
> > <fr...@jarca.ca>wrote:
> >
> >> Hello,
> >>
> >> I have a performance problem and would need expert advice on how to go
> >> about fixing it:
> >>
> >> I currently have 2 indexes: Daily and Hourly.  The Daily index contains
> >> about 1,000,000 documents and my Hourly index approximately: 24,000,000
> >> documents.  My Daily index contains many fields and some of them are IDs
> >> to
> >> my Hourly Index.
> >> What I want to do is fetch data in one request (if possible).
> >> Right now I do it in many requests:
> >> 1- Get the matching Daily documents (say it returns 500 documents)
> >> 2- For each of these documents, locate the Hourly Index Id and fetch it.
> >>
> >> Therefore I make 501 requests to lucene.  This causes some performance
> >> issues I guess because of the overhead to  making a request to Lucene.
> >>
> >> Is it possible to do this in 1 request?  I'm thinking no because I'm not
> >> sure what the result set would be but maybe I'm missing something.
> >>
> >> If not I guess it would be possible to build a query with my 500 hourly
> >> ids
> >> and make a OR between them to make it in 2 requests....but then I have
> to
> >> find the matching documents.  Will this overflow if I have 50000 ids in
> >> my
> >> query?
> >>
> >> Anyway, I just want advice on how one would address this situation.
> >>
> >> Thank you very much,
> >>
> >> François
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Query-joining-2-indexes-tp26843980p26846129.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Query joining 2 indexes

Posted by frer <fr...@polymtl.ca>.

Thanks for your answer,

I didn't think that using:          Document doc irHourly.doc(<id in hit>);
would be much faster than using the searcher. I will try that.

I have one question though: what is the <id in hit> you reffer to.  Since I
have searched in the daily index, what is the corresponding hit on the
hourly index if i haven't searched in the hourly index yet?

Thanks for your help,

Francois

PS: Concerning your idea to merge the hourly and daily index, I did think
about that but the quantity of info I have is already way too large in both
indexes (about 100 fields in each) to repeat that information so many times. 
My Index is already of 10GB.



Erick Erickson wrote:
> 
> Well, making a large OR clause is definitely more efficient than making
> N different requests, but you would have to search the results. It doesn't
> sound very performant.
> 
> Could you go to 50,000 ids? yes, but you have to fiddle with
> setMaxClauseCount
> because Lucene defaults to a max of 1,024.
> 
> There's no way I know of to emulate a DB join. Usually, whenever I try
> the answer is to flatten my data but I don't see a good way to do that
> in this case.
> 
> Hmmm, what do you think would happen if you opened an IndexReader
> into your Hourly index straight from your Daily query? Something like:
> 
> IndexReader irHourly = <open index reader to Hourly>.
> for (each hit in Daily) {
>     for (each id in this Hourly doc) {
>          Document doc irHourly.doc(<id in hit>);
>          addToResponse(doc); // your method here.....
>     }
> }
> 
> 
> You *probably* want to keep the Hourly reader open between requests, but
> since the above isn't searching, you *might* be able to open it each time.
> I'd
> go for keeping it open between requests if at all possible.
> 
> And here's a wild and crazy idea. Remember that Lucene documents don't
> require that *any* fields be in common. It might make your management
> easier if you *combined* the indexes. Crudely, prefix each Daily field
> with
> "D_" and each hourly field with "H_" and put 'em all in the same index.
> I'm
> not claiming that's a good solution in your case, but I thought I'd
> mention
> it
> as a possibility. You still can't do joins on them though.....
> 
> Erick
> 
> 
> 
> On Fri, Dec 18, 2009 at 9:08 AM, François Eric
> <fr...@jarca.ca>wrote:
> 
>> Hello,
>>
>> I have a performance problem and would need expert advice on how to go
>> about fixing it:
>>
>> I currently have 2 indexes: Daily and Hourly.  The Daily index contains
>> about 1,000,000 documents and my Hourly index approximately: 24,000,000
>> documents.  My Daily index contains many fields and some of them are IDs
>> to
>> my Hourly Index.
>> What I want to do is fetch data in one request (if possible).
>> Right now I do it in many requests:
>> 1- Get the matching Daily documents (say it returns 500 documents)
>> 2- For each of these documents, locate the Hourly Index Id and fetch it.
>>
>> Therefore I make 501 requests to lucene.  This causes some performance
>> issues I guess because of the overhead to  making a request to Lucene.
>>
>> Is it possible to do this in 1 request?  I'm thinking no because I'm not
>> sure what the result set would be but maybe I'm missing something.
>>
>> If not I guess it would be possible to build a query with my 500 hourly
>> ids
>> and make a OR between them to make it in 2 requests....but then I have to
>> find the matching documents.  Will this overflow if I have 50000 ids in
>> my
>> query?
>>
>> Anyway, I just want advice on how one would address this situation.
>>
>> Thank you very much,
>>
>> François
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://old.nabble.com/Query-joining-2-indexes-tp26843980p26846129.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Query joining 2 indexes

Posted by Erick Erickson <er...@gmail.com>.

Well, making a large OR clause is definitely more efficient than making
N different requests, but you would have to search the results. It doesn't
sound very performant.

Could you go to 50,000 ids? yes, but you have to fiddle with
setMaxClauseCount
because Lucene defaults to a max of 1,024.

There's no way I know of to emulate a DB join. Usually, whenever I try
the answer is to flatten my data but I don't see a good way to do that
in this case.

Hmmm, what do you think would happen if you opened an IndexReader
into your Hourly index straight from your Daily query? Something like:

IndexReader irHourly = <open index reader to Hourly>.
for (each hit in Daily) {
    for (each id in this Hourly doc) {
         Document doc irHourly.doc(<id in hit>);
         addToResponse(doc); // your method here.....
    }
}

You *probably* want to keep the Hourly reader open between requests, but
since the above isn't searching, you *might* be able to open it each time.
I'd
go for keeping it open between requests if at all possible.

And here's a wild and crazy idea. Remember that Lucene documents don't
require that *any* fields be in common. It might make your management
easier if you *combined* the indexes. Crudely, prefix each Daily field with
"D_" and each hourly field with "H_" and put 'em all in the same index. I'm
not claiming that's a good solution in your case, but I thought I'd mention
it
as a possibility. You still can't do joins on them though.....

Erick

On Fri, Dec 18, 2009 at 9:08 AM, François Eric <fr...@jarca.ca>wrote:

> Hello,
>
> I have a performance problem and would need expert advice on how to go
> about fixing it:
>
> I currently have 2 indexes: Daily and Hourly.  The Daily index contains
> about 1,000,000 documents and my Hourly index approximately: 24,000,000
> documents.  My Daily index contains many fields and some of them are IDs to
> my Hourly Index.
> What I want to do is fetch data in one request (if possible).
> Right now I do it in many requests:
> 1- Get the matching Daily documents (say it returns 500 documents)
> 2- For each of these documents, locate the Hourly Index Id and fetch it.
>
> Therefore I make 501 requests to lucene.  This causes some performance
> issues I guess because of the overhead to  making a request to Lucene.
>
> Is it possible to do this in 1 request?  I'm thinking no because I'm not
> sure what the result set would be but maybe I'm missing something.
>
> If not I guess it would be possible to build a query with my 500 hourly ids
> and make a OR between them to make it in 2 requests....but then I have to
> find the matching documents.  Will this overflow if I have 50000 ids in my
> query?
>
> Anyway, I just want advice on how one would address this situation.
>
> Thank you very much,
>
> François
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Query joining 2 indexes

Posted by frer <fr...@polymtl.ca>.

Just re-read my post and I don't think it was clear.

The algorithm I ended up doing is:

for all daily data 
        gather hourly ids in a set
        build map with placeholders for hourly values

    get hourly documents from set

    for all daily data
        insert hourly data from documents fetched in placeholders

So the method I showed below is the one to "get hourly documents from set"

François



François Eric wrote:
> 
> Hello,
> 
> I have a performance problem and would need expert advice on how to go 
> about fixing it:
> 
> I currently have 2 indexes: Daily and Hourly.  The Daily index contains 
> about 1,000,000 documents and my Hourly index approximately: 24,000,000 
> documents.  My Daily index contains many fields and some of them are IDs 
> to my Hourly Index.
> What I want to do is fetch data in one request (if possible).
> Right now I do it in many requests:
> 1- Get the matching Daily documents (say it returns 500 documents)
> 2- For each of these documents, locate the Hourly Index Id and fetch it.
> 
> Therefore I make 501 requests to lucene.  This causes some performance 
> issues I guess because of the overhead to  making a request to Lucene.
> 
> Is it possible to do this in 1 request?  I'm thinking no because I'm not 
> sure what the result set would be but maybe I'm missing something.
> 
> If not I guess it would be possible to build a query with my 500 hourly 
> ids and make a OR between them to make it in 2 requests....but then I 
> have to find the matching documents.  Will this overflow if I have 50000 
> ids in my query?
> 
> Anyway, I just want advice on how one would address this situation.
> 
> Thank you very much,
> 
> François
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Query-joining-2-indexes-tp26843980p26848479.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org