Posted to java-user@lucene.apache.org by Kai Hu <ka...@dusee.cn> on 2007/08/10 12:16:33 UTC

Re: Re: Lucene in large database contexts

Antonello,
	You are right. I think Lucene's IndexSearcher will keep searching the old information as long as the IndexWriter has not been closed (I think Lucene releases the lock at that point), so I add only a few documents at a time from a buffer to get "real time" indexing.

kai
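
A minimal sketch of the buffering approach described above, in plain Java with no Lucene dependency. `BatchingIndexer` and its `commitBatch` hook are hypothetical names, not Lucene API; in a real setup the hook would presumably open an IndexWriter, call addDocument for each buffered entry, close the writer, and then reopen the searcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers incoming documents and hands them to the index in small batches. */
final class BatchingIndexer<D> {
    private final int batchSize;
    // Hypothetical hook: open the writer, add every document, close the writer.
    private final Consumer<List<D>> commitBatch;
    private final List<D> buffer = new ArrayList<>();

    BatchingIndexer(int batchSize, Consumer<List<D>> commitBatch) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.commitBatch = commitBatch;
    }

    /** Queue one document; commit as soon as the buffer reaches batchSize. */
    synchronized void add(D doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Commit whatever is buffered; calling this on a timer avoids stranding a partial batch. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        commitBatch.accept(new ArrayList<>(buffer)); // copy, so the hook never sees later additions
        buffer.clear();
    }

    /** Number of documents waiting for the next commit. */
    synchronized int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BatchingIndexer<String> indexer =
            new BatchingIndexer<>(2, batch -> System.out.println("committing " + batch));
        indexer.add("doc-1");
        indexer.add("doc-2"); // reaches batchSize, so the hook runs
        indexer.add("doc-3");
        indexer.flush();      // commit the leftover document
    }
}
```

A smaller batch size shortens the delay before new entries become searchable, at the cost of more writer open/close cycles.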


From: antonelloprov@gmail.com [mailto:antonelloprov@gmail.com] On Behalf Of Antonello Provenzano
Sent: Friday, August 10, 2007 17:59
To: java-user@lucene.apache.org
Subject: Re: Re: Lucene in large database contexts

Kai,

Thanks. The problem I see is that although I can add a Document
through IndexWriter or IndexModifier, it won't be searchable until
the index is closed and, possibly, optimized, since the score of the
document within the index must be recalculated on the basis of
the whole index.

Is this assumption true, or am I completely wrong?

Cheers.
Antonello


On 8/10/07, Kai Hu <ka...@dusee.cn> wrote:
> Hi, Antonello
>         You can use IndexWriter.addDocument(Document document) to add a single document; the same applies to update and delete operations.
>
> kai
>
> -----Original Message-----
> From: Antonello Provenzano [mailto:antonelloprov@gmail.com]
> Sent: Friday, August 10, 2007 17:09
> To: java-user@lucene.apache.org
> Subject: Lucene in large database contexts
>
> Hi There!
>
> I've been working for a while on the implementation of a
> content-oriented website that will contain millions of entries, most of
> them indexable (descriptions, texts, names, etc.).
> The ideal way to make them searchable would be to use Lucene as the
> index and search engine.
>
> The reason I'm posting to the mailing list is the following: since all
> the entries will be stored in a database (most likely MySQL InnoDB or
> Oracle), what's the best technique for implementing a system that indexes
> the content in "real time" (e.g. when an entry is inserted into the
> database) and makes it searchable? Based on my understanding of Lucene,
> this is not possible, since the index must be re-created before the
> newly indexed contents can be searched. Is this true?
>
> Also, could anyone point me to a working example of how to
> implement something like this?
>
>
> Thank you for the support.
> Antonello
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Re: Re: Lucene in large database contexts

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

With Compass, indexing is tied to your database transaction: when
your object is persisted, it's indexed too.
All of these concerns are handled cleanly and silently by Compass. Just
have a look at the source code if you don't want to use the product.

M.

On 10 Aug 07, at 12:24, Antonello Provenzano wrote:

> Kai,
>
> The context I'm working in requires documents to be added to the
> indexes continuously, since the content is user-driven and has to be
> always up to date.
> That is the problem I'm facing: I cannot rebuild a 1 GB (at least)
> index every time a user inserts a new entry into the database.
>
> I know that Digg, for instance, uses Lucene as its search engine. Since
> the amount of data they're dealing with is much larger than mine, I
> would like to understand how they implemented this kind of
> solution.
>
> Thank you again.
> Antonello

Re: Re: Re: Lucene in large database contexts

Posted by Askar Zaidi <as...@gmail.com>.

Hey Guys,

I am trying to do something similar: make the content searchable as soon as
it is added to the website. The way it can work in my scenario is that I
create an index for every new user account.

Then, whenever a new document is uploaded, its contents are added to the
user's index using writer.addDocument(...).

As for closing the writer, yes! I'll close the writer and optimize after
the document is added to the index.

I really think this should work. Don't you?

thanks
AZ

On 8/10/07, Erick Erickson <er...@gmail.com> wrote:
>
> Well, closing/opening an index is MUCH less expensive than
> rebuilding the whole thing, so I don't understand part of your
> statements....
>
> It *may* (but I haven't tried it) be possible to flush the writer rather
> than close/open it. But, you MUST close/reopen the reader you search with
> even if flush works like I think it does.
>
> But it's also possible to use a two tiered approach. 1G isn't all that
> big. Could you read it into a RAMDir and use that for your searches? Then,
> when you add data, you add it to *both* indexes, but close/open the RAMdir
> for searching.
>
> It's also possible to keep the RAMdir as the delta between the FSdir and
> "current" states of your index. Add to both and search both. Although
> deletes may be a problem here.
>
> You haven't specified how often you expect changes, though. 100/second?
> 1/minute? How real is "real time"? You could do something like warm up
> a new reader in the background whenever you decided you needed to be
> absolutely up to date and swap your "live" reader for the newly warmed up
> one whenever you deemed it wise.
>
> Or you could just close/open your reader after each modification, fire off
> a couple of warmup queries at it and let the users live with slow responses
> if they happen to search before your warm-up queries completed.
>
> The point is that there are many options, but to suggest the best one, we
> need some throughput numbers and a better definition of what "real time"
> means. Is a one-minute delay acceptable? 10 seconds? A millisecond?
> The answer defines the scope of reasonable solutions.
>
> Best
> Erick

Re: Re: Re: Lucene in large database contexts

Posted by Erick Erickson <er...@gmail.com>.

Well, closing/opening an index is MUCH less expensive than
rebuilding the whole thing, so I don't understand part of your
statements....

It *may* (but I haven't tried it) be possible to flush the writer rather
than close/open it. But, you MUST close/reopen the reader you search with
even if flush works like I think it does.

But it's also possible to use a two tiered approach. 1G isn't all that big.
Could you read it into a RAMDir and use that for your searches? Then, when
you add data, you add it to *both* indexes, but close/open the RAMdir for
searching.

It's also possible to keep the RAMdir as the delta between the FSdir and
"current" states of your index. Add to both and search both. Although
deletes may be a problem here.
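
The delta idea can be sketched without Lucene, under loudly stated assumptions: `TwoTierIndex` is a hypothetical toy inverted index, with plain maps standing in for the FSdir index, the open reader's point-in-time view of it, and the RAMdir delta. It only covers additions; deletes are left out, for the reason given above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy two-tier index: a stale "disk" snapshot plus a small in-memory delta. */
final class TwoTierIndex {
    // Everything ever written (stands in for the on-disk index behind the writer).
    private final Map<String, Set<Integer>> onDisk = new HashMap<>();
    // What the currently open "disk reader" can see (a point-in-time copy).
    private Map<String, Set<Integer>> diskSnapshot = new HashMap<>();
    // Documents added since the snapshot was taken (stands in for the RAM delta).
    private final Map<String, Set<Integer>> ramDelta = new HashMap<>();

    /** Add to both tiers: the disk tier for durability, the delta for immediate visibility. */
    synchronized void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            onDisk.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            ramDelta.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Search both tiers and merge the hits. */
    synchronized Set<Integer> search(String term) {
        String key = term.toLowerCase();
        Set<Integer> hits =
            new TreeSet<>(diskSnapshot.getOrDefault(key, Collections.<Integer>emptySet()));
        hits.addAll(ramDelta.getOrDefault(key, Collections.<Integer>emptySet()));
        return hits;
    }

    /** Reopen the "disk reader": it now sees everything, so the delta can be dropped. */
    synchronized void reopenDiskReader() {
        Map<String, Set<Integer>> fresh = new HashMap<>();
        onDisk.forEach((term, ids) -> fresh.put(term, new TreeSet<>(ids)));
        diskSnapshot = fresh;
        ramDelta.clear();
    }

    public static void main(String[] args) {
        TwoTierIndex index = new TwoTierIndex();
        index.add(1, "Lucene index");
        index.reopenDiskReader();        // doc 1 is now served from the snapshot
        index.add(2, "Lucene search");   // doc 2 is served from the delta
        System.out.println(index.search("lucene"));
    }
}
```

The expensive reopen can then run on whatever schedule suits the main index, while new entries stay searchable through the delta in the meantime.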

You haven't specified how often you expect changes, though. 100/second?
1/minute? How real is "real time"? You could do something like warm up
a new reader in the background whenever you decided you needed to be
absolutely up to date and swap your "live" reader for the newly warmed up
one whenever you deemed it wise.
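
One way to read the warm-and-swap suggestion, as a library-agnostic sketch: `SearcherHolder` is a hypothetical helper, and `S` stands in for whatever searcher or reader type is in use. Searches always go through `current()`, while `refresh` can be called from a background thread.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Holds the "live" searcher and swaps in a warmed-up replacement atomically. */
final class SearcherHolder<S> {
    private final AtomicReference<S> live;

    SearcherHolder(S initial) {
        this.live = new AtomicReference<>(initial);
    }

    /** The searcher that incoming queries should use right now. */
    S current() {
        return live.get();
    }

    /**
     * Open a fresh searcher, warm it up, and only then publish it.
     * Queries keep hitting the old searcher until the swap happens.
     *
     * @return the previous searcher, which the caller is responsible for closing
     */
    S refresh(Supplier<S> openFresh, Consumer<S> warmUp) {
        S fresh = openFresh.get();   // e.g. reopen the index (assumption)
        warmUp.accept(fresh);        // e.g. run a few typical queries (assumption)
        return live.getAndSet(fresh);
    }

    public static void main(String[] args) {
        SearcherHolder<String> holder = new SearcherHolder<>("searcher-v1");
        String old = holder.refresh(
            () -> "searcher-v2",
            s -> System.out.println("warming " + s));
        System.out.println("retired " + old + ", live is " + holder.current());
    }
}
```

The caller gets the previous searcher back and should close it only once in-flight searches against it have finished; that bookkeeping is not shown here.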

Or you could just close/open your reader after each modification, fire off a
couple of warmup queries at it and let the users live with slow responses
if they happen to search before your warm-up queries completed.

The point is that there are many options, but to suggest the best one, we
need some throughput numbers and a better definition of what "real time"
means. Is a one-minute delay acceptable? 10 seconds? A millisecond?
The answer defines the scope of reasonable solutions.

Best
Erick


Re: Re: Re: Lucene in large database contexts

Posted by Antonello Provenzano <an...@deveel.com>.

Kai,

The context I'm working in requires documents to be added to the
indexes continuously, since the content is user-driven and has to be
always up to date.
That is the problem I'm facing: I cannot rebuild a 1 GB (at least)
index every time a user inserts a new entry into the database.

I know that Digg, for instance, uses Lucene as its search engine. Since
the amount of data they're dealing with is much larger than mine, I
would like to understand how they implemented this kind of
solution.

Thank you again.
Antonello

