Posted to java-user@lucene.apache.org by Benson Margulies <bi...@gmail.com> on 2012/02/20 03:05:24 UTC

Here a merge thread, there a merge thread ...

A long-running program of mine (which Uwe's read a model of) slowly
keeps adding merge threads. I count 22 at the moment. Each one shows
up, runs for a bit, and then goes to sleep, seemingly forever. I
don't do anything explicit to control merging behavior.

They name themselves "Lucene Merge Thread #xxx" where xxx is a
non-contiguous but ever-growing number.
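
One way to check whether such threads are really asleep or have already exited is to take a snapshot of the live threads and filter by name. The following is a minimal sketch using only the JDK; the class name MergeThreadReport is hypothetical, and the prefix is simply the thread name quoted above.

```java
import java.util.Map;
import java.util.TreeMap;

final class MergeThreadReport {
    /** Name prefix taken from the thread names quoted in the post. */
    static final String PREFIX = "Lucene Merge Thread";

    /** Returns name -> state for every live thread whose name starts with PREFIX. */
    static Map<String, Thread.State> snapshot() {
        Map<String, Thread.State> report = new TreeMap<>();
        // getAllStackTraces() only lists threads that are still alive,
        // so a merge thread that has already exited will not appear here.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(PREFIX)) {
                report.put(t.getName(), t.getState());
            }
        }
        return report;
    }
}
```

Calling MergeThreadReport.snapshot() periodically and logging the result would show whether the count keeps growing and which state (RUNNABLE, WAITING, TIMED_WAITING, ...) each thread is in.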

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Can I detect incorrect language selection after creating an index?

Posted by Glen Newton <gl...@gmail.com>.

Do the check _before_ indexing.
Use https://code.google.com/p/language-detection/ to verify the
language of the text document before you put it in the index.

-Glen Newton
http://zzzoot.blogspot.com/
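
For the coarser case raised in the question (Latin vs Cyrillic vs Arabic vs Chinese), a pre-indexing check can be sketched with the JDK alone. This is not the language-detection library linked above; it only counts letters per Unicode script, so it cannot tell English from Spanish. The class name ScriptCheck is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptCheck {
    /** Counts letters per Unicode script, ignoring digits, punctuation and whitespace. */
    static Map<UnicodeScript, Integer> histogram(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);          // step over surrogate pairs correctly
            if (Character.isLetter(cp)) {
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the script with the most letters, or null when the text has no letters. */
    static UnicodeScript dominantScript(String text) {
        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : histogram(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Comparing dominantScript(text) with the script the chosen analyzer expects (for example LATIN for an English analyzer) would flag the most obvious mismatches before the document reaches the index.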

On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin <iz...@caci.com> wrote:
> Suppose I have a bunch of text documents in language X but I index them using an analyzer for language Y. Once the index is created, is it possible to perform some sort of simple "sanity" check to see if the original language selection was wrong? I presume I can try searching for some common word in language Y, but I am not sure how reliable this would be. On the other hand, if languages are from the same group, say X and Y are English and Spanish, I should expect that this sanity check would produce a false match. However, I would be happy if it worked reliably enough for languages using different scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc.
>
>
> Thanks much
>
>
>
> Ilya Zavorin




Can I detect incorrect language selection after creating an index?

Posted by Ilya Zavorin <iz...@caci.com>.

Suppose I have a bunch of text documents in language X but I index them using an analyzer for language Y. Once the index is created, is it possible to perform some sort of simple "sanity" check to see if the original language selection was wrong? I presume I can try searching for some common word in language Y, but I am not sure how reliable this would be. On the other hand, if languages are from the same group, say X and Y are English and Spanish, I should expect that this sanity check would produce a false match. However, I would be happy if it worked reliably enough for languages using different scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc.


Thanks much



Ilya Zavorin
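
The "common word in language Y" idea can be sketched as a stopword ratio. The sketch below is plain Java and works on raw text rather than on the index, on the assumption that an analyzer for language Y may already drop Y's own stopwords at index time. The class name StopwordRatio, the stopword set, and any threshold applied to the result are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

final class StopwordRatio {
    /** Fraction of letter-only tokens in the text that appear in the given (lower-cased) stopword set. */
    static double ratio(String text, Set<String> stopwords) {
        int total = 0;
        int hits = 0;
        // Split on anything that is not a Unicode letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;                 // leading separator yields one empty token
            }
            total++;
            if (stopwords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

A ratio near zero for language Y's stopwords would suggest the text is not in Y. As the question anticipates, closely related languages share some short words, so this is only a rough signal.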

Re: [Bulk] can I make incremental index/search more efficient?

Posted by Ganesh <em...@yahoo.co.in>.

You need to follow the second method. Loop over all the available docs and, for each one, check whether it is already in the index; if not, index it. Then run the search for your list of words. Store the document name and its modified date/time as fields in the index. That lets you restrict a search to one particular document, or to documents indexed after a certain date.

Regards
Ganesh
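
The bookkeeping described above can be sketched in plain Java as a stand-in for the index: remember each document's name and modified time, skip documents already seen, and scan only the new one for hot words. The class IncrementalHotWords and its methods are hypothetical; in Lucene the equivalent would roughly be a name field and a date field on each document, with the hot-word query restricted by a required clause on the name field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class IncrementalHotWords {
    private final Set<String> hotWords;                         // must already be lower-cased
    private final Map<String, Long> indexedAt = new HashMap<>(); // doc name -> modified time
    // doc name -> hot word -> token positions within that doc
    private final Map<String, Map<String, List<Integer>>> hits = new LinkedHashMap<>();

    IncrementalHotWords(Set<String> hotWords) {
        this.hotWords = hotWords;
    }

    /** Scans the document only if it is new or modified; returns true when work was done. */
    boolean addIfNew(String name, long modified, String text) {
        Long seen = indexedAt.get(name);
        if (seen != null && seen >= modified) {
            return false;                                       // already processed, nothing to redo
        }
        Map<String, List<Integer>> perWord = new HashMap<>();
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (hotWords.contains(token)) {
                perWord.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
            }
            pos++;
        }
        indexedAt.put(name, modified);
        hits.put(name, perWord);                                // earlier docs are left untouched
        return true;
    }

    /** Hits for one document only; empty map if the document is unknown. */
    Map<String, List<Integer>> hitsFor(String name) {
        return hits.getOrDefault(name, new HashMap<>());
    }
}
```

With this shape, adding document i costs one pass over document i, and the results for documents 1 through i-1 are reused as they are.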

----- Original Message ----- 
From: "Ilya Zavorin" <iz...@caci.com>
To: <ja...@lucene.apache.org>
Sent: Wednesday, February 22, 2012 2:39 AM
Subject: [Bulk] can I make incremental index/search more efficient?


>I have a fairly straightforward task: I have a collection of N documents and a set of "hot" words. I need to find all occurrences of these words in all the docs.
> 
> 
> 
> The original use case was that I would get all the docs at once. In this case, I:
> 
> 1. Create a single index for all the docs
> 
> 2. Loop over all hot words. For each word, I find all hits in all the docs
> 
> 3. I collect and rearrange the hit info to have all hits for each of the indexed docs
> 
> 
> 
> However, it looks like there might be a different use case: the user might want to add one document at a time to the collection and see the search results immediately. So for this case I am now doing the following:
> 
> 1. Loop over docs i = 1 : N. For each doc:
> 
> 1.1 If i == 1 then create index else update index
> 
> 1.2 Loop over all hot words. For each word, find all hits in all the docs that have been indexed so far, i.e. docs 1 through i
> 
> 1.3 Collect and rearrange
> 
> 
> 
> Of course, this is not particularly efficient, especially because I am forced to do a lot of redundant work by searching through docs 1:i instead of just i at each iteration. This is because, if I understand it correctly, I can't specify "search only the part of index that corresponds to doc X". Or can I?
> 
> 
> 
> Is there any way to make this incremental index/search more efficient? For instance, is it at all possible to restrict where in the index a search for hits is performed? Or any other optimization?
> 
> 
> 
> Thanks much
> 
> 
> 
> Ilya Zavorin
>


can I make incremental index/search more efficient?

Posted by Ilya Zavorin <iz...@caci.com>.

I have a fairly straightforward task: I have a collection of N documents and a set of "hot" words. I need to find all occurrences of these words in all the docs.



The original use case was that I would get all the docs at once. In this case, I:

1. Create a single index for all the docs

2. Loop over all hot words. For each word, I find all hits in all the docs

3. I collect and rearrange the hit info to have all hits for each of the indexed docs



However, it looks like there might be a different use case: the user might want to add one document at a time to the collection and see the search results immediately. So for this case I am now doing the following:

1. Loop over docs i = 1 : N. For each doc:

1.1 If i == 1 then create index else update index

1.2 Loop over all hot words. For each word, find all hits in all the docs that have been indexed so far, i.e. docs 1 through i

1.3 Collect and rearrange



Of course, this is not particularly efficient, especially because I am forced to do a lot of redundant work by searching through docs 1:i instead of just i at each iteration. This is because, if I understand it correctly, I can't specify "search only the part of index that corresponds to doc X". Or can I?



Is there any way to make this incremental index/search more efficient? For instance, is it at all possible to restrict where in the index a search for hits is performed? Or any other optimization?



Thanks much



Ilya Zavorin

RE: Here a merge thread, there a merge thread ...

Posted by Uwe Schindler <uw...@thetaphi.de>.

Lance,

There is no TieredMergeScheduler. You are confusing MergeScheduler with MergePolicy. TieredMergePolicy is new, but it has nothing to do with the problem here. The MergeScheduler executes the merges serially (SerialMergeScheduler) or in parallel (ConcurrentMergeScheduler, the default for a long time). The MergePolicy simply determines under which conditions and how segments are merged.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
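
The split between the two roles can be illustrated with a toy model in plain Java. None of the names below are Lucene classes; MergeModel and its methods are hypothetical, and a "merge" is reduced to summing segment sizes. The policy part decides which segments go together; the scheduler part decides only how those merges are executed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class MergeModel {
    /** Toy policy: decides WHICH segments to merge (groups of `factor` adjacent segments). */
    static List<List<Integer>> selectMerges(List<Integer> segmentSizes, int factor) {
        List<List<Integer>> merges = new ArrayList<>();
        for (int i = 0; i + factor <= segmentSizes.size(); i += factor) {
            merges.add(new ArrayList<>(segmentSizes.subList(i, i + factor)));
        }
        return merges;
    }

    /** Toy serial scheduler: runs each merge on the calling thread, one after another. */
    static List<Integer> runSerial(List<List<Integer>> merges) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> m : merges) {
            merged.add(sum(m));
        }
        return merged;
    }

    /** Toy concurrent scheduler: runs each merge on its own short-lived thread. */
    static List<Integer> runConcurrent(List<List<Integer>> merges) throws InterruptedException {
        List<Integer> merged = new CopyOnWriteArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < merges.size(); i++) {
            List<Integer> m = merges.get(i);
            Thread t = new Thread(() -> merged.add(sum(m)), "Toy Merge Thread #" + i);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();                     // each thread exits once its merge is done
        }
        return merged;                    // completion order, not necessarily submission order
    }

    private static int sum(List<Integer> xs) {
        int s = 0;
        for (int x : xs) {
            s += x;
        }
        return s;
    }
}
```

Both schedulers produce the same merged segments from the same policy output; only the threading differs, which is why swapping the policy would not change how many merge threads appear.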

> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Sunday, February 26, 2012 1:05 AM
> To: java-user@lucene.apache.org
> Subject: Re: Here a merge thread, there a merge thread ...
> 
> Solr uses TieredMergeScheduler by default now. You might find this works
> more smoothly.

Re: Here a merge thread, there a merge thread ...

Posted by Lance Norskog <go...@gmail.com>.

Solr uses TieredMergeScheduler by default now. You might find this
works more smoothly.

On Fri, Feb 24, 2012 at 10:03 AM, Benson Margulies
<bi...@gmail.com> wrote:
> On Fri, Feb 24, 2012 at 10:59 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> This is from ConcurrentMergeScheduler (the default MergeScheduler).
>>
>> But, are you sure the threads are sleeping, not exiting?  (They should
>> be exiting).
>>
>> This merge scheduler starts a new thread when a merge is needed,
>> allows that thread to do another merge (if one is immediately
>> available), else the thread exits.
>
> They seem to exit eventually, but not quite as soon as they arrive.

-- 
Lance Norskog
goksron@gmail.com


Re: Here a merge thread, there a merge thread ...

Posted by Benson Margulies <bi...@gmail.com>.

On Fri, Feb 24, 2012 at 10:59 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> This is from ConcurrentMergeScheduler (the default MergeScheduler).
>
> But, are you sure the threads are sleeping, not exiting?  (They should
> be exiting).
>
> This merge scheduler starts a new thread when a merge is needed,
> allows that thread to do another merge (if one is immediately
> available), else the thread exits.

They seem to exit eventually, but not quite as soon as they arrive.



Re: Here a merge thread, there a merge thread ...

Posted by Michael McCandless <lu...@mikemccandless.com>.

This is from ConcurrentMergeScheduler (the default MergeScheduler).

But, are you sure the threads are sleeping, not exiting?  (They should
be exiting).

This merge scheduler starts a new thread when a merge is needed,
allows that thread to do another merge (if one is immediately
available), else the thread exits.

Mike McCandless

http://blog.mikemccandless.com

On Sun, Feb 19, 2012 at 9:05 PM, Benson Margulies <bi...@gmail.com> wrote:
> A long-running program of mine (which Uwe's read a model of) slowly
> keeps adding merge threads. I count 22 at the moment. Each one shows
> up, runs for a bit, and then goes to sleep, seemingly forever. I
> don't do anything explicit to control merging behavior.
>
> They name themselves "Lucene Merge Thread #xxx" where xxx is a
> non-contiguous but ever-growing number.