Posted to solr-user@lucene.apache.org by Mark Fletcher <ma...@gmail.com> on 2010/03/06 15:17:25 UTC

index merge

Hi,

I have a question regarding index merging:

I have set up 2 cores COREX and COREY.
COREX - always serves user requests
COREY - gets updated with the latest values (dataDir is in a different
location from COREX)
Once COREY has been updated with the latest data, I tried merging coreX and
coreY so that both cores hold the latest data, and the user who always
queries COREX sees it. Please find below the approaches I tried and the
commands used.

I tried these merges:
COREX = COREX and COREY merged
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreX/data/index&indexDir=/opt1/solr/coreY/data/index'

COREX = COREY and COREY merged
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreY/data/index'

COREX = COREY and COREA merged (COREA just contains the initial 2 seed
segments.. a dummy core)
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreA/data/index'

When I check the record count in COREX and COREY, COREX always contains
about double what COREY has. Is everything fine here, with only the record
count differing, or is something wrong?
Note: I have only 2 cores here, and I tried the X=X+Y, X=Y+Y and X=Y+A
approaches, where A is a dummy index. The record counts have never matched
after the merge.

Can someone please help me understand why this record count difference
occurs, and whether there is anything fundamentally wrong in my approach?

Thanks and Rgds,
Mark.

Re: index merge

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Mark,

On Mon, Mar 8, 2010 at 9:23 PM, Mark Fletcher
<ma...@gmail.com> wrote:

>
> My main purpose of having 2 identical cores
> COREX - always serves user request
> COREY - every day once, takes the updates/latest data and passess it on to
> COREX.
> is:-
>
> Suppose say I have only one COREY and suppose a request comes to COREY
> while the update of the latest data is happening on to it. Wouldn't it
> degrade performance of the requests at that point of time?
>

The thing to note is that both reads and writes are happening on the same
box. So when you swap cores, the OS has to cache the hot segments of the new
(inactive) index. If you were just re-opening the same (active) index, at
least some of the existing files could remain in the OS's file cache. I
think that may just degrade performance further so you should definitely
benchmark before going through with this.

The best practice is to use a master/slave architecture and separate the
writes and reads.
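
As a rough sketch of what separating writes and reads can look like from the command line: assuming Solr 1.4's HTTP replication handler is configured on both machines, a slave can be asked to pull the master's index on demand. The host names master-host and slave-host are placeholders, and the /solr/replication path assumes a default single-core layout. The script only prints the curl commands, since running them needs live servers.

```shell
#!/bin/sh
# Hypothetical host names; replace with your own machines.
MASTER="http://master-host:8983/solr"
SLAVE="http://slave-host:8983/solr"

# Build a replication-handler URL from a base URL and a command name.
replication_url() {
  printf '%s/replication?command=%s\n' "$1" "$2"
}

# Ask the master which index version it is serving.
echo "curl '$(replication_url "$MASTER" indexversion)'"
# Ask the slave to fetch the latest index from its configured master.
echo "curl '$(replication_url "$SLAVE" fetchindex)'"
```

With this layout the indexing load stays on the master, and the slave only pays the cost of copying changed segment files.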

> So I was planning to keep COREX and COREY always identical. Once COREY has
> the latest it should somehow sync with COREX so that COREX also now has the
> latest. COREY keeps on getting the updates at a particular time of day and
> it will again pass it on to COREX. This process continues everyday.
>

You could use the same approach that Solr 1.3's snapinstaller script used.
It deletes the files and creates hard links to the new index files.
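
The delete-then-hard-link idea can be illustrated with a small, self-contained sketch. This is not the snapinstaller script itself, only a simplified imitation of the step described above; the directory and file names are made up, and it works on a temporary directory rather than a real index.

```shell
#!/bin/sh
# Scratch area standing in for a Solr data directory (hypothetical layout).
work=$(mktemp -d)
mkdir "$work/snapshot" "$work/index"
echo "new segment data" > "$work/snapshot/_1.cfs"
echo "stale segment data" > "$work/index/_0.cfs"

# Remove the old index files, then hard-link every snapshot file into place.
rm -f "$work/index"/*
for f in "$work/snapshot"/*; do
  ln "$f" "$work/index/$(basename "$f")"
done

ls "$work/index"
```

Hard links avoid copying the data: both directory entries point at the same file on disk, which is why the snapshot and the index must live on the same filesystem.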

-- 
Regards,
Shalin Shekhar Mangar.

Re: index merge

Posted by Mark Fletcher <ma...@gmail.com>.

Hi All,

Thank you for the very valuable suggestions.
I am planning to try using the Master - Slave configuration.

Best Rgds,
Mark.

On Mon, Mar 8, 2010 at 11:17 AM, Mark Miller <ma...@gmail.com> wrote:

> On 03/08/2010 10:53 AM, Mark Fletcher wrote:
>
>> Hi Shalin,
>>
>> Thank you for the mail.
>> My main purpose of having 2 identical cores
>> COREX - always serves user request
>> COREY - every day once, takes the updates/latest data and passess it on to
>> COREX.
>> is:-
>>
>> Suppose say I have only one COREY and suppose a request comes to COREY
>> while
>> the update of the latest data is happening on to it. Wouldn't it degrade
>> performance of the requests at that point of time?
>>
>>
> Yes - but your not going to help anything by using two indexes - best you
> can do it use two boxes. 2 indexes on the same box will actually
> be worse than one if they are identical and you are swapping between them.
> Writes on an index will not affect reads in the way you are thinking - only
> in that its uses IO and CPU that the read process cant. Thats going to
> happen with 2 indexes on the same box too - except now you have way more
> data to cache and flip between, and you can't take any advantage of things
> just being written possibly being in the cache for reads.
>
> Lucene indexes use a write once strategy - when writing new segments, you
> are not touching the segments being read from. Lucene is already doing the
> index juggling for you at the segment level.
>
>
> So I was planning to keep COREX and COREY always identical. Once COREY has
>> the latest it should somehow sync with COREX so that COREX also now has
>> the
>> latest. COREY keeps on getting the updates at a particular time of day and
>> it will again pass it on to COREX. This process continues everyday.
>>
>> What is the best possible way to implement this?
>>
>> Thanks,
>>
>> Mark.
>>
>>
>> On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar<
>> shalinmangar@gmail.com>  wrote:
>>
>>
>>
>>> Hi Mark,
>>>
>>>  On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher<
>>> mark.fletcher2004@gmail.com>  wrote:
>>>
>>>
>>>
>>>> I ran the SWAP command. Now:-
>>>> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX
>>>> has the latest.
>>>> Again, COREY (on which the update regularly runs) is pointing to the old
>>>> index of COREX. So this now doesnt have the most updated index.
>>>>
>>>> Now shouldn't I update the index of COREY (now pointing to the old
>>>> COREX)
>>>> so that it has the latest footprint as in COREX (having the latest COREY
>>>> index)so that when the update again happens to COREY, it has the latest
>>>> and
>>>> I again do the SWAP.
>>>>
>>>> Is a physical copying of the index  named COREY (the latest and now
>>>> datDir
>>>> of COREX after SWAP) to the index COREX  (now the dataDir of COREY.. the
>>>> orginal non-updated index of COREX) the best way for this or is there
>>>> any
>>>> other better option.
>>>>
>>>> Once again, later when COREY is again updated with the latest, I will
>>>> run
>>>> the SWAP again and it will be fine with COREX again pointing to its
>>>> original
>>>> dataDir (now the updated one).So every even SWAP command run will point
>>>> COREX back to its original dataDir. (same case with COREY).
>>>>
>>>> My only concern is after the SWAP is done, updating the old index (which
>>>> was serving previously and now replaced by the new index). What is the
>>>> best
>>>> way to do that? Physically copy the latest index to the old one and make
>>>> it
>>>> in sync with the latest one so that by the time it is to get the latest
>>>> updates it has the latest in it so that the new ones can be added to
>>>> this
>>>> and it becomes the latest and is again swapped?
>>>>
>>>>
>>>>
>>> Perhaps it is best if we take a step back and understand why you need two
>>> identical cores?
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>>>
>>>
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>

Re: index merge

Posted by Mark Miller <ma...@gmail.com>.

On 03/08/2010 10:53 AM, Mark Fletcher wrote:
> Hi Shalin,
>
> Thank you for the mail.
> My main purpose of having 2 identical cores
> COREX - always serves user request
> COREY - every day once, takes the updates/latest data and passess it on to
> COREX.
> is:-
>
> Suppose say I have only one COREY and suppose a request comes to COREY while
> the update of the latest data is happening on to it. Wouldn't it degrade
> performance of the requests at that point of time?
>    
Yes - but you're not going to help anything by using two indexes - the best
you can do is use two boxes. 2 indexes on the same box will actually
be worse than one if they are identical and you are swapping between
them. Writes on an index will not affect reads in the way you are
thinking - only in that they use I/O and CPU that the read process then can't.
That's going to happen with 2 indexes on the same box too - except now
you have way more data to cache and flip between, and you can't take any
advantage of freshly written data possibly already being in the cache for
reads.

Lucene indexes use a write-once strategy - when writing new segments,
you are not touching the segments being read from. Lucene is already
doing the index juggling for you at the segment level.

> So I was planning to keep COREX and COREY always identical. Once COREY has
> the latest it should somehow sync with COREX so that COREX also now has the
> latest. COREY keeps on getting the updates at a particular time of day and
> it will again pass it on to COREX. This process continues everyday.
>
> What is the best possible way to implement this?
>
> Thanks,
>
> Mark.
>
>
> On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar<
> shalinmangar@gmail.com>  wrote:
>
>    
>> Hi Mark,
>>
>>   On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher<
>> mark.fletcher2004@gmail.com>  wrote:
>>
>>      
>>> I ran the SWAP command. Now:-
>>> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX
>>> has the latest.
>>> Again, COREY (on which the update regularly runs) is pointing to the old
>>> index of COREX. So this now doesnt have the most updated index.
>>>
>>> Now shouldn't I update the index of COREY (now pointing to the old COREX)
>>> so that it has the latest footprint as in COREX (having the latest COREY
>>> index)so that when the update again happens to COREY, it has the latest and
>>> I again do the SWAP.
>>>
>>> Is a physical copying of the index  named COREY (the latest and now datDir
>>> of COREX after SWAP) to the index COREX  (now the dataDir of COREY.. the
>>> orginal non-updated index of COREX) the best way for this or is there any
>>> other better option.
>>>
>>> Once again, later when COREY is again updated with the latest, I will run
>>> the SWAP again and it will be fine with COREX again pointing to its original
>>> dataDir (now the updated one).So every even SWAP command run will point
>>> COREX back to its original dataDir. (same case with COREY).
>>>
>>> My only concern is after the SWAP is done, updating the old index (which
>>> was serving previously and now replaced by the new index). What is the best
>>> way to do that? Physically copy the latest index to the old one and make it
>>> in sync with the latest one so that by the time it is to get the latest
>>> updates it has the latest in it so that the new ones can be added to this
>>> and it becomes the latest and is again swapped?
>>>
>>>        
>> Perhaps it is best if we take a step back and understand why you need two
>> identical cores?
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>      
>    


-- 
- Mark

http://www.lucidimagination.com

Re: index merge

Posted by Mark Fletcher <ma...@gmail.com>.

Hi Shalin,

Thank you for the mail.
My main purpose in having 2 identical cores is as follows.
COREX - always serves user requests
COREY - once a day, takes the updates/latest data and passes them on to
COREX.

Suppose I have only one core, COREY, and a request comes to COREY while
the latest data is being written to it. Wouldn't that degrade the
performance of the requests at that point in time?

So I was planning to keep COREX and COREY always identical. Once COREY has
the latest data, it should somehow sync with COREX so that COREX also has the
latest. COREY keeps getting the updates at a particular time of day and
passes them on to COREX. This process repeats every day.

What is the best possible way to implement this?

Thanks,

Mark.

On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Hi Mark,
>
>  On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher <
> mark.fletcher2004@gmail.com> wrote:
>
>>
>> I ran the SWAP command. Now:-
>> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX
>> has the latest.
>> Again, COREY (on which the update regularly runs) is pointing to the old
>> index of COREX. So this now doesnt have the most updated index.
>>
>> Now shouldn't I update the index of COREY (now pointing to the old COREX)
>> so that it has the latest footprint as in COREX (having the latest COREY
>> index)so that when the update again happens to COREY, it has the latest and
>> I again do the SWAP.
>>
>> Is a physical copying of the index  named COREY (the latest and now datDir
>> of COREX after SWAP) to the index COREX  (now the dataDir of COREY.. the
>> orginal non-updated index of COREX) the best way for this or is there any
>> other better option.
>>
>> Once again, later when COREY is again updated with the latest, I will run
>> the SWAP again and it will be fine with COREX again pointing to its original
>> dataDir (now the updated one).So every even SWAP command run will point
>> COREX back to its original dataDir. (same case with COREY).
>>
>> My only concern is after the SWAP is done, updating the old index (which
>> was serving previously and now replaced by the new index). What is the best
>> way to do that? Physically copy the latest index to the old one and make it
>> in sync with the latest one so that by the time it is to get the latest
>> updates it has the latest in it so that the new ones can be added to this
>> and it becomes the latest and is again swapped?
>>
>
> Perhaps it is best if we take a step back and understand why you need two
> identical cores?
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: index merge

Posted by sudarshan <ch...@gmail.com>.

Hi All,
I have a basic question about index merging in Solr. My setup is as
follows:

Setup:
I used the schema.xml that comes with the Solr example. I had three cores:
core0, core1 and core2. I tried merging the indexes of core0 and core1
into core2. I copied the same schema.xml from SOLR_HOME/example/solr/conf to
core0 and core1, changing only the name field to core0 and core1
respectively.

Operations:
I indexed different files into core0 and core1. The search *:* in Solr showed
6 files and 9 files for core0 and core1 respectively. I then merged the
indexes of core0 and core1 into core2. As expected, the search *:* showed 15
files for core2. I added 2 new files to the index of core0 and 1 file to
core1 and merged again into core2. This time, to my surprise, the search "*"
showed a total of 33 files (15+18) instead of just 18. This duplication
continued with each merge operation, which is not efficient. Also, the merged
files were available for search only after restarting the Jetty server. Am I
missing something or doing something wrong? Is there a way to restart only
a specific core so that it reads the new index and reflects the merged
changes? Please explain the merge operation.
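
The numbers above are consistent with a merge that appends to whatever the target core already holds. The short sketch below reproduces that arithmetic and, as one possible way to refresh a single core without restarting Jetty, prints a CoreAdmin RELOAD request. It assumes the RELOAD action is available in the installed Solr version and that the server runs at localhost:8983; a commit on the target core may also be needed before merged documents become visible, which is likewise an assumption here.

```shell
#!/bin/sh
# Document counts reported in the message above.
core0=6
core1=9
core2=$((core0 + core1))            # first merge into an empty core2
core0=$((core0 + 2))                # 2 new files indexed into core0
core1=$((core1 + 1))                # 1 new file indexed into core1
core2=$((core2 + core0 + core1))    # second merge appends to the existing 15
echo "core2 after second merge: $core2"

# Build a CoreAdmin RELOAD URL for one core (assumed default host and port).
reload_url() {
  printf 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=%s\n' "$1"
}
echo "curl '$(reload_url core2)'"
```

To end up with 18 rather than 33, the target core would have to be emptied before each merge, or the merge would have to go into a fresh core every time.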

Thanks,
Sudarshan   



--
View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p3987121.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: index merge

Posted by uma m <um...@yahoo.co.in>.

Hi All,
 
   The problem is resolved. It was purely due to the filesystem. My filesystem
was 32-bit, running on a 64-bit OS. I changed to a 64-bit filesystem and
everything works as expected.

Uma
-- 
View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p832053.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: index merge

Posted by Ahmet Arslan <io...@yahoo.com>.

> I am running solr in 64 bit HP-UX system. The total
> index size is about
> 5GB and when i try load any new document, solr tries to
> merge the existing
> segments first and results in following error. I could see
> a temp file is
> growng within index dir around 2GB in size and later it
> fails with this
> exception. It looks like, by reaching Integer.MAXVALUE, the
> exception
> occurs.

<ramBufferSizeMB>32</ramBufferSizeMB>

Isn't a ramBufferSizeMB of 32 MB too small?

Re: index merge

Posted by uma m <um...@yahoo.co.in>.

Hi All,

  I am running Solr on a 64-bit HP-UX system. The total index size is about
5GB, and when I try to load any new document, Solr tries to merge the existing
segments first, which results in the following error. I can see a temp file
growing within the index dir to around 2GB in size, and later it fails with
this exception. It looks like the exception occurs when the file size reaches
Integer.MAX_VALUE.

Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: File too large (errno:27)
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
Caused by: java.io.IOException: File too large (errno:27)
        at java.io.RandomAccessFile.writeBytes(Native Method)
        at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
        at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192)
        at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
        at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
        at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:109)
        at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.close(SimpleFSDirectory.java:199)
        at org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:144)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:357)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5029)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4614)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
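
To make the 2GB observation concrete, the sketch below computes the largest value a signed 32-bit integer can hold (the same value as Java's Integer.MAX_VALUE) and compares it with the approximate index size. That the failure comes from a filesystem or process without large-file support is an assumption suggested by the "File too large" error, not something the trace proves; the helper name exceeds_32bit_limit is made up for this example.

```shell
#!/bin/sh
# Largest offset addressable with a signed 32-bit integer.
max_32bit=$(( (1 << 31) - 1 ))
echo "32-bit limit: $max_32bit bytes"

# Succeed if a file of the given size (in bytes) would exceed that limit.
exceeds_32bit_limit() {
  [ "$1" -gt "$max_32bit" ]
}

index_size=$(( 5 * 1024 * 1024 * 1024 ))   # roughly the 5GB index described above
if exceeds_32bit_limit "$index_size"; then
  echo "a merged file of $index_size bytes cannot be written under this limit"
fi
```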

-----------------------------------------------------------------------

The solrconfig.xml contains default values for <indexDefaults>, <mainIndex>
sections as below.

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush
         based on whichever limit is hit first. -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->

    <!-- Sets the amount of RAM that may be used by Lucene indexing
         for buffering added documents and deletions before they are
         flushed to the Directory. -->
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <!-- <maxMergeDocs>2147483647</maxMergeDocs> -->
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>-->
    <!--<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>-->
  </indexDefaults>
  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <!-- Deprecated -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <!--<maxMergeDocs>2147483647</maxMergeDocs>-->
  </mainIndex>


Could anyone help me to resolve this exception?

Regards,
Uma
-- 
View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p829810.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: index merge

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Mark,

On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher
<ma...@gmail.com> wrote:

>
> I ran the SWAP command. Now:-
> COREX has the dataDir pointing to the updated dataDir of COREY. So COREX
> has the latest.
> Again, COREY (on which the update regularly runs) is pointing to the old
> index of COREX. So this now doesnt have the most updated index.
>
> Now shouldn't I update the index of COREY (now pointing to the old COREX)
> so that it has the latest footprint as in COREX (having the latest COREY
> index)so that when the update again happens to COREY, it has the latest and
> I again do the SWAP.
>
> Is a physical copying of the index  named COREY (the latest and now datDir
> of COREX after SWAP) to the index COREX  (now the dataDir of COREY.. the
> orginal non-updated index of COREX) the best way for this or is there any
> other better option.
>
> Once again, later when COREY is again updated with the latest, I will run
> the SWAP again and it will be fine with COREX again pointing to its original
> dataDir (now the updated one).So every even SWAP command run will point
> COREX back to its original dataDir. (same case with COREY).
>
> My only concern is after the SWAP is done, updating the old index (which
> was serving previously and now replaced by the new index). What is the best
> way to do that? Physically copy the latest index to the old one and make it
> in sync with the latest one so that by the time it is to get the latest
> updates it has the latest in it so that the new ones can be added to this
> and it becomes the latest and is again swapped?
>

Perhaps it is best if we take a step back and understand why you need two
identical cores?

-- 
Regards,
Shalin Shekhar Mangar.

Re: index merge

Posted by Mark Fletcher <ma...@gmail.com>.

Hi Shalin,

Thank you for the reply.

I got your point. So I understand that a merge will just duplicate things.

I ran the SWAP command. Now:
COREX has its dataDir pointing to the updated dataDir of COREY, so COREX has
the latest.
COREY (on which the update regularly runs), in turn, is pointing to the old
index of COREX, so it no longer has the most up-to-date index.

Shouldn't I now update the index of COREY (now pointing to the old COREX
index) so that it matches COREX (which has the latest COREY index)? Then,
when the next update happens on COREY, it starts from the latest data and I
can do the SWAP again.

Is physically copying the index named COREY (the latest, and now the dataDir
of COREX after the SWAP) over the index named COREX (now the dataDir of
COREY, the original non-updated index of COREX) the best way to do this, or
is there a better option?

Later, when COREY is again updated with the latest, I will run the SWAP
again, and COREX will once more point to its original dataDir (now the
updated one). So every even-numbered SWAP will point COREX back to its
original dataDir (and the same holds for COREY).

My only concern is how, after the SWAP, to update the old index (the one
that was serving previously and has now been replaced by the new index).
What is the best way to do that? Should I physically copy the latest index
over the old one to bring it in sync, so that by the time the next updates
arrive it already has the latest data, the new documents can be added to
it, and it can be swapped in again?

Please share your opinion. Once again, your help is appreciated. I have been
going in circles with multiple indexes for some days!

Thanks and Rgds,
Mark.

On Mon, Mar 8, 2010 at 7:45 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Hi Mark,
>
> On Sun, Mar 7, 2010 at 6:20 PM, Mark Fletcher
> <ma...@gmail.com>wrote:
>
> >
> > I have created 2  identical cores coreX and coreY (both have different
> > dataDir values, but their index is same).
> > coreX - always serves the request when a user performs a search.
> > coreY - the updates will happen to this core and then I need to
> synchronize
> > it with coreX after the update process, so that coreX also has the
> >           latest data in it.  After coreX and coreY are synchronized,
> both
> > should again be identical again.
> >
> > For this purpose I tried core merging of coreX and coreY once coreY is
> > updated with the latest set of data. But I find coreX to be containing
> > double the record count as in coreY.
> > (coreX = coreX+coreY)
> >
> > Is there a problem in using MERGE concept here. If it is wrong can some
> one
> > pls suggest the best approach. I tried the various merges explained in my
> > previous mail.
> >
> >
> Index merge happens at the Lucene level which has no idea about uniqueKeys.
> Therefore when you merge two indexes containing exactly the same documents
> (by uniqueKey), you get double the document count.
>
> Looking at your scenario, it seems to me that what you want to do is a swap
> operation. coreX is serving the requests, coreY is updated and now you can
> swap coreX with coreY so that new requests hit the updated index. I suggest
> you look at the swap operation instead of index merge.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: index merge

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Mark,

On Sun, Mar 7, 2010 at 6:20 PM, Mark Fletcher
<ma...@gmail.com> wrote:

>
> I have created 2  identical cores coreX and coreY (both have different
> dataDir values, but their index is same).
> coreX - always serves the request when a user performs a search.
> coreY - the updates will happen to this core and then I need to synchronize
> it with coreX after the update process, so that coreX also has the
>           latest data in it.  After coreX and coreY are synchronized, both
> should again be identical again.
>
> For this purpose I tried core merging of coreX and coreY once coreY is
> updated with the latest set of data. But I find coreX to be containing
> double the record count as in coreY.
> (coreX = coreX+coreY)
>
> Is there a problem in using MERGE concept here. If it is wrong can some one
> pls suggest the best approach. I tried the various merges explained in my
> previous mail.
>
>
Index merge happens at the Lucene level which has no idea about uniqueKeys.
Therefore when you merge two indexes containing exactly the same documents
(by uniqueKey), you get double the document count.
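
A toy model of that behaviour, with each index represented as a plain list of uniqueKey values: this is only an illustration of "append without deduplication", not how Lucene stores or merges segments internally, and the document ids are invented.

```shell
#!/bin/sh
# Two "indexes" holding exactly the same documents, one uniqueKey per line.
index_x=$(printf '%s\n' doc1 doc2 doc3)
index_y=$(printf '%s\n' doc1 doc2 doc3)

# A merge that knows nothing about uniqueKeys simply concatenates them.
merged=$(printf '%s\n%s\n' "$index_x" "$index_y")

merged_count=$(printf '%s\n' "$merged" | wc -l | tr -d ' ')
unique_count=$(printf '%s\n' "$merged" | sort -u | wc -l | tr -d ' ')

echo "documents after merge: $merged_count"
echo "distinct uniqueKeys:   $unique_count"
```

The merged count is twice the number of distinct keys, which matches the doubled record count seen in COREX.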

Looking at your scenario, it seems to me that what you want to do is a swap
operation. coreX is serving the requests, coreY is updated and now you can
swap coreX with coreY so that new requests hit the updated index. I suggest
you look at the swap operation instead of index merge.
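
For reference, a swap request might look like the sketch below. It assumes the same host, port and core names as the mergeindexes commands in this thread and uses the CoreAdmin SWAP action with its core and other parameters; the command is printed rather than executed because it needs a running Solr instance.

```shell
#!/bin/sh
# Base URL taken from the mergeindexes examples in this thread.
SOLR="http://localhost:8983/solr"

# Build a CoreAdmin SWAP URL that exchanges the two named cores.
swap_url() {
  printf '%s/admin/cores?action=SWAP&core=%s&other=%s\n' "$SOLR" "$1" "$2"
}

echo "curl '$(swap_url coreX coreY)'"
```

After the swap, requests addressed to coreX are served from the index that was just updated under coreY.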

-- 
Regards,
Shalin Shekhar Mangar.

Fwd: index merge

Posted by Mark Fletcher <ma...@gmail.com>.

Hi,

I have created 2 identical cores, coreX and coreY (they have different
dataDir values, but their indexes are the same).
coreX - always serves the request when a user performs a search.
coreY - the updates happen on this core, and then I need to synchronize
it with coreX after the update process, so that coreX also has the
latest data in it. After coreX and coreY are synchronized, both
should be identical again.

For this purpose I tried merging coreX and coreY once coreY is
updated with the latest set of data. But I find that coreX contains
double the record count of coreY.
(coreX = coreX+coreY)

Is there a problem with using the MERGE concept here? If it is wrong, can
someone please suggest the best approach? I tried the various merges
explained in my previous mail.

Any help is deeply appreciated.

Thanks and Rgds,
Mark.



---------- Forwarded message ----------
From: Mark Fletcher <ma...@gmail.com>
Date: Sat, Mar 6, 2010 at 9:17 AM
Subject: index merge
To: solr-user@lucene.apache.org
Cc: goksron@gmail.com


Hi,

I have a question regarding index merging:

I have set up 2 cores COREX and COREY.
COREX - always serves user requests
COREY - gets updated with the latest values (dataDir is in a different
location from COREX)
Once COREY has been updated with the latest data, I tried merging coreX and
coreY so that both cores hold the latest data, and the user who always
queries COREX sees it. Please find below the approaches I tried and the
commands used.

I tried these merges:
COREX = COREX and COREY merged
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreX/data/index&indexDir=/opt1/solr/coreY/data/index'

COREX = COREY and COREY merged
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreY/data/index'

COREX = COREY and COREA merged (COREA just contains the initial 2 seed
segments.. a dummy core)
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=coreX&indexDir=/opt/solr/coreY/data/index&indexDir=/opt1/solr/coreA/data/index'

When I check the record count in COREX and COREY, COREX always contains
about double what COREY has. Is everything fine here, with only the record
count differing, or is something wrong?
Note: I have only 2 cores here, and I tried the X=X+Y, X=Y+Y and X=Y+A
approaches, where A is a dummy index. The record counts have never matched
after the merge.

Can someone please help me understand why this record count difference
occurs, and whether there is anything fundamentally wrong in my approach?

Thanks and Rgds,
Mark.