Posted to java-user@lucene.apache.org by "Umashanker, Srividhya" <sr...@hp.com> on 2014/06/20 17:47:35 UTC

Concurrent Indexing

Lucene Experts -

Recently we upgraded to Lucene 4, and we want to make use of Lucene's concurrent flushing feature.

Indexing for us includes certain DB operations and writing to Lucene, ended by a commit.  There may be multiple concurrent calls to the Indexer to publish single or multiple records.

So far, with the older version of Lucene, we had our indexing synchronized (one indexing thread), which means the waiting time grows with concurrency and execution time.

We are moving away from synchronized indexing, precisely to cut down that waiting period, and we are trying to find out whether we have to limit the number of threads that add documents and commit.

Below are the tests, publishing just 1000 records with 3 text fields each.

Java 7, JVM config:  -XX:MaxPermSize=384M  -XX:+HeapDumpOnOutOfMemoryError  -Xmx400m -Xms50m -XX:MaxNewSize=100m -Xss256k -XX:-UseParallelOldGC -XX:-UseSplitVerifier -Djsse.enableSNIExtension=false

IndexWriterConfig is left at its defaults. We also tried changing maxThreadStates, maxBufferedDocs, and ramBufferSizeMB - no impact.
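For reference, those knobs live on IndexWriterConfig in Lucene 4.x; a sketch of the kind of non-default configuration we tried (the values, and the 'analyzer' and 'directory' variables, are illustrative, not recommendations):

```java
// Lucene 4.x configuration sketch; 'analyzer' and 'directory' are assumed to exist.
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
cfg.setMaxThreadStates(8);      // cap on concurrent in-memory writer thread states
cfg.setMaxBufferedDocs(1000);   // flush after this many buffered docs...
cfg.setRAMBufferSizeMB(64.0);   // ...or after this much buffered RAM
IndexWriter writer = new IndexWriter(directory, cfg);
```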



                               Min time (ms)   Max time (ms)   Avg time (ms)
1 thread   - commit                       65             267          85
1 thread   - updateDocument                0              40           1
6 threads  - commit                       83            1449         552.42
6 threads  - updateDocument                0             175           1.5
10 threads - commit                      154            2429         874
10 threads - updateDocument                0             243           1.9
20 threads - commit                       76            4351        1622
20 threads - updateDocument                0             326           2.1

The more threads that try to write to Lucene, the more updateDocument() and commit() become bottlenecks.  In the table above, commits in the 10- and 20-thread runs average 874 ms and 1622 ms respectively over 1000 commits.

Are there configuration settings or suggestions for tuning the performance of these two methods, so that our service performs better under higher concurrency?

-vidhya
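The benchmark above - N publishing threads, each timing its own updateDocument-plus-commit call - can be sketched with plain JDK concurrency utilities. The Lucene calls are left as a placeholder Runnable here, so the sketch illustrates the harness only, not Lucene itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LatencyHarness {

    /** Runs 'tasks' copies of 'work' on a fixed pool of 'threads' threads and
     *  returns { min, max, avg } wall-clock latency per task, in nanoseconds. */
    static long[] measure(int threads, int tasks, Runnable work) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> timings = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            timings.add(pool.submit(() -> {
                long start = System.nanoTime();
                work.run();  // stand-in for writer.updateDocument(...) and/or commit()
                return System.nanoTime() - start;
            }));
        }
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
        for (Future<Long> f : timings) {
            long t = f.get();  // rethrows any failure from the task
            min = Math.min(min, t);
            max = Math.max(max, t);
            sum += t;
        }
        pool.shutdown();
        return new long[] { min, max, sum / tasks };
    }
}
```

Each row of the table above then corresponds to one measure(threads, 1000, work) call, with the placeholder replaced by the real indexing code.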



RE: Concurrent Indexing

Posted by "Umashanker, Srividhya" <sr...@hp.com>.
I did move the commit out into a separate thread, and I am using SearcherManager, refreshing the reader every 2 seconds if required.
I did some testing publishing a single document in each test (a combination of addDocument and updateDocument calls is happening).

Following are the tests we ran (performance summaries for each run are below). The environment remains the same as described earlier in the thread.

1. Thread pool of 1, 10000 publishing threads (takes 5 ms on average to complete)
2. Thread pool of 5, 10000 publishing threads
3. Thread pool of 10, 10000 publishing threads
4. Thread pool of 100, 100 publishing threads


Summary :
-------------

Among all of these, option 1 performs much better, averaging under 10 ms, with the addDocument and updateDocument methods as the bottleneck.
This link shows the JProfiler trace of updateDocument for run 2 (pool of 5 threads):
https://www.dropbox.com/s/3d41mautf12f373/2014-07-03%2012_15_54-IndexTester%20%5B3%5D%20-%20JProfiler%208.0.7.png

Question:
---------------

Is there a way to improve the addDocument and updateDocument performance?


Analysis Numbers for the test runs
---------------------------------------------


1. Thread pool of 1, 10000 publishing threads (5 ms avg to complete)
--------------------------------------------
min  |       3 ms
max  |     301 ms
avg  |   5.733 ms
--------------------------------------------
> 5000 ms         |    0 threads
> 2000, < 5000 ms |    0 threads
> 1000, < 2000 ms |    0 threads
>  500, < 1000 ms |    0 threads
>  100, <  500 ms |   21 threads
>   20, <   50 ms |   61 threads
>    0, <   20 ms | 9888 threads
--------------------------------------------

2. Thread pool of 5, 10000 publishing threads
--------------------------------------------
min  |       3 ms
max  |     677 ms
avg  | 14.8305 ms
--------------------------------------------
> 5000 ms         |    0 threads
> 2000, < 5000 ms |    0 threads
> 1000, < 2000 ms |    0 threads
>  500, < 1000 ms |    5 threads
>  100, <  500 ms |  105 threads
>   20, <   50 ms |  753 threads
>    0, <   20 ms | 8881 threads
--------------------------------------------

3. Thread pool of 10, 10000 publishing threads
--------------------------------------------
min  |       3 ms
max  |     980 ms
avg  | 31.8305 ms
--------------------------------------------
> 5000 ms         |    0 threads
> 2000, < 5000 ms |    0 threads
> 1000, < 2000 ms |    0 threads
>  500, < 1000 ms |   11 threads
>  100, <  500 ms |  340 threads
>   20, <   50 ms | 4493 threads
>    0, <   20 ms | 4095 threads
--------------------------------------------

4. Thread pool of 100, 100 publishing threads
--------------------------------------------
min  |     109 ms
max  |     939 ms
avg  |  651.42 ms
--------------------------------------------
> 5000 ms         |    0 threads
> 2000, < 5000 ms |    0 threads
> 1000, < 2000 ms |    0 threads
>  500, < 1000 ms |   74 threads
>  100, <  500 ms |   26 threads
>   20, <   50 ms |    0 threads
>    0, <   20 ms |    0 threads
--------------------------------------------
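The "commit in a separate thread" change tested above is the single-committer pattern Vitaly suggests further down in this thread: writer threads enqueue an acknowledgement after their add/updateDocument call, and one committer thread batches many writes under a single commit. A Lucene-free sketch, with the commit itself stood in for by a counter and all names illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Group-commit sketch: writer threads enqueue an "ack" right after their
 *  addDocument/updateDocument call; one committer thread drains the queue,
 *  performs a single commit covering the whole batch, then completes every
 *  pending ack. The Lucene commit itself is stood in for by a counter. */
class GroupCommitter implements AutoCloseable {
    private final BlockingQueue<CompletableFuture<Void>> pending = new LinkedBlockingQueue<>();
    private final AtomicInteger commits = new AtomicInteger();
    private final Thread committer;
    private volatile boolean running = true;

    GroupCommitter() {
        committer = new Thread(() -> {
            List<CompletableFuture<Void>> batch = new ArrayList<>();
            while (running || !pending.isEmpty()) {
                try {
                    CompletableFuture<Void> first = pending.poll(50, TimeUnit.MILLISECONDS);
                    if (first == null) continue;
                    batch.add(first);
                    pending.drainTo(batch);        // everything queued since the last commit
                    commits.incrementAndGet();     // writer.commit() would go here
                    batch.forEach(f -> f.complete(null));
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "committer");
        committer.start();
    }

    /** Called by a writer thread after its add/updateDocument; the returned
     *  future completes once a commit covering that write has run. */
    CompletableFuture<Void> ackAfterCommit() {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        pending.add(ack);
        return ack;
    }

    int commitCount() { return commits.get(); }

    @Override public void close() {
        running = false;
        try { committer.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Writer threads call ackAfterCommit() and block (or register a callback) on the returned future; many concurrent writes then share one commit instead of each paying the full commit cost measured earlier.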



-----Original Message-----
From: Vitaly Funstein [mailto:vfunstein@gmail.com] 
Sent: Saturday, June 21, 2014 11:50 AM
To: java-user@lucene.apache.org
Subject: Re: Concurrent Indexing

Hmm, I'm not sure you want to rely on the presence or absence of a particular document in the index to determine the recovery point. It may work for inserts, but not likely for updates or removes. I would look into driving the version numbers from the committer to the DB, and recording them as commit user data for each Lucene index commit. Then on startup or restart of your webapp, you simply grab the version from that data in the reopened Lucene commit point and query the database for all the records with version > that number; if you find any, add them to the index, and do one more commit to bring the index up to date.

This is probably beyond the scope of your original query, however.
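If the version watermark is recorded as commit user data (in Lucene 4.x, via IndexWriter.setCommitData(...) on the write side and IndexCommit.getUserData() after reopening), the startup step described above reduces to a filter over the DB rows. A minimal Lucene-free sketch, with the record type and the in-memory "rows" purely illustrative:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

class WatermarkRecovery {

    /** A DB row as seen by the indexer; 'version' is the per-transaction
     *  number that the committer would also record as commit user data. */
    static class DbRecord {
        final long version;
        final String payload;
        DbRecord(long version, String payload) { this.version = version; this.payload = payload; }
    }

    /** Rows written to the DB after the last version Lucene committed; on
     *  startup these are re-added to the index, followed by one commit. */
    static List<DbRecord> toReindex(long committedVersion, Collection<DbRecord> dbRows) {
        List<DbRecord> out = new ArrayList<>();
        for (DbRecord r : dbRows) {
            if (r.version > committedVersion) out.add(r);
        }
        out.sort(Comparator.comparingLong((DbRecord r) -> r.version));  // replay in order
        return out;
    }
}
```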


On Fri, Jun 20, 2014 at 10:46 PM, Umashanker, Srividhya < srividhya.umashanker@hp.com> wrote:

> We do have a way to recover partially with a version number for each
> transaction. The same version is maintained in lucene as one document.
> During startup these numbers define what has to be synced up.
> Unfortunately lucene is used in a webapp, so this happens "only" during a jetty restart.
>
> - Vidhya
>
>
> > On 21-Jun-2014, at 11:08 am, "Vitaly Funstein" <vf...@gmail.com>
> wrote:
> >
> > This is a better idea than what you had before, but I don't think 
> > there's any point in doing any commits manually at all unless you 
> > have a way of detecting and recovering exactly the data that hasn't 
> > been committed. In other words, what difference does it make whether 
> > you lost 1 index record or 1M, if you can't determine which records 
> > were lost and need to reindex everything from the start anyway, to 
> > ensure consistency between SOR and Lucene?
> >
> >
> >
> >
> > On Fri, Jun 20, 2014 at 10:20 PM, Umashanker, Srividhya < 
> > srividhya.umashanker@hp.com> wrote:
> >
> >> Let me try with the NRT and periodic commit  say every 5 mins in a 
> >> committer thread on need basis.
> >>
> >> Is there a threshold limit on how long we can go without committing? I
> >> think the buffers get flushed to disk, but not in a crash-proof way on
> >> disk. So we should be good on memory.
> >>
> >> I should also verify whether commit() takes longer when more data has
> >> piled up to commit. But it should definitely be better than committing
> >> for every thread.
> >>
> >> Will post back after tests.
> >>
> >> - Vidhya
> >>
> >>
> >>> On 21-Jun-2014, at 10:28 am, "Vitaly Funstein" 
> >>> <vf...@gmail.com>
> >> wrote:
> >>>
> >>> Hmm, I might have actually given you a slightly incorrect explanation
> >>> wrt what happens when internal buffers fill up. There will definitely
> >>> be a flush of the buffer, and segment files will be written to, but
> >>> it's not actually considered a full commit, i.e. an external reader
> >>> will not see these changes (yet). The exact details elude me, but there
> >>> are quite a few threads here on what happens during a commit (vs a
> >>> flush). However, when you call IndexWriter.close() a commit will
> >>> definitely happen.
> >>>
> >>> But in any event, if you use an NRT reader to search, then it shouldn't
> >>> matter to you when the commit actually takes place. Such readers search
> >>> uncommitted changes as well as those already on disk. If data
> >>> durability is not a requirement for you, i.e. if you can (and probably
> >>> do) reindex your data from SOR on startup, then not doing commits
> >>> yourself may be the way to go. Or perhaps you could reduce the amount
> >>> of data you need to reindex and still call commit() yourself
> >>> periodically, though not for every write transaction: introduce some
> >>> watermarking logic whereby you track the highest watermark committed to
> >>> Lucene, then reindex only the data from the DB from that point onward
> >>> (meaning only uncommitted data is lost and needs to be recovered, but
> >>> you can figure out exactly where that point is).
> >>>
> >>>
> >>>
> >>> On Fri, Jun 20, 2014 at 8:02 PM, Umashanker, Srividhya < 
> >>> srividhya.umashanker@hp.com> wrote:
> >>>
> >>>> It is non-transactional. We first write the same data to the database
> >>>> in a transaction and then call the writer's addDocument. If lucene
> >>>> fails we still hold the data to recover.
> >>>>
> >>>> I can avoid the commit if we use an NRT reader. We do need this to be
> >>>> searchable immediately.
> >>>>
> >>>> Another question. I did try removing commit() in each thread and
> >>>> waiting for lucene to auto commit, with maxBufferedDocs set to 100 and
> >>>> ramBufferSizeMB set to high values, so the doc count triggers first.
> >>>> But I did not see the first 100 docs' data in lucene even after 500
> >>>> docs.
> >>>>
> >>>> Is there a way for me to see when lucene auto commits?
> >>>>
> >>>> If we tune the auto commit parameters appropriately, do I still need
> >>>> the committer thread? Because its job is to call commit; the
> >>>> add/updateDocument is already done in my writer threads.
> >>>>
> >>>> Thanks for your time and your suggestions!
> >>>>
> >>>> - Vidhya
> >>>>
> >>>>
> >>>>> On 21-Jun-2014, at 12:09 am, "Vitaly Funstein" 
> >>>>> <vf...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> You could just avoid calling commit() altogether if your
> >>>>> application's semantics allow this (i.e. it's non-transactional in
> >>>>> nature). This way, Lucene will do commits when appropriate, based on
> >>>>> the buffering settings you chose. It's generally unnecessary and
> >>>>> undesirable to call commit at the end of each write, unless you need
> >>>>> to provide strict durability guarantees in your system.
> >>>>>
> >>>>> If you must acknowledge every write after it's been committed, set up
> >>>>> a single committer thread that does this whenever there are work
> >>>>> tasks in the queue. Then add to that queue from your writer
> >>>>> threads...
> >>>>>
> >>>>>



Re: Concurrent Indexing

Posted by Vitaly Funstein <vf...@gmail.com>.
This is a better idea than what you had before, but I don't think there's
any point in doing any commits manually at all unless you have a way of
detecting and recovering exactly the data that hasn't been committed. In
other words, what difference does it make whether you lost 1 index record
or 1M, if you can't determine which records were lost and have to reindex
everything from scratch anyway to ensure consistency between the system of
record (SOR) and Lucene?




On Fri, Jun 20, 2014 at 10:20 PM, Umashanker, Srividhya <
srividhya.umashanker@hp.com> wrote:

> Let me try with the NRT and periodic commit  say every 5 mins in a
> committer thread on need basis.
>
> Is there a threshold limit on how long we can go without committing ? I
> think the buffers get flushed to disk but not to crash proof on disk. So we
> should be good on memory.
>
> I should also verify if the time taken for commit() is longer when more
> data piled up to commit.  But definitely should be better than  committing
> for every thread..
>
> Will post back after tests.
>
> - Vidhya

Re: Concurrent Indexing

Posted by "Umashanker, Srividhya" <sr...@hp.com>.
Let me try the NRT reader with a periodic commit, say every 5 minutes, in a committer thread on an as-needed basis.

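Roughly, that periodic committer could look like the sketch below. This is a hypothetical sketch, not Lucene API: the `Runnable commit` parameter is a placeholder for a call to IndexWriter.commit(), and the class name `PeriodicCommitter` is invented for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Writer threads only flip a dirty flag after each add/updateDocument; a
// scheduled task commits at a fixed interval, and only when something has
// actually changed (the "on need basis" part).
class PeriodicCommitter {
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicCommitter(Runnable commit, long interval, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            if (dirty.getAndSet(false)) {   // skip the commit if nothing changed
                commit.run();
            }
        }, interval, interval, unit);
    }

    /** Writer threads call this after each successful write. */
    void markDirty() {
        dirty.set(true);
    }

    void close() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With a 5-minute interval, uncommitted work is bounded by one interval's worth of writes, which is what the watermark recovery below would have to replay.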
Is there a limit on how long we can go without committing? I think the buffers get flushed to disk, just not in a crash-proof way, so we should be fine on memory.

I should also verify whether commit() takes longer when more data has piled up to commit. But it should definitely be better than committing from every thread.

Will post back after tests.

- Vidhya


> On 21-Jun-2014, at 10:28 am, "Vitaly Funstein" <vf...@gmail.com> wrote:
> 
> Hmm, I might have actually given you a slightly incorrect explanation wrt
> what happens when internal buffers fill up. There will definitely be a
> flush of the buffer, and segment files will be written to, but it's not
> actually considered a full commit, i.e. an external reader will not see
> these changes (yet). The exact details elude me but there are quite a few
> threads here on what happens during a commit (vs a flush). However, when
> you call IndexWriter.close() a commit will definitely happen.
> 
> But in any event, if you use an NRT reader to search, then it shouldn't
> matter to you when the commit actually takes place. Such readers also
> search uncommitted changes as well as those already on disk. If data
> durability is not a requirement for you, i.e. if you can (and probably do)
> reindex your data from SOR on startup, then not doing commits yourself may
> be the way to go. Or perhaps you could reduce the amount of data you need
> to reindex and still call commit() yourself periodically though not for
> every write transaction, but maybe introduce some watermarking logic
> whereby you detect the highest watermark committed to Lucene. Then reindex
> only the data from the DB from that point onward (meaning only uncommitted
> data is lost and needs to be recovered, but you can figure out exactly
> where that point is).


Re: Concurrent Indexing

Posted by Vitaly Funstein <vf...@gmail.com>.
Hmm, I might have actually given you a slightly incorrect explanation of
what happens when internal buffers fill up. There will definitely be a
flush of the buffer, and segment files will be written, but it's not
actually considered a full commit, i.e. an external reader will not see
these changes (yet). The exact details elude me, but there are quite a few
threads here on what happens during a commit (vs. a flush). However, when
you call IndexWriter.close(), a commit will definitely happen.

But in any event, if you use an NRT reader to search, then it shouldn't
matter to you when the commit actually takes place. Such readers search
uncommitted changes as well as those already on disk. If data durability
is not a requirement for you, i.e. if you can (and probably do) reindex
your data from the SOR on startup, then not doing commits yourself may be
the way to go. Or perhaps you could reduce the amount of data you need to
reindex and still call commit() yourself periodically, though not for
every write transaction: introduce some watermarking logic whereby you
track the highest watermark committed to Lucene, then reindex only the
data from the DB from that point onward (meaning only uncommitted data is
lost and needs to be recovered, but you can figure out exactly where that
point is).
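The watermarking logic could be sketched as below. This is a hypothetical sketch: it assumes every record gets a monotonically increasing sequence number from the SOR, and that the committed watermark is made durable together with each Lucene commit (e.g. stored in the commit user data). The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Tracks the highest sequence number indexed vs. the highest one committed.
// After a crash, only SOR records above the committed watermark need to be
// reindexed, instead of everything from scratch.
class WatermarkRecovery {
    private long committedWatermark;   // recovered from the last Lucene commit
    private long indexedWatermark;     // highest seq indexed, possibly uncommitted

    WatermarkRecovery(long committedWatermark) {
        this.committedWatermark = committedWatermark;
        this.indexedWatermark = committedWatermark;
    }

    /** Record the sequence number of each document written to the index. */
    void onIndexed(long seq) {
        indexedWatermark = Math.max(indexedWatermark, seq);
    }

    /** Called when commit() succeeds; the watermark becomes durable. */
    void onCommit() {
        committedWatermark = indexedWatermark;
    }

    /** On restart: SOR records with seq > watermark must be reindexed. */
    List<Long> toReindex(SortedSet<Long> sorSequences) {
        return new ArrayList<>(sorSequences.tailSet(committedWatermark + 1));
    }
}
```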



On Fri, Jun 20, 2014 at 8:02 PM, Umashanker, Srividhya <
srividhya.umashanker@hp.com> wrote:

> It is non transactional. We first write the same data to database in a
> transaction and then call writer addDocument.  If lucene fails we still
> hold the data to recover.
>
> I can avoid the commit if we use NRT reader. We do need this to be
> searchable immediately.
>
> Another question. I did try removing commit() in each thread and wait for
> lucene to auto commit with maxBufferedDocs set to 100 and ramBufferedSize
> set to high values, so docs triggers first. But did not see the 1st 100
> docs data in lucene even after 500 docs.
>
> Is there a way for me to see when lucene auto commits?
>
> If we tune the auto commit parameters appropriately, do i still need the
> committer thread ? Because it's job is to call commit. Anyway
> add/updateDocument is already done in my writer threads.
>
> Thanks for your time and your suggestions!
>
> - Vidhya

Re: Concurrent Indexing

Posted by "Umashanker, Srividhya" <sr...@hp.com>.
It is non-transactional. We first write the same data to the database in a transaction and then call the writer's addDocument(). If Lucene fails, we still hold the data to recover.

I can avoid the commit if we use an NRT reader. We do need this to be searchable immediately.

Another question: I did try removing commit() from each thread and waiting for Lucene to auto-commit, with maxBufferedDocs set to 100 and ramBufferSizeMB set to a high value so the document count triggers first. But I did not see the first 100 docs in Lucene even after 500 docs.

Is there a way for me to see when Lucene auto-commits?

If we tune the auto-commit parameters appropriately, do I still need the committer thread? Its only job is to call commit(); add/updateDocument() is already done in my writer threads.

Thanks for your time and your suggestions!

- Vidhya


> On 21-Jun-2014, at 12:09 am, "Vitaly Funstein" <vf...@gmail.com> wrote:
> 
> You could just avoid calling commit() altogether if your application's
> semantics allow this (i.e. it's non-transactional in nature). This way,
> Lucene will do commits when appropriate, based on the buffering settings
> you chose. It's generally unnecessary and undesirable to call commit at the
> end of each write, unless you need to provide strict durability guarantees
> in your system.
> 
> If you must acknowledge every write after it's been committed, set up a
> single committer thread that does this when there are any work tasks in the
> queue. Then add to that queue from your writer threads...


Re: Concurrent Indexing

Posted by Vitaly Funstein <vf...@gmail.com>.
You could just avoid calling commit() altogether if your application's
semantics allow this (i.e. it's non-transactional in nature). This way,
Lucene will do commits when appropriate, based on the buffering settings
you chose. It's generally unnecessary and undesirable to call commit() at the
end of each write, unless you need to provide strict durability guarantees
in your system.

If you must acknowledge every write after it's been committed, set up a
single committer thread that does this when there are any work tasks in the
queue. Then add to that queue from your writer threads...
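The single-committer pattern described above could be sketched roughly like this. It is a hypothetical sketch, not Lucene API: the `Runnable commit` parameter stands in for IndexWriter.commit(), and the class name `BatchingCommitter` is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Writer threads index documents on their own, then enqueue a request for an
// acknowledgement; a single committer thread drains the queue, issues ONE
// commit per batch, and completes every pending acknowledgement.
class BatchingCommitter {
    private final BlockingQueue<CompletableFuture<Void>> pending = new LinkedBlockingQueue<>();
    private final Runnable commit;           // placeholder for writer::commit
    private final Thread committer;
    private volatile boolean running = true;

    BatchingCommitter(Runnable commit) {
        this.commit = commit;
        this.committer = new Thread(this::loop, "committer");
        this.committer.start();
    }

    /** Writer threads call this after addDocument()/updateDocument(). */
    CompletableFuture<Void> requestCommit() {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        pending.add(ack);
        return ack;
    }

    private void loop() {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        while (running || !pending.isEmpty()) {
            try {
                CompletableFuture<Void> first = pending.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;     // nothing to commit right now
                batch.add(first);
                pending.drainTo(batch);          // pick up everything queued so far
                commit.run();                    // one commit() covers the whole batch
                for (CompletableFuture<Void> ack : batch) ack.complete(null);
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void shutdown() throws InterruptedException {
        running = false;
        committer.join();
    }
}
```

The point of the pattern is that N concurrent writers waiting on one batched commit pay for one fsync instead of N, which directly attacks the commit() numbers in the table below.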


On Fri, Jun 20, 2014 at 8:47 AM, Umashanker, Srividhya <
srividhya.umashanker@hp.com> wrote:

> Lucene Experts -
>
> Recently we upgraded to Lucene 4. We want to make use of concurrent
> flushing feature Of Lucene.
>
> Indexing for us includes certain db operations and writing to lucene ended
> by commit.  There may be multiple concurrent calls to Indexer to publish
> single/multiple records.
>
> So far, with older version of lucene, we had our indexing synchronized (1
> thread indexing).
> Which means waiting time is more, based on concurrency and execution time.
>
> We are moving away from the Synchronized indexing. Which is actually to
> cut down the waiting period.  Trying to find out if we have to limit the
> number of threads that adds document and commits.
>
> Below are the tests - to publish just 1000 records with 3 text fields.
>
> Java 7 , JVM config :  -XX:MaxPermSize=384M
>  -XX:+HeapDumpOnOutOfMemoryError  -Xmx400m -Xms50m -XX:MaxNewSize=100m
> -Xss256k -XX:-UseParallelOldGC -XX:-UseSplitVerifier
> -Djsse.enableSNIExtension=false
>
> IndexConfiguration being default : We also tried with changes in
> maxThreadStates,maxBufferedDocs,ramBufferSizeMB - no impact.
>
>
>
> Test (1000 records, 3 text fields)   Min (ms)   Max (ms)   Avg (ms)
> 1 thread   - commit                      65        267        85
> 1 thread   - updateDocument               0         40         1
> 6 threads  - commit                      83       1449       552.42
> 6 threads  - updateDocument               0        175         1.5
> 10 threads - commit                     154       2429       874
> 10 threads - updateDocument               0        243         1.9
> 20 threads - commit                      76       4351      1622
> 20 threads - updateDocument               0        326         2.1
>
> The more threads that try to write to Lucene, the more updateDocument() and
> commit() become bottlenecks.  In the table above, 10 and 20 threads average
> around 1.5 sec per commit over 1000 commits.
>
> Is there some configuration or suggestions to tune the performance of these
> 2 methods, so that our service performs better with more concurrency?
>
> -vidhya
>
>
>