Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2011/10/19 15:58:33 UTC

Painfully slow indexing

Hi guys,

I have set up a Solr instance and, upon attempting to index documents, the
whole process is painfully slow. I will try to put as much info as I can in
this mail. Please feel free to ask me for anything else that might be required.

I am sending documents in batches not exceeding 2,000. The size of each
batch varies but is usually around 10-15 MiB. My indexing script tells me
that Solr took T seconds to add N documents of size S. For the same data,
the add QTime in the Solr log is QT (in milliseconds). Some sample data:

    N      |        S         |    T    |   QT
-----------+------------------+---------+---------
  390 docs |  3,478,804 bytes | 14.5 s  | 2297 ms
  852 docs |  6,039,535 bytes | 25.3 s  | 4237 ms
 1345 docs | 11,147,512 bytes | 47 s    | 8543 ms
 1147 docs |  9,457,717 bytes | 44 s    | 2297 ms
 1096 docs | 13,058,204 bytes | 54.3 s  | 8782 ms

The time T includes converting an array of Hash objects into XML, POSTing
it to Solr, and waiting for Solr's acknowledgement. Clearly, there is a huge
difference between T and QT. Despite a lot of effort, I have no clue why
these times do not match.

The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M -Xmx5000M
-XX:+UseParNewGC

I believe my indexing is slow. The relevant portions of my solrconfig.xml
are as follows. On a related note, every document has one dynamic field.
At this rate, a full index of my database takes ~30 hours. I would really
appreciate the community's help in getting this indexing faster.

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">10</int>
    <int name="maxThreadCount">10</int>
  </mergeScheduler>
  <ramBufferSizeMB>2048</ramBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>3000000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <maxBufferedDocs>50000</maxBufferedDocs>
  <termIndexInterval>256</termIndexInterval>
  <mergeFactor>10</mergeFactor>
  <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnceExplicit">19</int>
    <int name="segmentsPerTier">9</int>
  </mergePolicy> -->
</indexDefaults>

<mainIndex>
  <unlockOnStartup>true</unlockOnStartup>
  <reopenReaders>true</reopenReaders>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
  </deletionPolicy>
  <infoStream file="INFOSTREAM.txt">false</infoStream>
</mainIndex>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
  </autoCommit>
</updateHandler>


*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>

Re: Painfully slow indexing

Posted by Pranav Prakash <pr...@gmail.com>.
Hey guys,

Your responses are welcome, but I still haven't gained much improvement.

*Are you posting through HTTP/SOLRJ?*
I am using the RSolr gem, which internally uses Ruby's HTTP library to POST
documents to Solr.
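
For context, the posting loop is roughly the following (a minimal sketch,
not my exact script; the URL, batch size, and field contents are placeholders):

  require 'rsolr'

  # Placeholder URL; point this at the actual Solr core.
  solr = RSolr.connect :url => 'http://localhost:8983/solr'

  # `documents` is an array of Ruby Hashes, one Hash per document.
  documents.each_slice(2000) do |batch|
    solr.add batch   # RSolr serializes the Hashes to XML and POSTs them
  end
  # No explicit commit here; autoCommit fires every 100,000 docs.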

*Your script time 'T' includes the time from sending the POST request to
receiving a successful response... right?*
Correct. It also includes the time taken to convert all those documents from
Ruby Hashes to XML.


*generate the ready-for-indexing XML documents on a file system*
Alain, I have around 6 million documents for indexing. Do you mean to say I
should convert all of them into one XML file and then index that?

*are you calling commit after your batches, or doing an optimize by any chance?*
I am not optimizing, but autocommit runs every 100,000 docs.

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Fri, Oct 21, 2011 at 16:32, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash <pr...@gmail.com> wrote:
> > [original message snipped for brevity]
>
> hey,
>
> are you calling commit after your batches, or doing an optimize by any chance?
>
> I would suggest you stream your documents to Solr and commit only
> when you really need to. Set your RAM buffer to something between
> 256 and 320 MB and remove the maxBufferedDocs setting completely. You
> can also experiment with your merge settings a little; 10 merging
> threads seems like a lot. I know you have lots of CPU, but IO will be
> the bottleneck here.
>
> simon
>

Re: Painfully slow indexing

Posted by Simon Willnauer <si...@googlemail.com>.
On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash <pr...@gmail.com> wrote:
> [original message snipped for brevity]

hey,

are you calling commit after your batches, or doing an optimize by any chance?

I would suggest you stream your documents to Solr and commit only
when you really need to. Set your RAM buffer to something between
256 and 320 MB and remove the maxBufferedDocs setting completely. You
can also experiment with your merge settings a little; 10 merging
threads seems like a lot. I know you have lots of CPU, but IO will be
the bottleneck here.

simon
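
Applied to the <indexDefaults> from the original message, these suggestions
would look roughly like this (a sketch only; 256 MB and 3 merge threads are
illustrative starting points to experiment with, not tested values):

  <ramBufferSizeMB>256</ramBufferSizeMB>
  <!-- maxBufferedDocs removed: let the RAM buffer decide when to flush -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">4</int>
    <int name="maxThreadCount">3</int>
  </mergeScheduler>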

Re: Painfully slow indexing

Posted by Alain Rogister <al...@gmail.com>.
As an alternative, I can suggest an approach that worked great for me:

- generate the ready-for-indexing XML documents on a file system
- use curl to feed them into Solr
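
Concretely, the curl step is a one-liner per file, along these lines (the
URL and file name are placeholders for a default single-core setup):

  curl 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml' --data-binary @docs-0001.xml

  # one commit at the very end, instead of per batch
  curl 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml' --data-binary '<commit/>'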

I am not dealing with huge volumes, but I was surprised at how *fast* Solr
indexed my documents with this simple approach. The workflow is also easy
to manage, and the XML contents can easily be provisioned to multiple
systems, e.g. for setting up test environments.

Regards,

Alain

On Fri, Oct 21, 2011 at 9:46 AM, pravesh <su...@yahoo.com> wrote:

> Are you posting through HTTP/SOLRJ?
>
> Your script time 'T' includes the time from sending the POST request to
> receiving a successful response... right?
>
> Try sending in smaller batches, like 10-20. BTW, how many documents are
> you indexing?
>
> Regds
> Pravesh
>

Re: Painfully slow indexing

Posted by pravesh <su...@yahoo.com>.
Are you posting through HTTP/SOLRJ?

Your script time 'T' includes the time from sending the POST request to
receiving a successful response... right?

Try sending in smaller batches, like 10-20. BTW, how many documents are you
indexing?

Regds
Pravesh
