Posted to solr-user@lucene.apache.org by Vikram Srinivasan <vi...@zettata.com> on 2013/11/05 06:45:16 UTC

Slow Indexing speed for csv files, multi-threaded indexing

Hello,

  I know this has been discussed extensively in past posts. I have tried a
bunch of suggestions and I still have a few questions.

 I am using Solr 4.4 on Tomcat 7 with OpenJDK 1.7, and I am using a
single Solr core.
 I am trying to index a bunch of CSV files (13GB total). Each CSV file
contains a long list of bigram-frequency tuples, (word1 word2,
frequency), as shown below.

E.g.: blue sky, 2500
      green grass, 300

My schema.xml is as simple as can be: I index these two fields as the
string and long types, with no tokenizer or analyzer factories, as
shown below.


<fields>
  <field name="_version_" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
  <field name="word" type="string" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
  <field name="frequency" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
</fields>

In my solrconfig.xml:

My ramBufferSizeMB is 100, my mergeFactor is 10, and maxIndexingThreads
is 8.
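
Concretely, that corresponds to these <indexConfig> settings:

  <indexConfig>
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxIndexingThreads>8</maxIndexingThreads>
  </indexConfig>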

I am indexing with SolrJ and ConcurrentUpdateSolrServer (CUSS). I have
set the queue size to 10000, the number of threads to 10, and the
request format to javabin.
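
Roughly, the setup looks like this (the URL is a placeholder for my
actual host and core):

  import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

  // Placeholder URL; queue size 10000, 10 sender threads, javabin on the wire.
  final ConcurrentUpdateSolrServer server =
      new ConcurrentUpdateSolrServer("http://localhost:8080/solr", 10000, 10);
  server.setRequestWriter(new BinaryRequestWriter());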

I run my SolrJ program with the path to the directory where the CSV
files are stored.

I start one instance of CUSS and have multiple threads reading from the
various files and writing to CUSS simultaneously, roughly as in the
sketch below. I commit only once, after all the records have been
indexed, and my autoCommit thresholds for document count and time are
set to very large values.
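
In outline (continuing from the setup snippet above, with error
handling stripped down and csvFiles standing in for the File objects
found in the directory):

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.solr.common.SolrInputDocument;

  ExecutorService pool = Executors.newFixedThreadPool(8);
  for (final File csv : csvFiles) {
      pool.submit(new Runnable() {
          @Override
          public void run() {
              try (BufferedReader in = new BufferedReader(new FileReader(csv))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      int comma = line.lastIndexOf(',');  // e.g. "blue sky, 2500"
                      SolrInputDocument doc = new SolrInputDocument();
                      doc.addField("word", line.substring(0, comma).trim());
                      doc.addField("frequency",
                                   Long.parseLong(line.substring(comma + 1).trim()));
                      server.add(doc);  // CUSS queues the doc; its threads do the HTTP
                  }
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      });
  }
  pool.shutdown();
  pool.awaitTermination(1, TimeUnit.HOURS);  // enclosing method declares throws Exception
  server.commit();                           // single commit at the very end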

I have tried indexing a test set of CSV files containing 1.44M records
(21MB total). All my tests have been on different types of Amazon EC2
instances, e.g. m1.xlarge (4 vCPUs, 15GB RAM) and m3.2xlarge (8 vCPUs,
30GB RAM).

I have set my JVM heap size large enough and tuned GC parameters as
suggested on various forums.

Observations:

1. Indexing 1.44M records (one record per CSV row) takes 240s on the
m1.xlarge instance and 160s on the m3.2xlarge instance.
2. The indexing speed is independent of whether I have one large file with
1.44M rows or 2 files with 720K rows each.
3. My indexing speed is independent of the number of threads and the
queue size I specify for CUSS. I have set both parameters as low as 1
with no difference.
4. My indexing speed is independent of mergeFactor, ramBufferSizeMB,
and the number of indexing threads. I've tried various settings.
5. It appears that I am not really indexing my files in parallel when I
use a single Solr core. Is parallel indexing not possible? What exactly
does maxIndexingThreads in solrconfig.xml control?
6. My concern is that my indexing speed is far slower than the figures
claimed on various forums (e.g., 29GB of Wikipedia in 13 minutes, 50GB
in 39 minutes, etc.), even with a single Solr core.

What am I doing wrong? How do I speed up my indexing? Any suggestions will
be appreciated.

Thanks,
Vikram

Re: Slow Indexing speed for csv files, multi-threaded indexing

Posted by Erick Erickson <er...@gmail.com>.
Vikram:

An experiment I've found useful: just comment out the
server.add() call and run your program. That won't index anything,
but if it's still slow, then your problem is acquiring the data and
you know where to concentrate your efforts. I've seen that be the
cause of slow indexing more often than not, actually.
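
I.e., something like this, where in is your reader and makeDoc() is
just a stand-in for however you build each document:

  while ((line = in.readLine()) != null) {
      SolrInputDocument doc = makeDoc(line);
      // server.add(doc);  // commented out: now you're timing pure read/parse cost
  }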


Here's another thing to try: do it locally. Just spin up
a small Solr instance on your workstation and run the same
test. My guess is you'll see vastly improved performance,
in which case we're talking about network latency here.

Alternatively, you can monitor the CPU utilization on
your EC2 instances and see whether you're using it heavily. I
suspect you'll find that you're not really exercising Solr and that
the bottleneck is network transmission or some such.

Your point <3> is a bit puzzling. CUSS threads and queue
size are really about network I/O: the idea is that
multiple threads simultaneously send packets to Solr.
Are you batching up the documents you send, or sending
them one at a time? I.e., use server.add(doclist)
rather than server.add(doc). What happens if you send, say,
1,000 docs at a time?
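
Something like this (an untested sketch; makeDoc() again stands in for
however you build each SolrInputDocument):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.common.SolrInputDocument;

  List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
  String line;
  while ((line = in.readLine()) != null) {
      batch.add(makeDoc(line));
      if (batch.size() >= 1000) {
          server.add(batch);  // one request carries 1,000 docs
          // Start a fresh list rather than clearing, to be safe with the async queue.
          batch = new ArrayList<SolrInputDocument>(1000);
      }
  }
  if (!batch.isEmpty()) {
      server.add(batch);  // flush the remainder
  }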

Best,
Erick


