Posted to solr-user@lucene.apache.org by Costi Muraru <co...@gmail.com> on 2014/05/01 22:47:37 UTC

Fastest way to import a large number of documents in SolrCloud

Hi guys,

What would you say is the fastest way to import data into SolrCloud?
Our use case: a single daily import of a large number of documents.

Should we use SolrJ, DataImportHandler, or something else? Or is there
perhaps a bulk import feature in Solr? I came upon this promising link:
http://wiki.apache.org/solr/UpdateCSV
Any idea how UpdateCSV compares performance-wise with
SolrJ/DataImportHandler?
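For what it's worth, posting through the CSV update handler boils down to serializing documents as CSV with a header row and POSTing them to the handler. A minimal Python sketch (the collection name, URL, and field names below are placeholder assumptions, and the actual POST is wrapped in a function rather than executed):

```python
import csv
import io
import urllib.request

def docs_to_csv(docs, fields):
    """Serialize a list of dicts into a CSV string with a header row,
    which is the shape Solr's CSV update handler expects."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for doc in docs:
        writer.writerow(doc)
    return buf.getvalue()

def post_csv(payload, url="http://localhost:8983/solr/collection1/update/csv"):
    """POST the CSV payload to the (assumed) CSV update handler URL."""
    req = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/csv; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = docs_to_csv(
    [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}],
    fields=["id", "title"],
)
```

The appeal of the CSV path is that there is no per-document XML/JSON envelope, so the payload stays small and cheap to parse on the server side.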

If SolrJ, should we split the data into chunks and start multiple clients
at once? That way we could perhaps take advantage of the multiple servers
in the SolrCloud configuration.
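A sketch of that chunk-and-parallelize idea, assuming the real indexing call (a SolrJ client, an HTTP POST to a SolrCloud node, etc.) is substituted for the placeholder `index_batch`:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(docs, size):
    """Split the document list into batches of at most `size` documents."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def index_batch(batch):
    """Placeholder for the real indexing call (e.g. a SolrJ client or an
    HTTP POST to a SolrCloud node); here it just reports the batch size."""
    return len(batch)

def parallel_index(docs, batch_size=1000, workers=4):
    """Index batches concurrently and return the total document count."""
    batches = chunk(docs, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_batch, batches))

indexed = parallel_index([{"id": str(i)} for i in range(2500)], batch_size=1000)
```

The batch size and worker count are things to benchmark rather than fixed recommendations; the point is simply that several concurrent senders keep more of the cluster busy than a single client can.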

Either way, after the import is finished, should we do an optimize, a
commit, or neither (
http://wiki.solarium-project.org/index.php/V1:Optimize_command)?

Any tips and tricks to perform this process the right way are gladly
appreciated.

Thanks,
Costi

Re: Fastest way to import a large number of documents in SolrCloud

Posted by Costi Muraru <co...@gmail.com>.
Thanks for the reply, Anshum. Please see my answers to your questions below.

* Why do you want to do a full index everyday?
    Not sure I understand what you mean by a full index. Every day we want
to import additional documents alongside the existing ones. Of course, we
want to remove older ones as well, so the total amount remains roughly the
same.
* How much of data are we talking about?
    The number of new documents is around 500k each day.
* What's your SolrCloud setup like?
    We're currently using Solr 3.6 with 16 shards and planning to switch to
SolrCloud, hence the inquiry.
* Do you already have some benchmarks which you're not happy with?
    Not yet. Planning to do some tests quite soon. I was looking for some
guidance before jumping in.

"Also, it helps to set the commit intervals reasonable."
What do you mean by *reasonable*? Also, do you recommend using autoCommit?
We are currently doing an optimize after each import (in Solr 3) in order
to speed up future queries. This is proving to take very long, though
(several hours). Doing a commit instead of an optimize usually brings the
master and slave nodes down, so we reverted to calling optimize on every
ingest.




Re: Fastest way to import a large number of documents in SolrCloud

Posted by Anshum Gupta <an...@anshumgupta.net>.
Hi Costi,

I'd recommend SolrJ, parallelize the inserts. Also, it helps to set the
commit intervals reasonable.

Just to get a better perspective
* Why do you want to do a full index everyday?
* How much of data are we talking about?
* What's your SolrCloud setup like?
* Do you already have some benchmarks which you're not happy with?






-- 

Anshum Gupta
http://www.anshumgupta.net

Re: Fastest way to import a large number of documents in SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
re: optimize after every import....

This is not recommended in 4.x unless and until you have evidence that
it really does help. Reviews are very mixed, and it's been renamed to
"force merge" in 4.x precisely so people don't think "Of course I want
to do this, who wouldn't?".

bq: Doing a commit instead of optimize is usually bringing the master
and slave nodes down

This isn't expected unless you're committing far too frequently. I'd
recommend against doing any commits except, possibly, a single commit
after all the clients have finished indexing. But even that isn't
necessary.

In batch mode in SolrCloud, a reasonable setup is:
autoCommit: 15 seconds WITH openSearcher=false
autoSoftCommit: the interval it takes you to run all your indexing

Seems odd, but here's the background:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
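In solrconfig.xml terms, that setup would look roughly like the fragment below (the intervals are illustrative assumptions; in particular, adjust the autoSoftCommit interval to cover the length of your whole indexing run):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes segments to disk and truncates the transaction
       log, but with openSearcher=false it does NOT open a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: makes new documents visible to searches; set this to
       roughly the duration of the whole batch-indexing run -->
  <autoSoftCommit>
    <maxTime>3600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```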

Best,
Erick


Re: Fastest way to import a large number of documents in SolrCloud

Posted by Alexander Kanarsky <ka...@gmail.com>.
If you build your index in Hadoop, read this (it is about Cloudera
Search, but to my understanding it should also work with the Solr Hadoop
contrib since 4.7):
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html

