Posted to users@solr.apache.org by Shankar R <ia...@gmail.com> on 2022/09/29 08:06:05 UTC

Fastest way to index data to solr

Hi,
 We have nearly 70-80 million records that need to be indexed into
Solr 8.6.1.
 We want to choose between the Java binary (JavaBin) format and direct JSON.
 Our source data comes from a DBMS, so it is structured.

Regards
Ravi

Re: Fastest way to index data to solr

Posted by Shawn Heisey <ap...@elyograg.org.INVALID>.
On 9/29/22 22:28, Gus Heck wrote:
>> * Do NOT commit during the bulk load, wait until the end
> Unless something changed, this is slightly risky. It can lead to very large
> transaction logs and very long playback of the tx log on startup.

It is always good practice to have autoCommit configured with 
openSearcher set to false and a relatively low maxTime value.  I believe 
the configs that Solr ships with set this to 15 seconds (actual value 
being 15000 milliseconds), but I prefer making it 60 seconds just so 
there is less overall stress on the system.  That setting will eliminate 
the problem with huge transaction logs.  I believe this is discussed on 
that Lucidworks article that you linked.
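
For reference, that autoCommit block in solrconfig.xml looks roughly like
this (a sketch of the shipped default; raise maxTime to 60000 per the
above if you prefer):

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>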

A commit that opens a new searcher should be done at the end of the 
major indexing job.  I would do this as a soft commit, but there's 
nothing wrong with making it a hard commit that has openSearcher set to 
true.  On large indexing jobs there is likely to be little difference in 
performance between the two types of commit.
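
With SolrJ that final commit is a one-liner (a sketch; "client" stands in
for whatever SolrClient instance you are indexing with):

    // explicit commit at the very end of the bulk load:
    // waitFlush=true, waitSearcher=true, softCommit=true
    client.commit(true, true, true);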

Thanks,
Shawn

Re: Fastest way to index data to solr

Posted by Gus Heck <gu...@gmail.com>.
>
> * Do NOT commit during the bulk load, wait until the end
>

Unless something changed, this is slightly risky. It can lead to very large
transaction logs and very long playback of the tx log on startup. If Solr
goes down during indexing due to something like an OOM, it could take a very
long time for it to restart, likely leading to people restarting it because
they think it's stuck... and then it has to start at the beginning
again...  (Ref:
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/)
... Infrequent commits might be better than none (but you definitely do not
want overly frequent commits, certainly not after every batch, or, even
worse, after every doc).

-Gus

Re: Fastest way to index data to solr

Posted by Dave <ha...@gmail.com>.
Another way to handle this is to have your indexing code fork out to as many cores as the Solr indexing server has. It’s far less work to make the code run itself that many times in parallel, and as long as your SQL queries and the tables behind them are properly indexed, the database shouldn’t be a bottleneck. You just need to make sure the indexing server has the resources it needs, since obviously you never index on a query server; the query server is just a copy, tuned for fast reads rather than writes.
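
A minimal sketch of that kind of source-side partitioning over JDBC (the
table and column names are made up, "conn" is an open java.sql.Connection,
and workers/workerId come from however you fork the processes):

    // each forked worker (0..workers-1) reads a disjoint slice of the table
    String sql = "SELECT id, title, body FROM docs WHERE MOD(id, ?) = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, workers);
        ps.setInt(2, workerId);
        ps.setFetchSize(1000); // stream rows instead of buffering the result
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // build a SolrInputDocument from the row and add it to a batch
            }
        }
    }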

> On Sep 29, 2022, at 2:21 PM, Andy Lester <an...@petdance.com> wrote:
> 
> 
> 
>> On Sep 29, 2022, at 4:17 AM, Jan Høydahl <ja...@cominvent.com> wrote:
>> 
>> * Index with multiple threads on the client, experiment to find a good number based on the number of CPUs on receiving side
> 
> That may also mean having multiple clients. We went from taking about 8 hours to index our entire 42M rows to about 1.5 hours because we ran 10 indexer clients at once. Each indexer takes roughly 1/10th of the data and churns away. We don't have any of the clients do a commit. After the indexers are done, we run one more time through the queue with a commit at the end.
> 
> As Jan says, make sure it's not your database that is the bottleneck, and experiment with how many clients you want to have going at once.
> 
> Andy

Re: Fastest way to index data to solr

Posted by Andy Lester <an...@petdance.com>.

> On Sep 29, 2022, at 4:17 AM, Jan Høydahl <ja...@cominvent.com> wrote:
> 
> * Index with multiple threads on the client, experiment to find a good number based on the number of CPUs on receiving side

That may also mean having multiple clients. We went from taking about 8 hours to index our entire 42M rows to about 1.5 hours because we ran 10 indexer clients at once. Each indexer takes roughly 1/10th of the data and churns away. We don't have any of the clients do a commit. After the indexers are done, we run one more time through the queue with a commit at the end.

As Jan says, make sure it's not your database that is the bottleneck, and experiment with how many clients you want to have going at once.

Andy

Re: Fastest way to index data to solr

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

If you want to index fast you should:
* Make sure you have enough hardware on the Solr side to handle the bulk load
* Index with multiple threads on the client, experiment to find a good number based on the number of CPUs on receiving side
* If using Java on the client, use CloudSolrClient, which is smart enough to send docs to the correct shard (see the sketch after this list)
* Do NOT commit during the bulk load, wait until the end
* Experiment with batch size, e.g. try sending 500 docs in each update request, then 1000, etc., until you find the best compromise
* Use JavaBin if you can, it should be slightly faster than JSON, but probably not much
* Remember that your RDBMS may be the bottleneck at the end of the day: how many rows can it deliver? You may need to partition the data set with SELECT ... WHERE clauses for each client to read in parallel.
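
A minimal SolrJ 8.x sketch putting these points together (the URL, the
collection name, and the fetchRows() database helper are hypothetical
placeholders, not working code for your schema):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        static final int BATCH_SIZE = 1000; // experiment: 500, 1000, ...
        static final int THREADS = 8;       // experiment based on CPUs on the Solr side

        public static void main(String[] args) throws Exception {
            // CloudSolrClient routes each document to the correct shard leader
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    List.of("http://solr1:8983/solr")).build()) {
                client.setDefaultCollection("mycollection");

                ExecutorService pool = Executors.newFixedThreadPool(THREADS);
                for (int i = 0; i < THREADS; i++) {
                    final int partition = i;
                    pool.submit(() -> indexPartition(client, partition));
                }
                pool.shutdown();
                pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);

                // one explicit commit at the very end; autoCommit with
                // openSearcher=false keeps the tlog bounded during the run
                client.commit();
            }
        }

        static void indexPartition(CloudSolrClient client, int partition) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            try {
                // fetchRows() is a stand-in for your JDBC paging, e.g.
                // SELECT ... WHERE MOD(id, THREADS) = partition
                for (SolrInputDocument doc : fetchRows(partition)) {
                    batch.add(doc);
                    if (batch.size() == BATCH_SIZE) {
                        client.add(batch); // add only, no commit here
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) client.add(batch);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        static Iterable<SolrInputDocument> fetchRows(int partition) {
            return new ArrayList<>(); // replace with real database reading code
        }
    }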

Jan

> 29. sep. 2022 kl. 10:06 skrev Shankar R <ia...@gmail.com>:
> 
> Hi,
> We have nearly 70-80 million records that need to be indexed into
> Solr 8.6.1.
> We want to choose between the Java binary (JavaBin) format and direct JSON.
> Our source data comes from a DBMS, so it is structured.
> 
> Regards
> Ravi


Re: Fastest way to index data to solr

Posted by Joel Bernstein <jo...@gmail.com>.
Unless something has changed recently, you will have a memory leak if you
don't at least soft commit during the load. This is due to the in-memory
tlog data used for real-time get; that in-memory tlog data is released when
a new searcher is opened.

So, if you're having memory issues while bulk loading data without a soft
commit, then set autoSoftCommit to an interval that balances load
performance with memory retention.
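
The knob for that is the autoSoftCommit block in solrconfig.xml; something
like a 5-minute interval (an arbitrary starting point, tune it to your
load) would look like:

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
    </autoSoftCommit>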



Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Sep 30, 2022 at 12:37 PM Andy Lester <an...@petdance.com> wrote:

> I can’t imagine a case where the speed in parsing the input data won’t be
> dwarfed by the time spent on everything else. You’re comparing an in-memory
> parsing operation against a process that does a ton of I/O.
>
> It’s not going to make a noticeable difference one way or the other.
>
> > I have a followup question. Is JSON parsed faster than XML by Solr
>
>

Re: Fastest way to index data to solr

Posted by Andy Lester <an...@petdance.com>.
I can’t imagine a case where the speed in parsing the input data won’t be dwarfed by the time spent on everything else. You’re comparing an in-memory parsing operation against a process that does a ton of I/O.

It’s not going to make a noticeable difference one way or the other.

> I have a followup question. Is JSON parsed faster than XML by Solr


Re: Fastest way to index data to solr

Posted by Dave <ha...@gmail.com>.
I don’t have any tests, but I know anything is faster than XML. You may as well stick to text files. XML is garbage; that’s why people moved on to formats like YAML and JSON.

> On Sep 30, 2022, at 3:47 AM, Thomas Corthals <th...@klascement.net> wrote:
> 
> Hi Gus,
> 
> I have a followup question. Is JSON parsed faster than XML by Solr if they
> represent the exact same documents?
> 
> Thomas
> 
> Op vr 30 sep. 2022 om 06:58 schreef Gus Heck <gu...@gmail.com>:
> 
>> If you are using a non-Java language you can use JSON.
>> 

Re: Fastest way to index data to solr

Posted by Thomas Corthals <th...@klascement.net>.
Hi Gus,

I have a followup question. Is JSON parsed faster than XML by Solr if they
represent the exact same documents?

Thomas

Op vr 30 sep. 2022 om 06:58 schreef Gus Heck <gu...@gmail.com>:

> If you are using a non-Java language you can use JSON.
>

Re: Fastest way to index data to solr

Posted by Gus Heck <gu...@gmail.com>.
70 million can be a lot or a little. Doc count is not even half the story.
How much storage space do these documents occupy in the database? Is the
text tweet-sized, or multi-megabyte CLOBs, or links to files on a file
store that need to be fetched and parsed (or OCR'd or converted from
audio/video to transcripts)? IoT-type docs with very minimal text can be
indexed much faster than 50-page PDF documents. With very large clusters
and indexing systems distributing work across a Spark cluster I've seen as
high as 1.3M docs/sec... and 70M would be trivial for that system (they had
hundreds of billions). But text documents are typically much, much slower
than that, especially if the text must be extracted from dirty formats such
as PDF or Word data, or complex custom analysis is involved, or additional
fetching of files or data to merge into the doc is required.

As for the two formats: if you are indexing with Java code, choose JavaBin.
If you are using a non-Java language you can use JSON. The rare case of
JSON from Java would be if your data were already in JSON format... then it
depends on whether Solr is limiting you (do the work on the indexers and
use JavaBin so Solr has less parsing to do) or your indexing machines are
limiting you (use JSON so your indexers don't have to do the conversion).
Like many things in search, "it depends" :)
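
For the non-Java path, JSON indexing is just an HTTP POST to the update
handler. A minimal sketch using Java's built-in HttpClient purely for
illustration (URL, collection name, and field names are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JsonUpdate {
        public static void main(String[] args) throws Exception {
            // a JSON array of documents for Solr's /update handler
            String json = "[{\"id\":\"1\",\"title_s\":\"hello\"}]";
            HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(
                    "http://localhost:8983/solr/mycollection/update"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }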

On Thu, Sep 29, 2022 at 4:07 AM Shankar R <ia...@gmail.com> wrote:

> Hi,
>  We have nearly 70-80 million records that need to be indexed into
> Solr 8.6.1.
>  We want to choose between the Java binary (JavaBin) format and direct JSON.
>  Our source data comes from a DBMS, so it is structured.
>
> Regards
> Ravi
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)