Posted to solr-user@lucene.apache.org by LOPEZ-CORTES Mariano-ext <ma...@pole-emploi.fr> on 2018/02/15 07:34:06 UTC

Reading data from Oracle

Hello

We have to delete our Solr collection and feed it periodically from an Oracle database (up to 40M rows).

We've done the following test: from a Java program, we read chunks of data from Oracle and inject them into Solr (via SolrJ).
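
Simplified, the loop looks roughly like this (table, field names, URL and batch size are illustrative):

    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class OracleToSolr {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "pass");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, label FROM big_table");
                 SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/collection1").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("label", rs.getString("label"));
                    batch.add(doc);
                    if (batch.size() == 1000) {   // one HTTP request per 1000 docs
                        solr.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) solr.add(batch);
                solr.commit();
            }
        }
    }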

The problem: it is really, really slow (1.5 nights).

Is there a faster method to do this?

Thanks in advance.

Re: Reading data from Oracle

Posted by Michal Hlavac <mi...@hlavki.eu>.
Did you try to use ConcurrentUpdateSolrClient instead of HttpSolrClient?
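
Something like this (a sketch for SolrJ 7.x; URL, queue size and thread count are only examples to tune):

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

    // Queues documents client-side and sends them from background threads.
    ConcurrentUpdateSolrClient solr =
        new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/collection1")
            .withQueueSize(10000)   // how many docs add() can buffer before blocking
            .withThreadCount(4)     // parallel connections draining the queue
            .build();

    // solr.add(doc) returns almost immediately; the HTTP traffic happens
    // in the background.  Drain the queue before the final commit:
    // solr.blockUntilFinished();
    // solr.commit();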

m.

On Thursday, 15 February 2018 at 8:34:06 CET, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle database (up to 40M rows).
> 
> We've done the following test: from a Java program, we read chunks of data from Oracle and inject them into Solr (via SolrJ).
> 
> The problem: it is really, really slow (1.5 nights).
> 
> Is there a faster method to do this?
> 
> Thanks in advance.

Re: Reading data from Oracle

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/15/2018 12:34 AM, LOPEZ-CORTES Mariano-ext wrote:
> We've done the following test: from a Java program, we read chunks of data from Oracle and inject them into Solr (via SolrJ).
>
> The problem: it is really, really slow (1.5 nights).
>
> Is there a faster method to do this?

Are you indexing with a single thread?  The way to speed up indexing is
to index with many threads or processes simultaneously.

Using ConcurrentUpdateSolrClient as Michal mentioned is one way to get
multi-threading, but if you go this route, your program will never know
about any indexing errors: they are logged, but never reported back to
your code.  This client is good for initial bulk indexing, but once
things become more automated, you're probably going to want automatic
detection and notification when there's a problem.

If you care about error handling, then you're going to have to manage
multiple threads or processes in your own program.  If you don't care
about error handling, then go ahead and use ConcurrentUpdateSolrClient.
But if your database is the bottleneck, neither approach will make
things faster.
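
Here's a rough sketch of the threaded approach (URL, pool size and
batching are made up; exception handling trimmed):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // HttpSolrClient is thread-safe, so one instance can be shared by all workers.
    HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/collection1").build();
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<?>> pending = new ArrayList<>();

    for (List<SolrInputDocument> batch : batches) {  // batches read from Oracle
        pending.add(pool.submit(() -> {
            solr.add(batch);    // indexing errors throw here, inside the task...
            return null;
        }));
    }
    pool.shutdown();
    for (Future<?> f : pending) {
        f.get();                // ...and surface here, where your code can react
    }
    solr.commit();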

For something later in the thread:

One common problem with the dataimport handler and large tables is that
almost every JDBC driver will read the entire result of the SELECT
statement into memory before handing any rows to the program that
issued it.  For large tables, that result set can be larger than the
Java heap, and that will cause the program to encounter an
OutOfMemoryError.  To solve this, you will need to ask Oracle how to
disable this behavior in their JDBC driver.
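
In hand-written JDBC code, the standard knob is
Statement#setFetchSize().  Whether a given driver actually streams when
you set it is driver-specific, so this is a sketch only (check the
Oracle docs for the real behavior):

    import java.sql.*;

    Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
    con.setAutoCommit(false);     // some drivers only stream outside auto-commit
    Statement stmt = con.createStatement();
    stmt.setFetchSize(500);       // hint: fetch rows in chunks of 500
    ResultSet rs = stmt.executeQuery("SELECT id, label FROM big_table");
    // iterate rs as usual; rows are (ideally) fetched incrementally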

For MySQL, the solution is to set the batchSize parameter in DIH to -1,
which results in DIH setting a JDBC fetch size of Integer.MIN_VALUE,
the value that tells the MySQL driver to stream results instead of
putting them all into memory.  For Microsoft SQL Server, you need a
parameter on the JDBC URL, or simply to upgrade the JDBC driver to a
newer version that doesn't do this by default.  I have not been able to
figure out how to get the Oracle driver to do it.  Chances are that it
will be a JDBC URL parameter as well.
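
For reference, the MySQL fix looks like this in a DIH config (host,
database and credentials are placeholders):

    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost:3306/dbname"
                batchSize="-1"
                user="..." password="..."/>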

https://wiki.apache.org/solr/DataImportHandlerFaq

Thanks,
Shawn