Posted to solr-user@lucene.apache.org by Julian Perry <ju...@limitless.co.uk> on 2015/03/27 00:19:39 UTC

Build index from Oracle, adding fields

Hi

I have looked and cannot see any clear answers to this on
the Interwebs.


I have an index with, say, 10 fields.

I load that index directly from Oracle - data-config.xml using
JDBC.  I can load 10 million rows very quickly.  This direct
way of loading from Oracle straight into SOLR is fantastic -
really efficient and saves writing loads of import/export code
(e.g. via a CSV file).

Of those 10 fields - two of them (set to multiValued) come from
a separate table, and there can be anywhere from 1 to 10 child
rows for each row in the main table.

I can use a nested entity to extract the child rows for each of
the 10m rows in the main table - but then SOLR generates 10m
separate SQL calls - and the load time goes from a few minutes
to several days.
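
For reference, the nested-entity part of my data-config.xml looks
roughly like this (table and column names are simplified/made up
here, and most of the main columns are omitted):

  <document>
    <entity name="main"
            query="SELECT ID, TITLE FROM MAIN_TABLE">
      <!-- DIH runs this child query once per parent row,
           i.e. ~10m separate SELECTs against Oracle -->
      <entity name="child"
              query="SELECT TAG, CATEGORY FROM CHILD_TABLE
                     WHERE MAIN_ID='${main.ID}'">
        <field column="TAG"      name="tag"/>
        <field column="CATEGORY" name="category"/>
      </entity>
    </entity>
  </document>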

On smaller tables - just a few thousand rows - I can use a
second nested entity with a JDBC call - but not for very large
tables.

Could I load the data in two steps:
1)  load the main 10m rows
2)  load into the existing index by adding the data from a
     second SQL call into fields for each existing row (i.e.
     an UPDATE instead of an INSERT).

I don't know what syntax/option might achieve that.  There
is incremental loading - but I think that replaces whole rows
rather than updating individual fields.  Or maybe it does
do both?

Any other techniques that would be fast/efficient?

Help!

-- 
Cheers
Jules.

Re: Build index from Oracle, adding fields

Posted by Julian Perry <ju...@limitless.co.uk>.
On 27/03/2015 12:42, Shawn Heisey wrote:
> If that's not practical, then the only real option you have is to drop
> back to one entity, and build a single SELECT statement (using JOIN and
> some form of CONCAT) that will gather all the information from all the
> tables at the same time, and combine multiple values together into one
> SQL result field with some kind of delimiter.  Then you can use the
> RegexTransformer's "splitBy" functionality to turn the concatenated data
> back into multiple values for your multi-valued field.  Database servers
> tend to be REALLY good at JOIN operations, so the database would be
> doing the heavy lifting.

I did try that in fact (and do it with one of my other indexes).

However, with this index the sub-select can return 200 rows of
200 characters each - and that blows up in Oracle because the
concatenated value exceeds the 4000-byte VARCHAR2 limit (the
work-around for that is to use CLOBs - but those have their own
performance problems).

Currently I am doing this by exporting a CSV file and
processing it with a C program - and then reading the CSV with
SOLR :(

-- 
Cheers
Jules.


Re: Build index from Oracle, adding fields

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/26/2015 5:19 PM, Julian Perry wrote:
> I have an index with, say, 10 fields.
>
> I load that index directly from Oracle - data-config.xml using
> JDBC.  I can load 10 million rows very quickly.  This direct
> way of loading from Oracle straight into SOLR is fantastic -
> really efficient and saves writing loads of import/export code
> (e.g. via a CSV file).
>
> Of those 10 fields - two of them (set to multiValued) come from
> a separate table, and there can be anywhere from 1 to 10 child
> rows for each row in the main table.
>
> I can use a nested entity to extract the child rows for each of
> the 10m rows in the main table - but then SOLR generates 10m
> separate SQL calls - and the load time goes from a few minutes
> to several days.
>
> On smaller tables - just a few thousand rows - I can use a
> second nested entity with a JDBC call - but not for very large
> tables.
>
> Could I load the data in two steps:
> 1)  load the main 10m rows
> 2)  load into the existing index by adding the data from a
>     second SQL call into fields for each existing row (i.e.
>     an UPDATE instead of an INSERT).
>
> I don't know what syntax/option might achieve that.  There
> is incremental loading - but I think that replaces whole rows
> rather than updating individual fields. Or maybe it does
> do both?

If those child tables do not have a large number of entries, you can
configure caching on the inner entities so that the child data doesn't
have to be fetched from the database server for every parent row.  If
there are a large number of entries, then that may not be possible due
to memory constraints.

https://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
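
With a cached child entity, DIH reads the whole child table once and
answers the per-row lookups from memory.  A rough sketch (table and
column names invented for illustration):

  <entity name="child"
          processor="CachedSqlEntityProcessor"
          query="SELECT MAIN_ID, TAG FROM CHILD_TABLE"
          cacheKey="MAIN_ID"
          cacheLookup="main.ID">
    <field column="TAG" name="tag"/>
  </entity>

The entire child table has to fit in the importer's heap, which is
why this only works when the child tables are reasonably small.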

If that's not practical, then the only real option you have is to drop
back to one entity, and build a single SELECT statement (using JOIN and
some form of CONCAT) that will gather all the information from all the
tables at the same time, and combine multiple values together into one
SQL result field with some kind of delimiter.  Then you can use the
RegexTransformer's "splitBy" functionality to turn the concatenated data
back into multiple values for your multi-valued field.  Database servers
tend to be REALLY good at JOIN operations, so the database would be
doing the heavy lifting.

https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
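
A rough sketch of what that can look like against Oracle (LISTAGG is
one way to do the CONCAT; table and column names are made up):

  <entity name="main"
          transformer="RegexTransformer"
          query="SELECT m.ID, m.TITLE,
                        LISTAGG(c.TAG, '|')
                          WITHIN GROUP (ORDER BY c.TAG) AS TAGS
                   FROM MAIN_TABLE m
                   LEFT JOIN CHILD_TABLE c ON c.MAIN_ID = m.ID
                  GROUP BY m.ID, m.TITLE">
    <field column="TAGS" name="tag" splitBy="\|"/>
  </entity>

Note that splitBy is a regex, so a '|' delimiter has to be escaped.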

Solr does have an equivalent concept to SQL's UPDATE, but there are
enough caveats to using it that it may not be a good option:

https://wiki.apache.org/solr/Atomic_Updates
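
As a very rough sketch, an atomic update posted to /update looks like
this (assuming your uniqueKey field is called "id" and the
multi-valued field is called "tag"):

  <add>
    <doc>
      <field name="id">12345</field>
      <field name="tag" update="set">red</field>
      <field name="tag" update="set">blue</field>
    </doc>
  </add>

The main caveat is that all the other fields have to be stored,
because Solr rebuilds the whole document from stored fields behind
the scenes.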

Thanks,
Shawn