Posted to solr-user@lucene.apache.org by Joseph_Tucker <jo...@homehardware.ca> on 2019/07/05 11:43:20 UTC

Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

What is the best way - performance wise - to index data from multiple
databases?
I'm potentially going to have around 50 different data sources, each
supplying unique data.
Here's what I've roughly designed:

<entity name="root" dataSource="db1">
     <entity name="child1" dataSource="db1" />
     <entity name="child2" dataSource="db2" />
     <entity name="child3" dataSource="db3" />
     <entity name="child4" dataSource="db4" />
     <entity name="child5" dataSource="db5" />
     <entity name="child6" dataSource="db6" />
     <entity name="child7" dataSource="db7" />
     ...
     <entity name="child50" dataSource="db50" />
</entity>

I've excluded the fields, but each entity would contain a number of them.
The issue I'm seeing is that the full index is exceedingly slow. Is there a
better way to go about this?




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/11/2019 9:04 AM, Joseph_Tucker wrote:
> Looks like I've managed to get some semblance of this working.
> The indexes are much faster, but the RAM usage by SolrJ is quite high. Is it
> normal to see around 6GB of RAM usage?
> (My test is indexing 250,000 records with the 50 child entities)

Whatever max heap value you tell Java it can have, it will eventually
use.  That's how Java's garbage-collected heap works: memory is only
reclaimed when the collector runs, so usage tends to grow toward the
limit.  You can try lowering the max heap (the -Xmx option) to see
whether the program actually needs that much memory.  If it really does
require all the heap it's allowed, reducing the max heap size will cause
it to throw OutOfMemoryError and probably behave in an unpredictable
manner.

Many JDBC drivers will load the entire result set from a database query 
into memory by default, which can explain very high memory use.  You 
would need to research your specific JDBC driver to see if it does this, 
and if so, learn how to have it stream the results instead of storing them.
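
A rough sketch of the difference, not tied to any particular driver: the
Python below uses the standard-library sqlite3 module purely as a stand-in
for the real database and reads a result set in fixed-size batches rather
than with a single fetch of everything. The `items` table, the
`stream_rows` and `make_demo_db` names, and the batch size are made up for
illustration. In JDBC, `Statement.setFetchSize` is the corresponding hint,
but how (and whether) a driver honors it varies, so the driver
documentation remains the authority.

```python
import sqlite3
from typing import Iterator, List, Tuple


def stream_rows(conn: sqlite3.Connection, sql: str,
                batch_size: int = 1000) -> Iterator[List[Tuple]]:
    """Yield the rows of a query in batches of at most batch_size rows."""
    cursor = conn.cursor()
    cursor.execute(sql)
    while True:
        # fetchmany pulls only the next batch_size rows from the cursor
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def make_demo_db(row_count: int) -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical 'items' table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(row_count)],
    )
    conn.commit()
    return conn
```

Each batch can be turned into Solr documents and sent before the next one
is fetched, so the client code itself holds only one batch at a time.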

Thanks,
Shawn

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Joseph_Tucker <jo...@homehardware.ca>.
Thanks for the help.

Looks like I've managed to get some semblance of this working. 
The indexes are much faster, but the RAM usage by SolrJ is quite high. Is it
normal to see around 6GB of RAM usage?
(My test is indexing 250,000 records with the 50 child entities)

In short, I'm running through a loop against a DB 50 times (to mimic 50
entities) and adding the results to a Map, then using that map to loop
through and commit values to Solr.
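
One way to keep the map from growing with the whole data set, sketched in
Python under some assumptions: instead of collecting every result first,
work through the parent ids in chunks, build the documents for one chunk
from all sources, send them, and let the chunk go. Everything named here
is hypothetical: each entry in `sources` stands for a query against one
database restricted to the given ids, rows are assumed to come back as
(parent id, field name, value), and `send_batch` stands in for whatever
client call adds documents to Solr.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str, object]   # assumed row shape: (parent id, field, value)
Doc = Dict[str, object]
Source = Callable[[List[int]], Iterable[Row]]


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Split the id list into consecutive chunks of at most size ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_docs(id_chunk: List[int], sources: List[Source]) -> List[Doc]:
    """Merge the rows of every source into one document per parent id."""
    docs: Dict[int, Doc] = {pid: {"id": pid} for pid in id_chunk}
    for fetch in sources:
        for pid, field, value in fetch(id_chunk):
            docs[pid][field] = value
    return list(docs.values())


def index_in_chunks(all_ids: List[int], sources: List[Source],
                    send_batch: Callable[[List[Doc]], None],
                    chunk_size: int = 1000) -> int:
    """Build and send documents chunk by chunk; return how many were sent."""
    sent = 0
    for id_chunk in chunked(all_ids, chunk_size):
        docs = build_docs(id_chunk, sources)
        send_batch(docs)        # only this chunk's documents are held here
        sent += len(docs)
    return sent
```

With this shape the map never holds more than `chunk_size` documents, at
the cost of one query per source per chunk rather than one per source
overall.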


Jörn Franke wrote
> Ideally you use scripts that can use JVM/Java - in this way you can always
> use the latest SolrJ client library but also other libraries that are
> relevant (eg Tika for unstructured content).
> This does not have to be Java directly but can be based also on Scala or
> JVM script languages, such as Groovy.
> 
> There are also wrappers for Python etc, but those may not always leverage
> the latest version of the library.

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Jörn Franke <jo...@gmail.com>.
Ideally, write the scripts in a language that runs on the JVM - that way you can always use the latest SolrJ client library, as well as other relevant libraries (e.g. Tika for unstructured content).
This does not have to be Java itself; it can also be Scala or a JVM scripting language such as Groovy.

There are also wrappers for Python etc., but those may not always leverage the latest version of the library.

> On 08.07.2019 at 14:23, Joseph_Tucker <jo...@homehardware.ca> wrote:
> 
> Thanks again.
> 
> I guess I'll have to start researching how to create such custom indexing
> scripts and determine which language would be best based on the environment
> I'm using (Azure in this case). 
> 
> Appreciate the help greatly 

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You may also want to look at the existing systems, such as
https://nifi.apache.org/

Regards,
   Alex.

On Mon, 8 Jul 2019 at 08:23, Joseph_Tucker
<jo...@homehardware.ca> wrote:
>
> Thanks again.
>
> I guess I'll have to start researching how to create such custom indexing
> scripts and determine which language would be best based on the environment
> I'm using (Azure in this case).
>
> Appreciate the help greatly

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Joseph_Tucker <jo...@homehardware.ca>.
Thanks again.

I guess I'll have to start researching how to create such custom indexing
scripts and determine which language would be best based on the environment
I'm using (Azure in this case). 

Appreciate the help greatly

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Charlie Hull <ch...@flax.co.uk>.
On 05/07/2019 14:33, Joseph_Tucker wrote:
> Thanks for your help / suggestion.
>
> I'm not sure I completely follow in this case.
> SolrJ looks like a method to allow Java applications to talk to Solr, or any
> other third party application would simply be a communication method between
> Solr and the language of your choosing.
>
> I guess what I'm after is, how would using SolrJ improve performance when
> indexing?

It's not just about improving performance (although DIH is single 
threaded, so you could obtain a marked indexing performance gain using a 
client such as SolrJ).  With DIH you will embed a lot of SQL code into 
Solr's configuration files, and the more sources you add the more 
complicated, hard to debug and unmaintainable it's going to be. You 
should thus consider writing a proper indexing script in Java, Python or 
whatever language you are most familiar with - this has always been our 
approach.
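
To make the threading point concrete, here is a minimal Python sketch of
an indexing script that reads several sources in parallel. It is an
illustration only: `readers` are hypothetical callables that each yield
documents from one database, and `send_batch` is a placeholder for the
real client call that adds documents to Solr (it must be safe to call
from several threads).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Doc = Dict[str, object]
Reader = Callable[[], Iterable[Doc]]
Sender = Callable[[List[Doc]], None]


def index_source(read_docs: Reader, send_batch: Sender,
                 batch_size: int = 500) -> int:
    """Read one source and send its documents in batches; return the count."""
    sent = 0
    batch: List[Doc] = []
    for doc in read_docs():
        batch.append(doc)
        if len(batch) >= batch_size:
            send_batch(batch)
            sent += len(batch)
            batch = []
    if batch:                       # send the final, partially filled batch
        send_batch(batch)
        sent += len(batch)
    return sent


def index_all(readers: List[Reader], send_batch: Sender,
              max_workers: int = 4, batch_size: int = 500) -> int:
    """Index every source on a thread pool; return total documents sent."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(index_source, reader, send_batch, batch_size)
                   for reader in readers]
        # result() re-raises any exception raised inside a worker
        return sum(future.result() for future in futures)
```

The SQL for each source then lives in ordinary code (one reader per
database) rather than in Solr's configuration, and the number of workers
can be tuned to what the databases and Solr can absorb.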

Best


Charlie


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Joseph_Tucker <jo...@homehardware.ca>.
Thanks for your help / suggestion.

I'm not sure I completely follow in this case.
SolrJ looks like a method to allow Java applications to talk to Solr, or any
other third party application would simply be a communication method between
Solr and the language of your choosing. 

I guess what I'm after is, how would using SolrJ improve performance when
indexing?

*** I could be wrong in my assumptions as I'm still learning a great deal
about Solr. ***

I appreciate your help 

Regards,

Joe

Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I don't think you should be designing this around DIH. It was never planned
for complex scenarios, nor is it particularly fault tolerant, which you may
need.

Either use SolrJ or a third-party tool that integrates with Solr.

Regards,
     Alex