Posted to solr-user@lucene.apache.org by Srinivas Kashyap <sr...@bamboorose.com.INVALID> on 2020/05/22 16:39:45 UTC

Indexing huge data onto solr

Hi All,

We are running Solr 8.4.1. We have a database table with more than 100 million records. Until now we were using DIH to do a full-import on the tables, but for this table a full-import via DIH takes 3-4 days to complete and also consumes a fair bit of JVM memory while running.

Are there any faster/alternative ways to load data into this Solr core?

P.S.: Only the initial data import is a problem; further updates/additions to this core are done through SolrJ.

Thanks,
Srinivas

Re: Indexing huge data onto solr

Posted by Erick Erickson <er...@gmail.com>.
It Depends (tm). Often you can create a single (albeit perhaps complex)
SQL query that does this for you and just process the response.

I’ve also seen situations where it’s possible to hold one of the tables
in memory on the client and just use that rather than a separate query.

It depends on the characteristics of your particular database; your DBA
could probably help.
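
To make the in-memory idea concrete, here’s a rough sketch. The driver,
connection string, and table/column names are all made up, and it assumes
the child table fits in the client’s heap:

import java.sql.*;
import java.util.*;

public class ClientSideJoin {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db-host/mydb", "user", "pass")) {
            // One pass over the child table, keyed by parent id.
            Map<Long, List<String>> childrenByParent = new HashMap<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT parent_id, value FROM child")) {
                while (rs.next()) {
                    childrenByParent
                        .computeIfAbsent(rs.getLong("parent_id"),
                                         k -> new ArrayList<>())
                        .add(rs.getString("value"));
                }
            }
            // Stream the parent table and join in memory,
            // no per-row queries against the database.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id, name FROM parent")) {
                while (rs.next()) {
                    List<String> children = childrenByParent.getOrDefault(
                        rs.getLong("id"), Collections.emptyList());
                    // ... build one Solr document from the parent row + children ...
                }
            }
        }
    }
}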

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap <sr...@bamboorose.com.INVALID> wrote:
> 
> Hi Erick,
> 
> Thanks for the response. The link you provided works well if you have a single entity where you can join the tables and index the result. But in our scenario, we have nested entities joining different tables, as shown below:
> 
> db-data-config.xml:
> 
> <document>
> <parent entity>
>                <child entity> (table 1 join table 2)
>                <child entity> (table 3 join table 4)
>                <child entity> (table 5 join table 6)
>                <child entity> (table 7 join table 8)
> </parent entity>
> </document>
> 
> Do you have any recommendations for running multiple SQL queries and assembling the results into a single Solr document that can be sent over SolrJ for indexing?
> 
> Say the parent entity has 100 rows: should I iterate over each parent tuple and execute the child-entity SQL queries (with a WHERE condition on the parent key) to create one Solr document? Won’t that put more load on the database by executing more SQL queries? Is there an optimal solution?
> 
> Thanks,
> Srinivas


RE: Indexing huge data onto solr

Posted by Srinivas Kashyap <sr...@bamboorose.com.INVALID>.
Hi Erick,

Thanks for the response. The link you provided works well if you have a single entity where you can join the tables and index the result. But in our scenario, we have nested entities joining different tables, as shown below:

db-data-config.xml:

<document>
<parent entity>
                <child entity> (table 1 join table 2)
                <child entity> (table 3 join table 4)
                <child entity> (table 5 join table 6)
                <child entity> (table 7 join table 8)
</parent entity>
</document>

Do you have any recommendations for running multiple SQL queries and assembling the results into a single Solr document that can be sent over SolrJ for indexing?

Say the parent entity has 100 rows: should I iterate over each parent tuple and execute the child-entity SQL queries (with a WHERE condition on the parent key) to create one Solr document, roughly as in the sketch below? Won’t that put more load on the database by executing more SQL queries? Is there an optimal solution?
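
To illustrate what I mean (table and column names below are made up), this
is the one-child-query-per-parent pattern that worries me at 100M parents:

import java.sql.*;
import org.apache.solr.common.SolrInputDocument;

public class PerParentFetch {
    static void buildDocs(Connection conn) throws SQLException {
        try (PreparedStatement childPs = conn.prepareStatement(
                 "SELECT value FROM child1 WHERE parent_id = ?");
             Statement st = conn.createStatement();
             ResultSet parents = st.executeQuery(
                 "SELECT id, name FROM parent")) {
            while (parents.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", parents.getLong("id"));
                doc.addField("name", parents.getString("name"));
                // One extra round trip per parent row, per child entity.
                childPs.setLong(1, parents.getLong("id"));
                try (ResultSet children = childPs.executeQuery()) {
                    while (children.next()) {
                        doc.addField("child1_value",
                                     children.getString("value"));
                    }
                }
                // ... repeat for the other three child entities,
                // then send doc to Solr via SolrJ ...
            }
        }
    }
}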

Thanks,
Srinivas
From: Erick Erickson <er...@gmail.com>
Sent: 22 May 2020 22:52
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data onto solr

You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to CloudSolrClient.add(doclist…). If
that _still_ takes a long time, then your DB query is the root of the problem. Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test will tell you
where to go to try to speed things up.

Best,
Erick


Re: Indexing huge data onto solr

Posted by matthew sporleder <ms...@gmail.com>.
I can index 100M records (without nested entities, of course) in about
6-8 hours on a pretty low-powered machine using vanilla DIH -> MySQL,
so it is probably worth looking at why it is going slow before writing
your own indexer (which we are finally having to do).
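
For reference, my setup is just a plain JdbcDataSource (names below are
made up). The one setting that matters most for MySQL is batchSize="-1",
which makes the driver stream rows instead of buffering the whole result
set in memory:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://db-host/mydb"
              user="solr" password="***"
              batchSize="-1"/>
  <document>
    <entity name="item" query="SELECT id, name FROM item"/>
  </document>
</dataConfig>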

On Fri, May 22, 2020 at 1:22 PM Erick Erickson <er...@gmail.com> wrote:
>
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> It’s especially instructive to comment out just the call to CloudSolrClient.add(doclist…). If
> that _still_ takes a long time, then your DB query is the root of the problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above test will tell you
> where to go to try to speed things up.
>
> Best,
> Erick
>

Re: Indexing huge data onto solr

Posted by Erick Erickson <er...@gmail.com>.
You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to CloudSolrClient.add(doclist…). If
that _still_ takes a long time, then your DB query is the root of the problem. Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test will tell you
where to go to try to speed things up.
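
A minimal sketch of what I mean (the ZooKeeper address, collection name,
and batch size are placeholders, and the loop stands in for your DB reads):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zkhost:2181"),
                Optional.empty()).build()) {
            client.setDefaultCollection("mycollection");
            List<SolrInputDocument> batch = new ArrayList<>(1000);
            for (int i = 0; i < 100_000; i++) {   // stand-in for the DB loop
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                batch.add(doc);
                if (batch.size() >= 1000) {
                    client.add(batch);   // comment out to time the DB side alone
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);       // flush the final partial batch
            }
            client.commit();
        }
    }
}

Sending batches of a few hundred to a few thousand docs per add() call is
much faster than sending one document at a time.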

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap <sr...@bamboorose.com.INVALID> wrote:
> 
> Hi All,
> 
> We are running Solr 8.4.1. We have a database table with more than 100 million records. Until now we were using DIH to do a full-import on the tables, but for this table a full-import via DIH takes 3-4 days to complete and also consumes a fair bit of JVM memory while running.
> 
> Are there any faster/alternative ways to load data into this Solr core?
> 
> P.S.: Only the initial data import is a problem; further updates/additions to this core are done through SolrJ.
> 
> Thanks,
> Srinivas