Posted to solr-user@lucene.apache.org by Srinivas Kashyap <sr...@tradestonesoftware.com> on 2018/02/15 09:00:47 UTC

Solr performance issue

Hi,

I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the child entities in data-config.xml, and I use it for full-import only. When I started, I had written a delta-import query to index modified records. But my requirements grew, and I now have 17 child entities for a single parent entity. When running delta-import on large data sets, the number of requests made to the datasource (database) increased, and CPU utilization reached 100% when concurrent users started modifying the data. To work around this, instead of calling delta-import, which imports based on the last index time, I run a full-import (with 'SortedMapBackedCache') based on the last index time.

Although the parent entity query returns only the records that were modified, the child entity queries pull all the data from the database, and the indexing happens in memory, which causes the JVM to run out of memory.

Is there a way to make the child entity query pull only the records related to the parent entity in full-import mode?

Thanks and Regards,
Srinivas Kashyap

DISCLAIMER: 
E-mails and attachments from TradeStone Software, Inc. are confidential.
If you are not the intended recipient, please notify the sender immediately by
replying to the e-mail, and then delete it without making copies or using it
in any way. No representation is made that this email or any attachments are
free of viruses. Virus scanning is recommended and is the responsibility of
the recipient.

Re: Solr performance issue

Posted by Erick Erickson <er...@gmail.com>.
Srinivas:

Not an answer to your question, but when DIH starts getting this
complicated, I start to seriously think about SolrJ, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

In particular, it moves the heavy lifting of acquiring the data from a
Solr node (which I'm assuming also has to index docs) to "some
client". It also lets you play some tricks with the code to make
things faster.

Best,
Erick

Re: Solr performance issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/15/2018 2:00 AM, Srinivas Kashyap wrote:
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the child entities in data-config.xml, and I use it for full-import only. When I started, I had written a delta-import query to index modified records. But my requirements grew, and I now have 17 child entities for a single parent entity. When running delta-import on large data sets, the number of requests made to the datasource (database) increased, and CPU utilization reached 100% when concurrent users started modifying the data. To work around this, instead of calling delta-import, which imports based on the last index time, I run a full-import (with 'SortedMapBackedCache') based on the last index time.
>
> Although the parent entity query returns only the records that were modified, the child entity queries pull all the data from the database, and the indexing happens in memory, which causes the JVM to run out of memory.

Can you provide your DIH config file (with passwords redacted) and the
precise URL you are using to initiate dataimport?  Also, I would like to
know what field you have defined as your uniqueKey.  I may have more
questions about the data in your system, depending on what I see.

That cache implementation should only cache entries from the database
that are actually requested.  If your query is correctly defined, it
should not pull all records from the DB table.

> Is there a way to make the child entity query pull only the records related to the parent entity in full-import mode?

If I am understanding your question correctly, this is one of the fairly
basic things that DIH does.  Look at this config example in the
reference guide:

https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#configuring-the-dih-configuration-file

In the entity named feature in that example config, the query string
uses ${item.ID} to reference the ID column from the parent entity, which
is item.
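
As a minimal sketch of that pattern, adapted from the reference guide
example: the table names, column names, and the last_modified column
are illustrative assumptions, not taken from the config under
discussion, and the dataSource element is omitted.

```xml
<!-- Sketch only: table and column names are assumptions. -->
<document>
  <!-- Parent entity: selects only rows changed since the last import. -->
  <entity name="item"
          query="select * from item where last_modified &gt; '${dataimporter.last_index_time}'">
    <field column="NAME" name="name" />

    <!-- Child entity: ${item.ID} resolves to the ID column of the
         current parent row, so only that parent's rows are fetched. -->
    <entity name="feature"
            query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'">
      <field column="DESCRIPTION" name="features" />
    </entity>
  </entity>
</document>
```

Written this way, the child query is issued once per parent row, so
memory use stays small, but the number of database requests grows with
the number of parent rows and child entities.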

I should warn you that a cached entity does not always improve
performance.  This is particularly true if the lookup into the cache is
the information that goes to your uniqueKey field.  When the lookup is
by uniqueKey, every single row requested from the database will be used
exactly once, so there's not really any point to caching it.
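
For comparison, a cached child entity is usually written without the
per-parent where clause and joined to the parent through cacheKey and
cacheLookup. The sketch below reuses the hypothetical item and FEATURE
tables from the reference guide example; it is not the config under
discussion. In this form the child query is expected to run once and
its rows to be held in memory, so an unfiltered query would load the
whole table. Treat that as an assumption to verify against your Solr
version.

```xml
<!-- Sketch only: table and column names are assumptions. -->
<entity name="item"
        query="select * from item where last_modified &gt; '${dataimporter.last_index_time}'">

  <!-- Cached child: rows are keyed by ITEM_ID (cacheKey) and looked up
       with the parent's ID column (cacheLookup). -->
  <entity name="feature"
          processor="SqlEntityProcessor"
          cacheImpl="SortedMapBackedCache"
          cacheKey="ITEM_ID"
          cacheLookup="item.ID"
          query="select ITEM_ID, DESCRIPTION from FEATURE">
    <field column="DESCRIPTION" name="features" />
  </entity>
</entity>
```

One option, untested here, is to restrict the cached child query with
the same last-modified condition (for example through a join against
the parent table) so that only rows belonging to changed parents are
loaded into the cache.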

Thanks,
Shawn