
Posted to solr-user@lucene.apache.org by matthew sporleder <ms...@gmail.com> on 2020/05/14 15:36:45 UTC

nested entities and DIH indexing time

It appears that adding entities to my entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the IDs are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).

Re: nested entities and DIH indexing time

Posted by Shawn Heisey <ap...@elyograg.org>.

On 5/14/2020 3:14 PM, matthew sporleder wrote:
> Can a non-nested entity write into existing docs, or do they always
> have to produce document-per-entity?

This is the only thing I found on this topic, and it is on a third-party
website, so I can't say much about how accurate it is:

https://stackoverflow.com/questions/21006045/can-solr-dih-do-atomic-updates

I have never used a ScriptTransformer, so I do not know how to actually 
do this.

Your schema would have to be compatible with atomic updates.
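
For readers unfamiliar with the term, here is a minimal Python sketch of
what an atomic update document looks like when sent to Solr as JSON.  The
document id "42" and the field name "tags" are hypothetical, and this
only builds the payload; it does not show how to emit it from DIH.  As a
general rule, atomic updates need the other fields of the document to be
stored or to have docValues, so that Solr can rebuild the full document.

```python
import json


def atomic_add(doc_id, field, values):
    """Build an atomic update that appends values to a multi-valued field.

    The "add" modifier is one of Solr's atomic update operations; the
    field and id used by callers here are made-up examples.
    """
    return {"id": doc_id, field: {"add": list(values)}}


# Solr's JSON update handler accepts a list of such documents.
payload = json.dumps([atomic_add("42", "tags", ["x", "y"])])
```

The unique key field ("id" here) is sent as a plain value, while every
other field carries a modifier such as "add" or "set".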

Thanks,
Shawn

Re: nested entities and DIH indexing time

Posted by matthew sporleder <ms...@gmail.com>.

On Thu, May 14, 2020 at 4:46 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 5/14/2020 9:36 AM, matthew sporleder wrote:
> > It appears that adding entities to my entities in my data import
> > config is slowing down my import process by a lot.  Is there a good
> > way to speed this up?  I see the IDs are individually queried instead
> > of using IN() or similar normal techniques to make things faster.
> >
> > Just looking for some tips.  I prefer this architecture to the way we
> > currently do it with complex SQL, inserting weird strings, and then
> > splitting on them (gross but faster).
>
> When you have nested entities, this is how DIH works.  A separate SQL
> query for the inner entity is made for each row returned on the outer
> entity.  Nested entities tend to be extremely slow for this reason.
>
> The best way to work around this is to make the database server do the
> heavy lifting -- using JOIN or other methods so that you only need one
> entity and one SQL query.  Doing this will mean that you'll need to
> split the data after import, using either the DIH config or the analysis
> configuration in the schema.
>
> Thanks,
> Shawn

This is too bad, because the nested-entity config is very clean and the
JOIN/CONCAT/SPLIT method is very gross.

I was also hoping to use different delta queries for each nested entity.

Can a non-nested entity write into existing docs, or do they always
have to produce document-per-entity?

Re: nested entities and DIH indexing time

Posted by Shawn Heisey <ap...@elyograg.org>.

On 5/14/2020 9:36 AM, matthew sporleder wrote:
> It appears that adding entities to my entities in my data import
> config is slowing down my import process by a lot.  Is there a good
> way to speed this up?  I see the IDs are individually queried instead
> of using IN() or similar normal techniques to make things faster.
> 
> Just looking for some tips.  I prefer this architecture to the way we
> currently do it with complex SQL, inserting weird strings, and then
> splitting on them (gross but faster).

When you have nested entities, this is how DIH works.  A separate SQL 
query for the inner entity is made for each row returned on the outer 
entity.  Nested entities tend to be extremely slow for this reason.
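
To make the cost concrete, here is a small Python sketch that mimics the
query pattern described above, using an in-memory SQLite database instead
of DIH.  The item and tag tables are hypothetical; the point is only that
the number of queries grows with the number of outer rows.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database with hypothetical item/tag tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag (item_id INTEGER, tag TEXT);
        INSERT INTO item VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO tag VALUES (1, 'x'), (1, 'y'), (2, 'z');
    """)
    return conn


def nested_import(conn):
    """Mimic a nested entity: one inner query per row of the outer query."""
    queries = 0
    docs = []
    outer = conn.execute("SELECT id, name FROM item ORDER BY id").fetchall()
    queries += 1
    for item_id, name in outer:
        rows = conn.execute(
            "SELECT tag FROM tag WHERE item_id = ? ORDER BY tag", (item_id,)
        ).fetchall()
        queries += 1
        docs.append({"id": item_id, "name": name, "tags": [r[0] for r in rows]})
    return docs, queries
```

With three outer rows this issues four queries in total (one outer plus
three inner); with a million outer rows it would issue a million and one.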

The best way to work around this is to make the database server do the 
heavy lifting -- using JOIN or other methods so that you only need one 
entity and one SQL query.  Doing this will mean that you'll need to 
split the data after import, using either the DIH config or the analysis 
configuration in the schema.
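
As a rough illustration of the single-query approach, the Python sketch
below lets the database concatenate the child values and then splits them
afterwards.  It uses SQLite's GROUP_CONCAT and hypothetical item and tag
tables; other databases have their own aggregate for this, and the "|"
separator is an assumption that only works if it never appears in a value.
In DIH itself the splitting step would more likely be done with something
like the RegexTransformer's splitBy attribute than in Python.

```python
import sqlite3

SEP = "|"  # assumed separator; it must never occur inside a tag value


def build_db():
    """Create a tiny in-memory database with hypothetical item/tag tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag (item_id INTEGER, tag TEXT);
        INSERT INTO item VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO tag VALUES (1, 'x'), (1, 'y'), (2, 'z');
    """)
    return conn


def joined_import(conn):
    """Fetch every item with its tags in one query, then split the tags."""
    sql = f"""
        SELECT i.id, i.name, GROUP_CONCAT(t.tag, '{SEP}')
        FROM item AS i
        LEFT JOIN tag AS t ON t.item_id = i.id
        GROUP BY i.id, i.name
        ORDER BY i.id
    """
    docs = []
    for item_id, name, joined in conn.execute(sql):
        # Items without tags come back as NULL (None) from the LEFT JOIN.
        # GROUP_CONCAT does not guarantee an order, so sort after splitting.
        tags = sorted(joined.split(SEP)) if joined else []
        docs.append({"id": item_id, "name": name, "tags": tags})
    return docs
```

Calling joined_import(build_db()) runs exactly one SQL query no matter how
many items exist, which is the whole benefit of this workaround.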

Thanks,
Shawn