Posted to dev@lucene.apache.org by "Amrit Sarkar (JIRA)" <ji...@apache.org> on 2018/10/11 17:48:00 UTC

[jira] [Updated] (SOLR-12854) Document steps to improve delta import via DataImportHandler

     [ https://issues.apache.org/jira/browse/SOLR-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated SOLR-12854:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Document steps to improve delta import via DataImportHandler 
> -------------------------------------------------------------
>
>                 Key: SOLR-12854
>                 URL: https://issues.apache.org/jira/browse/SOLR-12854
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - DataImportHandler
>    Affects Versions: 7.5
>            Reporter: Amrit Sarkar
>            Priority: Major
>
> Delta imports in DataImportHandler are sometimes slower than full imports, because a delta import issues many more queries than a full import does, which increases its running time. This is described at https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> On the mailing list (http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html), a Solr user described a workaround that reportedly works well and improves delta import performance: specify ${dataimporter.last_index_time} in deltaImportQuery rather than in deltaQuery.
> {code}
> I found a hacky way to limit the number of times deltaImportQuery was executed.
> As designed, Solr executes deltaQuery to get a list of ids that need to be indexed. For each of those, it executes deltaImportQuery, which is typically very similar to the full query.
> I constructed a deltaQuery to purposely only return 1 row. E.g.
> deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for Oracle, likely requires a different syntax for other dbs. Also, it occurred to me you could probably include the date >= '${dataimporter.last_index_time}' filter here so this returns 0 rows if no data has changed
> Since deltaImportQuery now only gets called once, I needed to add the filter logic to deltaImportQuery to only select the changed rows (that logic is normally in deltaQuery). E.g.
> deltaImportQuery = [normal import query] WHERE date >= '${dataimporter.last_index_time}'
> {code}
> A number of other users have adopted this strategy and seen DIH delta import performance improve, so documenting it as a TIP will help other users too.
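> As a rough illustration of the strategy, a data-config.xml entity might look like the sketch below. The table "item", its columns id, name and last_modified, and the Oracle connection settings are hypothetical placeholders; only the attribute names (query, deltaQuery, deltaImportQuery) and ${dataimporter.last_index_time} come from DIH and the quoted message. The date comparison may need a database-specific conversion (for example TO_DATE on Oracle), which is omitted here.
> {code:xml}
> <dataConfig>
>   <!-- Hypothetical Oracle connection; replace driver, url and credentials. -->
>   <dataSource type="JdbcDataSource"
>               driver="oracle.jdbc.OracleDriver"
>               url="jdbc:oracle:thin:@//localhost:1521/XE"
>               user="solr" password="secret"/>
>   <document>
>     <!--
>       deltaQuery returns at most one row, so deltaImportQuery runs at most once.
>       The date filter in deltaQuery makes it return zero rows when nothing changed.
>       deltaImportQuery carries the change filter itself instead of selecting
>       one document per id returned by deltaQuery.
>     -->
>     <entity name="item" pk="id"
>             query="SELECT id, name, last_modified FROM item"
>             deltaQuery="SELECT id FROM item
>                         WHERE last_modified &gt;= '${dataimporter.last_index_time}'
>                         AND rownum = 1"
>             deltaImportQuery="SELECT id, name, last_modified FROM item
>                               WHERE last_modified &gt;= '${dataimporter.last_index_time}'"/>
>   </document>
> </dataConfig>
> {code}
> With a configuration along these lines, a delta import should issue two SQL queries in total rather than one deltaImportQuery per changed row. The "rownum = 1" clause is Oracle syntax; other databases would need their own row-limiting construct.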



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org