
Posted to dev@lucene.apache.org by "Mark Waddle (JIRA)" <ji...@apache.org> on 2010/10/26 17:52:23 UTC

[jira] Created: (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

DIH DocBuilder - Improve perf. on large delta deletes
-----------------------------------------------------

                 Key: SOLR-2200
                 URL: https://issues.apache.org/jira/browse/SOLR-2200
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 1.4.1
            Reporter: Mark Waddle


collectDelta, the procedure that collects the PKs of the documents that should be updated or deleted for an entity, iterates over the entire deltaSet for every deleted document. This is very expensive when you are updating and deleting millions of documents in one delta-import.
Since the comparison between deleted and delta rows is on the PK, let's build the deltaSet as a HashMap keyed by PK instead of a HashSet, to enable quick key lookups and remove the need for repeated iterations.
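
To illustrate the idea (this is only a rough sketch, not the attached patch and not the actual DocBuilder code), the two approaches could look something like the following. The class name, method names, and the keyOf helper are hypothetical; rows are assumed to be Map<String, Object>, and normalizing the PK with String.valueOf is also an assumption.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class DeltaLookupSketch {

    // Hypothetical helper: normalize a row's primary key to a String so keys
    // coming from different queries compare consistently (an assumption).
    static String keyOf(Map<String, Object> row, String pk) {
        return String.valueOf(row.get(pk));
    }

    // Set-based approach: each deleted row triggers a full scan of the delta
    // rows, so the cost grows with (deleted rows x delta rows).
    static Set<String> survivorsByScan(Set<Map<String, Object>> deltaRows,
                                       Set<Map<String, Object>> deletedRows,
                                       String pk) {
        Set<Map<String, Object>> remaining = new HashSet<>(deltaRows);
        for (Map<String, Object> deleted : deletedRows) {
            String deletedKey = keyOf(deleted, pk);
            Iterator<Map<String, Object>> it = remaining.iterator();
            while (it.hasNext()) {
                if (keyOf(it.next(), pk).equals(deletedKey)) {
                    it.remove();
                }
            }
        }
        Set<String> keys = new HashSet<>();
        for (Map<String, Object> row : remaining) {
            keys.add(keyOf(row, pk));
        }
        return keys;
    }

    // Map-based approach: index the delta rows by PK once, then drop each
    // deleted PK with a single hash lookup instead of a scan.
    static Set<String> survivorsByLookup(Collection<Map<String, Object>> deltaRows,
                                         Collection<Map<String, Object>> deletedRows,
                                         String pk) {
        Map<String, Map<String, Object>> deltaByPk = new HashMap<>();
        for (Map<String, Object> row : deltaRows) {
            deltaByPk.put(keyOf(row, pk), row);
        }
        for (Map<String, Object> deleted : deletedRows) {
            deltaByPk.remove(keyOf(deleted, pk));
        }
        return new HashSet<>(deltaByPk.keySet());
    }
}
```

Both methods return the PKs of the delta rows that were not deleted. The second does one pass to build the map and one pass over the deleted rows, which is where the savings on large delta-imports would come from.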

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Mark Waddle (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Waddle updated SOLR-2200:
------------------------------

    Attachment: SOLR-2200.patch

Uploading patch to improve performance for delta-imports with a significant number of deletions.


[jira] Resolved: (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2200.
-------------------------------

    Resolution: Fixed

Committed revisions 1029325 (trunk), 1029328 (3x).

Thanks Mark!


[jira] [Commented] (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192590#comment-13192590 ] 

Robert Muir commented on SOLR-2200:
-----------------------------------

Click the Subversion Commits tab.
                
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


[jira] Updated: (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2200:
------------------------------

    Fix Version/s: 4.0
                   3.1
         Assignee: Robert Muir


[jira] [Commented] (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Mark Waddle (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192569#comment-13192569 ] 

Mark Waddle commented on SOLR-2200:
-----------------------------------

Hi Robert,

I apologize for my ignorance, but why can't I see these changes in the current dev/trunk? Am I looking in the wrong place?

Mark
                

[jira] Commented: (SOLR-2200) DIH DocBuilder - Improve perf. on large delta deletes

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925043#action_12925043 ] 

Robert Muir commented on SOLR-2200:
-----------------------------------

Mark, thanks for your contribution.

Seems like a no-brainer to me, and all tests pass with the patch.

I'd like to commit this unless anyone has objections.
