Posted to issues@subversion.apache.org by "Julian Foad (JIRA)" <ji...@apache.org> on 2017/01/18 15:31:26 UTC
[jira] [Comment Edited] (SVN-4669) Merge with much subtree mergeinfo takes hours

    [ https://issues.apache.org/jira/browse/SVN-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828274#comment-15828274 ] 

Julian Foad edited comment on SVN-4669 at 1/18/17 3:31 PM:
-----------------------------------------------------------

The pattern I am seeing is:
in find_reintegrate_merge() > calculate_left_hand_side()
* get_history_as_mergeinfo_catalog()
  queries entire history of each *target* subtree-with-mergeinfo
* find_unmerged_mergeinfo()
  queries entire history of each *source* subtree-with-mergeinfo

Then later, within do_mergeinfo_aware_dir_merge():
* reporter->finish_report() > filter_self_referential_mergeinfo()
  queries entire history of each *target* subtree-with-mergeinfo
* record_mergeinfo_for_dir_merge()
  queries *partial* history of each *source* subtree-with-mergeinfo
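
To put rough numbers on this pattern, here is a back-of-envelope sketch in Python. It assumes each of the four call sites listed above issues one synchronous query per subtree with no overlap, and it borrows the example figures from the issue description (3600 subtrees, 0.25 second per query). The function and constant names are illustrative only and are not Subversion APIs.

```python
# Back-of-envelope cost model; all names are illustrative, not Subversion APIs.

# The four query sites from the pattern above, assumed one query per subtree.
QUERY_SITES = [
    "get_history_as_mergeinfo_catalog",   # entire history, target subtrees
    "find_unmerged_mergeinfo",            # entire history, source subtrees
    "filter_self_referential_mergeinfo",  # entire history, target subtrees
    "record_mergeinfo_for_dir_merge",     # partial history, source subtrees
]


def total_query_seconds(subtrees, seconds_per_query, sites=QUERY_SITES):
    """Total time spent in synchronous round trips, assuming no overlap."""
    return subtrees * len(sites) * seconds_per_query


# Example figures from the issue description: 3600 subtrees, 0.25 s per query.
estimate = total_query_seconds(3600, 0.25)
print(estimate / 3600, "hour(s)")
```

Under this model the cost is linear in both the number of subtrees and the number of query sites, so removing any one site would save about a quarter of the round-trip time.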

The distinction I wanted to make between approaches (1) and (2) is that by (1) "make the queries ... efficient" I mean using general resource management techniques within the client merge code and/or at lower levels; whereas by (2) "analyse ... eliminate" I mean analysing the higher-level merge code in detail to determine which of these queries it really needs in which cases, and restructuring it so it doesn't make the queries that it doesn't really need.

Approach (1) examples include:
* remembering the results of a previous query for later re-use,
* considering whether a different query (such as 'log') could obtain the histories of many paths at once more efficiently than querying each path separately,
  and possibly even
* adding new protocol elements and server-side processing to better support the query pattern.
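
A minimal sketch of the first example, remembering results for later re-use, might look like the following Python. The `fetch_segments` callable stands in for a server round trip such as svn_ra_get_location_segments(); its signature here, the `SegmentCache` class and the `fake_fetch` helper are all assumptions made for illustration and are not part of Subversion.

```python
# Illustrative only: SegmentCache, fetch_segments and fake_fetch are
# hypothetical names, not Subversion APIs.

class SegmentCache:
    """Remember location-segment results keyed by the query arguments."""

    def __init__(self, fetch_segments):
        self._fetch = fetch_segments  # stand-in for the server round trip
        self._results = {}
        self.round_trips = 0          # how many times the server was asked

    def get(self, path, peg_rev, start_rev, end_rev):
        key = (path, peg_rev, start_rev, end_rev)
        if key not in self._results:
            self.round_trips += 1
            self._results[key] = self._fetch(path, peg_rev, start_rev, end_rev)
        return self._results[key]


def fake_fetch(path, peg_rev, start_rev, end_rev):
    # Pretend the path has lived at one location for the whole range.
    return [(end_rev, start_rev, path)]


cache = SegmentCache(fake_fetch)
first = cache.get("/trunk/sub", 100, 100, 0)
second = cache.get("/trunk/sub", 100, 100, 0)  # served from memory
print(cache.round_trips)
```

A cache like this only helps when the same arguments recur, which may be the case for the two entire-history queries on each target subtree in the pattern above.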

Approach (2) examples include:
* determining whether all of the results will eventually be needed and, if not, delaying the queries until needed so that only the needed subset of them is actually performed,
* storing the first entire-source-histories result and changing the later partial-source-histories query to operate on that stored result.
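
The second example can be sketched in Python as follows, assuming the partial history is a pure sub-range of an entire history that has already been stored. Segments are modelled here as (range_start, range_end, path) tuples with inclusive revision ranges; that model, the `restrict_segments` helper and the sample paths are assumptions made for illustration, not Subversion's data structures.

```python
# Illustrative only: segments are modelled as (range_start, range_end, path)
# tuples with inclusive revision ranges; this is not Subversion's data model.

def restrict_segments(segments, oldest_rev, youngest_rev):
    """Clip stored whole-history segments to [oldest_rev, youngest_rev]."""
    clipped = []
    for range_start, range_end, path in segments:
        start = max(range_start, oldest_rev)
        end = min(range_end, youngest_rev)
        if start <= end:  # keep only segments that overlap the range
            clipped.append((start, end, path))
    return clipped


# Whole history fetched once; in this made-up example the subtree moved in r51.
full_history = [(1, 50, "/branches/old/sub"), (51, 120, "/trunk/sub")]

# The later partial-history need is answered locally, with no new query.
partial = restrict_segments(full_history, 40, 60)
print(partial)
```

Whether the real partial-history query can be answered this way depends on it using the same path and peg revision as the stored entire-history result, which would need checking in the merge code.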


> Merge with much subtree mergeinfo takes hours
> ---------------------------------------------
>
>                 Key: SVN-4669
>                 URL: https://issues.apache.org/jira/browse/SVN-4669
>             Project: Subversion
>          Issue Type: Bug
>    Affects Versions: 1.9.5
>            Reporter: Julian Foad
>              Labels: performance
>
> When there is explicit mergeinfo on thousands of subtrees, a merge at the subtree root can take several hours.
> The merge code makes multiple (4 in my test) synchronous svn_ra_get_location_segments() queries to the server for each subtree with mergeinfo. If, for example, 3600 subtrees have mergeinfo and each query takes 0.25 second, that adds up to an hour.
> This is related to but different from the memory usage issue SVN-4667. The results of most of these queries are stored temporarily and account for only 10% of the total memory used in my test.
> Possible approaches to improving the merge code are: (1) make the queries much more efficient; and (2) analyse how the results are used and eliminate unnecessary queries.
> The work-flow approach to improving the experience is: get the number of subtrees with mergeinfo down to none or very few.
> (WANdisco's internal issue id: SVNB-1952.)


